Skip to content

Commit f38fea9

Browse files
authored
stop using xml sitemap (#517)
* Add sitemap generating machinery * Update site scanner to use sitemap.json file * Remove DMS ping * Fix scan output including sitemap.json
1 parent 678ed70 commit f38fea9

10 files changed

+120
-88
lines changed

.github/workflows/mozorg-site-scanning.yaml

+33-5
Original file line numberDiff line numberDiff line change
@@ -25,14 +25,32 @@ env:
2525
USER_AGENT: Mozilla/5.0 (Automated; https://github.com/mozmeao/www-site-checker) CheckerBot/1.0
2626

2727
jobs:
28+
build_sitemap_json:
29+
name: Build sitemap.json file
30+
runs-on: ubuntu-latest
31+
steps:
32+
- uses: actions/checkout@v4
33+
- name: Generate sitemap
34+
run: sitemap/generate_sitemap_docker.sh
35+
- name: Upload sitemap to cache
36+
uses: actions/upload-artifact@v4
37+
with:
38+
name: sitemap_json
39+
path: sitemap/sitemap.json
2840
run_on_cdn:
41+
needs: build_sitemap_json
2942
strategy:
3043
matrix:
3144
batch: ["1_8", "2_8", "3_8", "4_8", "5_8", "6_8", "7_8", "8_8"]
3245
runs-on: ubuntu-latest
3346
name: Check main CDN - batch ${{ matrix.batch }}
3447
steps:
3548
- uses: actions/checkout@v4
49+
- name: Download sitemap
50+
uses: actions/download-artifact@v4
51+
with:
52+
name: sitemap_json
53+
path: sitemap/
3654
- uses: actions/setup-python@v5
3755
with:
3856
python-version: "3.11"
@@ -41,7 +59,7 @@ jobs:
4159
- name: Run checker
4260
run: >
4361
python bin/scan_site.py
44-
--sitemap-url="https://${{ secrets.MOZORG_CDN_HOSTNAME }}/sitemap.xml"
62+
--hostname="${{ secrets.MOZORG_CDN_HOSTNAME }}"
4563
--batch="${{ matrix.batch }}"
4664
--export-cache
4765
- name: Upload scan output files
@@ -58,13 +76,19 @@ jobs:
5876
retention-days: 5
5977

6078
run_on_us_origin:
79+
needs: build_sitemap_json
6180
strategy:
6281
matrix:
6382
batch: ["1_8", "2_8", "3_8", "4_8", "5_8", "6_8", "7_8", "8_8"]
6483
runs-on: ubuntu-latest
6584
name: Check US origin server - batch ${{ matrix.batch }}
6685
steps:
6786
- uses: actions/checkout@v4
87+
- name: Download sitemap
88+
uses: actions/download-artifact@v4
89+
with:
90+
name: sitemap_json
91+
path: sitemap/
6892
- uses: actions/setup-python@v5
6993
with:
7094
python-version: "3.11"
@@ -73,8 +97,7 @@ jobs:
7397
- name: Run checker
7498
run: >
7599
python bin/scan_site.py
76-
--sitemap-url="https://${{ secrets.MOZORG_US_ORIGIN_HOSTNAME }}/sitemap.xml"
77-
--maintain-hostname
100+
--hostname="${{ secrets.MOZORG_US_ORIGIN_HOSTNAME }}"
78101
--batch="${{ matrix.batch }}"
79102
- name: Upload scan output files
80103
uses: actions/upload-artifact@v4
@@ -84,13 +107,19 @@ jobs:
84107
retention-days: 15
85108

86109
run_on_eu_origin:
110+
needs: build_sitemap_json
87111
strategy:
88112
matrix:
89113
batch: ["1_8", "2_8", "3_8", "4_8", "5_8", "6_8", "7_8", "8_8"]
90114
runs-on: ubuntu-latest
91115
name: Check EU origin server - batch ${{ matrix.batch }}
92116
steps:
93117
- uses: actions/checkout@v4
118+
- name: Download sitemap
119+
uses: actions/download-artifact@v4
120+
with:
121+
name: sitemap_json
122+
path: sitemap/
94123
- uses: actions/setup-python@v5
95124
with:
96125
python-version: "3.11"
@@ -99,8 +128,7 @@ jobs:
99128
- name: Run checker
100129
run: >
101130
python bin/scan_site.py
102-
--sitemap-url="https://${{ secrets.MOZORG_EU_ORIGIN_HOSTNAME }}/sitemap.xml"
103-
--maintain-hostname
131+
--hostname="${{ secrets.MOZORG_EU_ORIGIN_HOSTNAME }}"
104132
--batch="${{ matrix.batch }}"
105133
- name: Upload scan output files
106134
uses: actions/upload-artifact@v4

.gitignore

+1
Original file line numberDiff line numberDiff line change
@@ -14,3 +14,4 @@ venv
1414
TODO.txt
1515
output/*
1616
page_cache/*
17+
sitemap/sitemap.json

README.md

+10-8
Original file line numberDiff line numberDiff line change
@@ -44,11 +44,14 @@ You can also clone this repo to your local machine and run it there. This is the
4444
```bash
4545
git clone [email protected]:mozmeao/www-site-checker.git
4646
cd www-site-checker
47+
# you need docker setup for this step
48+
# it creates a sitemap.json file in the sitemap dir
49+
./sitemap/generate_sitemap_docker.sh
4750
# make a virtualenv, with whatever you prefer, and activate it
4851
pip install -r requirements.txt
4952
export ALLOWLIST_FILEPATH=data/allowlist-mozorg.yaml
5053
export EXTRA_URLS_FILEPATH=data/extra-urls-mozorg.yaml
51-
python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml
54+
python bin/scan_site.py --hostname=www.mozilla.org
5255
```
5356

5457
The above will start to work through the entire sitemap (and child sitemaps) at that URL
@@ -57,7 +60,7 @@ If you only want to check a smaller batch of URLs (handy in development), add th
5760

5861
```bash
5962
# Only inspect batch three of all URLs to check, after slicing the site into 40 batches
60-
python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml --batch=3:40
63+
python bin/scan_site.py --hostname=www.mozilla.org --batch=3:40
6164
```
6265

6366
And if you only want to check a specific page, you use the `--specific-url` param. The following, for example, checks the homepage and a Fx mobile downbload page
@@ -69,27 +72,27 @@ python bin/scan_site.py --specific-url=https://www.mozilla.org/en-US/firefox/bro
6972
There is a default allowlist in use (`data/allowlist-mozorg.yaml` - **set via env vars**) but an alernative can be passed via the `--allowlist` param
7073

7174
```bash
72-
python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml --allowlist=/path/to/custom/allowlist.yaml
75+
python bin/scan_site.py --hostname=www.mozilla.org --allowlist=/path/to/custom/allowlist.yaml
7376
```
7477

75-
If you want or need to check a site whose sitemap points to a _different_ domain (eg you want to check an origin server whose sitemap is hard-coded to refer to the CDN domain, or a localhost setup) you should ensure the server is listed as an option in the allowlist and also pass the `--maintain-hostname` parameter.
78+
If you want or need to check a site whose sitemap points to a _different_ domain (eg you want to check an origin server whose sitemap is hard-coded to refer to the CDN domain, or a localhost setup) you should ensure the server is listed as an option in the allowlist.
7679

7780
For example:
7881

7982
```bash
80-
python bin/scan_site.py --sitemap-url=http://origin-server.example.com/sitemap.xml --maintain-hostname
83+
python bin/scan_site.py --hostname=origin-server.example.com
8184
```
8285

8386
or, for localhost
8487

8588
```bash
86-
python bin/scan_site.py --sitemap-url=http://localhost:8000/sitemap.xml --maintain-hostname
89+
python bin/scan_site.py --hostname=localhost:8000
8790
```
8891

8992
If you want to test the Sentry integration locally, you can pass a Sentry DSN as an environment variable. Here, we're passing a URL to [Kent - a local 'fake Sentry'](https://github.com/willkg/kent)
9093

9194
```bash
92-
SENTRY_DSN=http://[email protected]:8011/1 python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml
95+
SENTRY_DSN=http://[email protected]:8011/1 python bin/scan_site.py --hostname=www.mozilla.org
9396
```
9497

9598
## The output files
@@ -125,4 +128,3 @@ If you come across an alert saying there was an unexpected URL detected and you'
125128
* Edit the allowlist - eg `data/allowlist-mozorg.yaml`
126129
* Run the checks locally (see above)
127130
* Push the branch up and raise a PR.
128-

bin/handle_site_scan_output.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -78,7 +78,7 @@ def _assemble_results(output_path: str) -> set:
7878

7979
# assume we have multiple output files, all scanning different sources of the website data
8080
for filename in os.listdir(output_path):
81-
if filename.endswith(".json"):
81+
if filename.startswith("unexpected_urls_for") and filename.endswith(".json"):
8282
unexpected_url_data.update(
8383
_load_results_json(os.path.join(output_path, filename))
8484
)

bin/scan_site.py

+30-74
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@
2020
import time
2121
from collections import defaultdict
2222
from functools import cache
23+
from pathlib import Path
2324
from typing import Dict, Iterable, List, Tuple
2425
from urllib.parse import quote, urlparse
2526

@@ -50,6 +51,7 @@
5051
integrations=[sentry_logging],
5152
)
5253

54+
DEFAULT_SITEMAP_FILENAME = "sitemap/sitemap.json"
5355
DEFAULT_BATCH__NOOP = "1_1" # By default treat all URLs as a single batch
5456
UNEXPECTED_URLS_FILENAME_FRAGMENT = "unexpected_urls_for"
5557
URL_RETRY_LIMIT = 3
@@ -63,15 +65,14 @@
6365

6466
@click.command()
6567
@click.option(
66-
"--sitemap-url",
67-
default=None,
68-
help="URL of an XML sitemap to use as source data",
68+
"--hostname",
69+
required=True,
70+
help="Hostname of the site instance to scan",
6971
)
7072
@click.option(
71-
"--maintain-hostname",
72-
default=False,
73-
is_flag=True,
74-
help="If the sitemap points to a different domain (eg a CDN domain), override it and replace it with the hostname that served the sitemap",
73+
"--sitemap-filename",
74+
default=DEFAULT_SITEMAP_FILENAME,
75+
help="Filename of a JSON sitemap file",
7576
)
7677
@click.option(
7778
"--specific-url",
@@ -104,8 +105,8 @@
104105
help="If True, we'll export the cached pages as an artifact to {hostname}-cached-pages/batch{batch-id}, for other checks to use",
105106
)
106107
def run_checks(
107-
sitemap_url: str,
108-
maintain_hostname: bool,
108+
sitemap_filename: str,
109+
hostname: str,
109110
specific_url: Iterable,
110111
batch: str,
111112
allowlist: str,
@@ -114,24 +115,20 @@ def run_checks(
114115
) -> None:
115116
# Let's tidy up that variables we get from the input option
116117
specific_urls = specific_url
118+
sitemap_path = Path(sitemap_filename)
117119

118-
if not sitemap_url and not specific_urls:
120+
if not sitemap_path.exists() and not specific_urls:
119121
raise Exception("No sitemap or input URLs specified. Cannot proceed.")
120122

121-
host_url = (
122-
sitemap_url or specific_urls[0]
123-
) # TODO: ensure all specific URLs use the same hostname
124-
hostname = urlparse(host_url).netloc
125-
126123
allowlist_config = _get_allowlist_config(
127124
hostname,
128125
allowlist_pathname=allowlist,
129126
)
130127

131128
urls_to_check = _build_urls_to_check(
132-
sitemap_url=sitemap_url,
129+
sitemap_path=sitemap_path,
130+
hostname=hostname,
133131
specific_urls=specific_url,
134-
maintain_hostname=maintain_hostname,
135132
)
136133

137134
if additional_urls_file:
@@ -437,80 +434,39 @@ def _check_pages_for_outbound_links(urls: List[str], allowlist_config: Dict) ->
437434

438435

439436
def _build_urls_to_check(
440-
sitemap_url: str,
437+
sitemap_path: Path,
438+
hostname: str,
441439
specific_urls: Iterable,
442-
maintain_hostname: bool,
443440
) -> List[str]:
444441
"""Given a sitemap URL and/or specific URLs to check, put together a list
445442
of overall URLs whose content wen want to check"""
446443

447444
urls = []
448-
if sitemap_url:
449-
urls += _get_urls_from_sitemap(sitemap_url, maintain_hostname)
445+
if sitemap_path:
446+
urls += _get_urls_from_json_file(sitemap_path, hostname)
450447
if specific_urls:
451448
# Don't forget any manually specified URLs
452449
urls += specific_urls
453450
click.echo(f"Discovered {len(urls)} URLs to check")
454451
return urls
455452

456453

457-
def _get_urls_from_sitemap(
458-
sitemap_url: str,
459-
maintain_hostname: bool,
454+
def _get_urls_from_json_file(
455+
sitemap_path: Path,
456+
hostname: str,
460457
) -> List[str]:
461-
"""Extract URLs to explore from a sitemap, optionally ensuring the hostname in
462-
any URLs found is swapped ('maintained') to be the same as that of the source
463-
sitemap –- this is needed when checking an origin server whose sitemap returns
464-
the CDN's hostname"""
465-
466458
urls = []
459+
with sitemap_path.open() as fh:
460+
sitemap = json.load(fh)
467461

468-
_parsed_origin_sitemap_url = urlparse(sitemap_url)
469-
origin_hostname_with_scheme = (
470-
f"{_parsed_origin_sitemap_url.scheme}://{_parsed_origin_sitemap_url.netloc}" # noqa E231
471-
)
462+
for path, locales in sitemap.items():
463+
if not locales:
464+
urls.append(f"https://{hostname}{path}")
465+
continue
466+
467+
for locale in locales:
468+
urls.append(f"https://{hostname}/{locale}{path}")
472469

473-
resp = _get_url_with_retry(sitemap_url)
474-
475-
sitemap_xml = resp.text
476-
soup = BeautifulSoup(sitemap_xml, "lxml")
477-
478-
# Look for a <sitemap> node, and get each as a URL for a locale-specific sitemap
479-
sitemap_nodes = soup.find_all("sitemap")
480-
if len(sitemap_nodes):
481-
click.echo(f"Discovered {len(sitemap_nodes)} child sitemaps")
482-
483-
for sitemap_node in sitemap_nodes:
484-
sitemap_url = sitemap_node.loc.text
485-
486-
if maintain_hostname:
487-
sitemap_url = _update_hostname(
488-
origin_hostname_with_scheme=origin_hostname_with_scheme,
489-
urls=[sitemap_url],
490-
)[0]
491-
492-
click.echo(f"Diving into {sitemap_url}")
493-
urls.extend(_get_urls_from_sitemap(sitemap_url, maintain_hostname))
494-
495-
# look for regular URL nodes, which may or may not co-exist alongside sitemap nodes
496-
url_nodes = soup.find_all("url")
497-
if url_nodes:
498-
click.echo(f"Adding {len(url_nodes)} URLs")
499-
for url in url_nodes:
500-
try:
501-
urls.append(url.loc.text)
502-
except AttributeError as ae:
503-
sentry_sdk.capture_message(
504-
f"URL node {url} missing '<loc>' - exception to follow"
505-
)
506-
sentry_sdk.capture_exception(ae)
507-
508-
# Also remember to update the hostname on the final set of URLs, if required
509-
if maintain_hostname:
510-
urls = _update_hostname(
511-
origin_hostname_with_scheme=origin_hostname_with_scheme,
512-
urls=urls,
513-
)
514470
return urls
515471

516472

sitemap/.bedrock.env

+9
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
DEBUG=False
2+
DEV=False
3+
ALLOWED_HOSTS=*
4+
AWS_DB_S3_BUCKET=bedrock-db-prod
5+
STATIC_URL=https://www.mozilla.org/media/
6+
DJANGO_SETTINGS_MODULE=bedrock.settings
7+
GRANIAN_BACKPRESSURE=64
8+
GRANIAN_WORKERS=4
9+
GRANIAN_BLOCKING_THREADS=4

sitemap/Dockerfile

+8
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
FROM mozmeao/bedrock:prod-latest
2+
USER root
3+
COPY settings_local.py bedrock/settings/local.py
4+
COPY run-generator.sh ./
5+
ARG USER_ID=1000:1000
6+
ENV USER_ID=${USER_ID}
7+
RUN chown -R "${USER_ID}" /app
8+
USER ${USER_ID}

sitemap/generate_sitemap_docker.sh

+15
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
#!/bin/bash
2+
3+
set -exo pipefail
4+
5+
cd sitemap
6+
7+
docker build --pull \
8+
-t sitemap-generator \
9+
--build-arg "USER_ID=$(id -u):$(id -g)" \
10+
.
11+
12+
docker run --rm \
13+
--env-file .bedrock.env \
14+
-v "$PWD:/app/sitemap-data" \
15+
sitemap-generator ./run-generator.sh

sitemap/run-generator.sh

+6
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
#!/bin/bash
2+
set -ex
3+
bin/run-db-download.py
4+
python manage.py l10n_update
5+
python manage.py update_sitemaps
6+
cp /app/root_files/sitemap.json /app/sitemap-data/

0 commit comments

Comments
 (0)