first commit
commit 06a29f4640
4 changed files with 518 additions and 0 deletions

README.md · Normal file · 101 lines
@@ -0,0 +1,101 @@
# 🌐 Website Downloader CLI

[CI: python-app](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
[CI: lint](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
[Dependency graph](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission)
[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/)
[Code style: black](https://github.com/psf/black)

Website Downloader CLI is a **tiny, pure-Python** site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:
* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks

> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.

---

## 🚀 Quick Start

```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader

# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt

# 3. Mirror a site – no prompts needed
python website-downloader.py \
  --url https://harsim.ca \
  --destination harsim_ca_backup \
  --max-pages 100 \
  --threads 8
```

---

## 🛠️ Libraries Used

| Library | Emoji | Purpose in this project |
|---------|-------|-------------------------|
| **requests** + **urllib3.Retry** | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts |
| **BeautifulSoup (bs4)** | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| **argparse** | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| **logging** | 📝 | Dual console / file logging with crawl-time stats |
| **threading** & **queue** | ⚙️ | Lightweight thread-pool that streams images/CSS/JS concurrently |
| **pathlib** & **os** | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| **time** | ⏱️ | Measures per-page latency and total crawl duration |
| **urllib.parse** | 🔗 | Safely joins / analyses URLs and rewrites them to local relative paths |
| **sys** | 🖥️ | Directs log output to `stdout` and handles graceful interrupts (`Ctrl-C`) |
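
The retry / back-off behaviour in the first row is a standard `requests` + `urllib3.Retry` pattern; the snippet below is a condensed excerpt of the session setup in `website-downloader.py` (shown in full further down), not a separate API:

```python
# Condensed from website-downloader.py: one shared Session with retry/back-off.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

session = requests.Session()
retry = Retry(
    total=5,                                     # retry transient failures up to 5 times
    backoff_factor=0.5,                          # exponential back-off between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # typical "try again later" statuses
    allowed_methods=["GET", "HEAD"],
)
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))
```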

## 🗂️ Project Structure

| Path | What it is | Key features |
|------|------------|--------------|
| `website-downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. | |
| `web_scraper.log` | Auto-generated run log (appended to on each invocation). Useful for troubleshooting or audit trails. | |
| `README.md` | The document you’re reading – quick-start, flags, and architecture notes. | |
| *(output folder)* | Created at runtime (`example_com/…`) – mirrors the remote directory tree with `index.html` stubs and all static assets. | |

> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
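
The link-rewriting behaviour in the table boils down to a small URL-to-file mapping. This is an illustrative, stripped-down version of what `to_local_path()` in `website-downloader.py` does (the real function also hashes query strings and over-long names):

```python
# Illustration only – the full rule lives in to_local_path() in website-downloader.py.
from pathlib import Path
from urllib.parse import urlparse

def local_file(url: str) -> Path:
    rel = urlparse(url).path.lstrip("/")
    if not rel or rel.endswith("/"):
        rel += "index.html"      # pretty URLs become <folder>/index.html
    elif not Path(rel).suffix:
        rel += ".html"           # extensionless paths become <name>.html
    return Path(rel)

print(local_file("https://example.com/"))         # index.html
print(local_file("https://example.com/about/"))   # about/index.html
print(local_file("https://example.com/contact"))  # contact.html
```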

## ✨ Recent Improvements

✅ Type Conversion Fix
Fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.

✅ Safer Path Handling
Added intelligent path shortening and hashing for long filenames to prevent
`OSError: [Errno 36] File name too long` errors.
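
The shortening logic lives in `_shorten_segment()` inside `website-downloader.py`; a trimmed-down sketch of the idea:

```python
# Trimmed-down sketch of _shorten_segment() from website-downloader.py.
from hashlib import sha256
from pathlib import Path

MAX_SEG_LEN = 120  # stay well under the ~255-byte limit of most filesystems

def shorten(segment: str, limit: int = MAX_SEG_LEN) -> str:
    if len(segment) <= limit:
        return segment
    p = Path(segment)
    digest = sha256(segment.encode("utf-8")).hexdigest()[:12]
    keep = max(0, limit - len(p.suffix) - 13)  # room for '-' + 12-char hash + suffix
    return f"{p.stem[:keep]}-{digest}{p.suffix}"

print(shorten("a" * 300 + ".html"))  # 120-char name ending in '-<hash>.html'
```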

✅ Improved CLI Experience
Rebuilt argument parsing with `argparse` for cleaner syntax and validation.

✅ Code Quality & Linting
Applied Black + Flake8 formatting; the project now passes all CI lint checks.

✅ Logging & Stability
Improved error handling, logging, and fallback mechanisms for failed writes.

✅ Skip Non-Fetchable Schemes
The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them.
This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in saved HTML.
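
Under the hood this is a simple scheme filter applied before anything is queued for download (see `is_non_fetchable()` and `is_httpish()` in `website-downloader.py`):

```python
# Same scheme filter the crawler applies before queuing a download.
from urllib.parse import urlparse

NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}

def should_fetch(link: str) -> bool:
    scheme = urlparse(link).scheme
    if scheme in NON_FETCHABLE_SCHEMES:      # mailto:, tel:, javascript:, data:, ...
        return False
    return scheme in ("http", "https", "")   # relative links have no scheme

print(should_fetch("mailto:hi@example.com"))  # False – left untouched in the HTML
print(should_fetch("/assets/logo.png"))       # True
```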

## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## 📜 License

This project is licensed under the MIT License.

## ❤️ Support This Project

[Donate via PayPal](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

downloadsite.sh · Executable file · 7 lines
@@ -0,0 +1,7 @@
#!/bin/bash
source /usr/local/python/website-downloader/.venv/bin/activate
python /usr/local/python/website-downloader/website-downloader.py \
  --url "$1" \
  --destination "$2" \
  --max-pages 100 \
  --threads 8

requirements.txt · Normal file · 4 lines
@@ -0,0 +1,4 @@
requests~=2.32.4
beautifulsoup4~=4.13.4
wget~=3.2
urllib3~=2.5.0

website-downloader.py · Executable file · 406 lines
@@ -0,0 +1,406 @@
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import logging
import os
import queue
import sys
import threading
import time
from hashlib import sha256
from pathlib import Path
from typing import Optional
from urllib.parse import ParseResult, urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

# ---------------------------------------------------------------------------
# Config / constants
# ---------------------------------------------------------------------------

LOG_FMT = "%(asctime)s | %(levelname)-8s | %(threadName)s | %(message)s"

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
    "Gecko/20100101 Firefox/128.0"
}

TIMEOUT = 15  # seconds
CHUNK_SIZE = 8192  # bytes

# Conservative margins under common OS limits (~255–260 bytes)
MAX_PATH_LEN = 240
MAX_SEG_LEN = 120


# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

logging.basicConfig(
    filename="web_scraper.log",
    level=logging.DEBUG,
    format=LOG_FMT,
    datefmt="%H:%M:%S",
    force=True,
)
_console = logging.StreamHandler(sys.stdout)
_console.setLevel(logging.INFO)
_console.setFormatter(logging.Formatter(LOG_FMT, datefmt="%H:%M:%S"))
logging.getLogger().addHandler(_console)
log = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# HTTP session (retry, timeouts, custom UA)
# ---------------------------------------------------------------------------

SESSION = requests.Session()
RETRY_STRAT = Retry(
    total=5,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],
)
SESSION.mount("http://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.mount("https://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.headers.update(DEFAULT_HEADERS)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def create_dir(path: Path) -> None:
    """Create path (and parents) if it does not already exist."""
    if not path.exists():
        path.mkdir(parents=True, exist_ok=True)
        log.debug("Created directory %s", path)


def sanitize(url_fragment: str) -> str:
    """Strip back-references and Windows backslashes."""
    return url_fragment.replace("\\", "/").replace("..", "").strip()


NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}


def is_httpish(u: str) -> bool:
    """True iff the URL is http(s) or relative (no scheme)."""
    p = urlparse(u)
    return (p.scheme in ("http", "https")) or (p.scheme == "")


def is_non_fetchable(u: str) -> bool:
    """True iff the URL clearly shouldn't be fetched (mailto:, tel:, data:, ...)."""
    p = urlparse(u)
    return p.scheme in NON_FETCHABLE_SCHEMES


def is_internal(link: str, root_netloc: str) -> bool:
    """Return True if link belongs to root_netloc (or is protocol-relative)."""
    parsed = urlparse(link)
    return not parsed.netloc or parsed.netloc == root_netloc


def _shorten_segment(segment: str, limit: int = MAX_SEG_LEN) -> str:
    """
    Shorten a single path segment if over limit.
    Preserve extension; append a short hash to keep it unique.
    """
    if len(segment) <= limit:
        return segment
    p = Path(segment)
    stem, suffix = p.stem, p.suffix
    h = sha256(segment.encode("utf-8")).hexdigest()[:12]
    # leave room for '-' + hash + suffix
    keep = max(0, limit - len(suffix) - 13)
    return f"{stem[:keep]}-{h}{suffix}"


def to_local_path(parsed: ParseResult, site_root: Path) -> Path:
    """
    Map an internal URL to a local file path under site_root.

    - Adds 'index.html' where appropriate.
    - Converts extensionless paths to '.html'.
    - Appends a short query-hash when ?query is present to avoid collisions.
    - Enforces per-segment and overall path length limits. If still too long,
      hashes the leaf name.
    """
    rel = parsed.path.lstrip("/")
    if not rel:
        rel = "index.html"
    elif rel.endswith("/"):
        rel += "index.html"
    elif not Path(rel).suffix:
        rel += ".html"

    if parsed.query:
        qh = sha256(parsed.query.encode("utf-8")).hexdigest()[:10]
        p = Path(rel)
        rel = str(p.with_name(f"{p.stem}-q{qh}{p.suffix}"))

    # Shorten individual segments
    parts = Path(rel).parts
    parts = tuple(_shorten_segment(seg, MAX_SEG_LEN) for seg in parts)
    local_path = site_root / Path(*parts)

    # If full path is still too long, hash the leaf
    if len(str(local_path)) > MAX_PATH_LEN:
        p = local_path
        h = sha256(parsed.geturl().encode("utf-8")).hexdigest()[:16]
        leaf = _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
        local_path = p.with_name(leaf)

    return local_path


def safe_write_text(path: Path, text: str, encoding: str = "utf-8") -> Path:
    """
    Write text to path, falling back to a hashed filename if OS rejects it
    (e.g., filename too long). Returns the final path used.
    """
    try:
        path.write_text(text, encoding=encoding)
        return path
    except OSError as exc:
        log.warning("Write failed for %s: %s. Falling back to hashed leaf.", path, exc)
        p = path
        h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
        fallback = p.with_name(_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN))
        create_dir(fallback.parent)
        fallback.write_text(text, encoding=encoding)
        return fallback


# ---------------------------------------------------------------------------
# Fetchers
# ---------------------------------------------------------------------------


def fetch_html(url: str) -> Optional[BeautifulSoup]:
    """Download url and return a BeautifulSoup tree (or None on error)."""
    try:
        resp = SESSION.get(url, timeout=TIMEOUT)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")
    except Exception as exc:  # noqa: BLE001
        log.warning("HTTP error for %s – %s", url, exc)
        return None


def fetch_binary(url: str, dest: Path) -> None:
    """Stream url to dest unless it already exists. Safe against long paths."""
    if dest.exists():
        return
    try:
        resp = SESSION.get(url, timeout=TIMEOUT, stream=True)
        resp.raise_for_status()
        create_dir(dest.parent)
        try:
            with dest.open("wb") as fh:
                for chunk in resp.iter_content(CHUNK_SIZE):
                    fh.write(chunk)
            log.debug("Saved resource -> %s", dest)
        except OSError as exc:
            # Fallback to hashed leaf if OS rejects path
            log.warning("Binary write failed for %s: %s. Using fallback.", dest, exc)
            p = dest
            h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
            fallback = p.with_name(
                _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
            )
            create_dir(fallback.parent)
            with fallback.open("wb") as fh:
                for chunk in resp.iter_content(CHUNK_SIZE):
                    fh.write(chunk)
            log.debug("Saved resource (fallback) -> %s", fallback)
    except Exception as exc:  # noqa: BLE001
        log.error("Failed to save %s – %s", url, exc)


# ---------------------------------------------------------------------------
# Link rewriting
# ---------------------------------------------------------------------------


def rewrite_links(
    soup: BeautifulSoup, page_url: str, site_root: Path, page_dir: Path
) -> None:
    """Rewrite internal links to local relative paths under site_root."""
    root_netloc = urlparse(page_url).netloc
    for tag in soup.find_all(["a", "img", "script", "link"]):
        attr = "href" if tag.name in {"a", "link"} else "src"
        if not tag.has_attr(attr):
            continue
        original = sanitize(tag[attr])
        if (
            original.startswith("#")
            or is_non_fetchable(original)
            or not is_httpish(original)
        ):
            continue
        abs_url = urljoin(page_url, original)
        if not is_internal(abs_url, root_netloc):
            continue  # external – leave untouched
        local_path = to_local_path(urlparse(abs_url), site_root)
        try:
            tag[attr] = os.path.relpath(local_path, page_dir)
        except ValueError:
            # Different drives on Windows, etc.
            tag[attr] = str(local_path)


# ---------------------------------------------------------------------------
# Crawl coordinator
# ---------------------------------------------------------------------------


def crawl_site(start_url: str, root: Path, max_pages: int, threads: int) -> None:
    """Breadth-first crawl limited to max_pages. Downloads assets via workers."""
    q_pages: queue.Queue[str] = queue.Queue()
    q_pages.put(start_url)
    seen_pages: set[str] = set()
    download_q: queue.Queue[tuple[str, Path]] = queue.Queue()

    def worker() -> None:
        while True:
            try:
                url, dest = download_q.get(timeout=3)
            except queue.Empty:
                return
            if is_non_fetchable(url) or not is_httpish(url):
                log.debug("Skip non-fetchable: %s", url)
                download_q.task_done()
                continue
            fetch_binary(url, dest)
            download_q.task_done()

    workers: list[threading.Thread] = []
    for i in range(max(1, threads)):
        t = threading.Thread(target=worker, name=f"DL-{i+1}", daemon=True)
        t.start()
        workers.append(t)

    start_time = time.time()
    root_netloc = urlparse(start_url).netloc

    while not q_pages.empty() and len(seen_pages) < max_pages:
        page_url = q_pages.get()
        if page_url in seen_pages:
            continue
        seen_pages.add(page_url)
        log.info("[%s/%s] %s", len(seen_pages), max_pages, page_url)

        soup = fetch_html(page_url)
        if soup is None:
            continue

        # Gather links & assets
        for tag in soup.find_all(["img", "script", "link", "a"]):
            link = tag.get("src") or tag.get("href")
            if not link:
                continue
            link = sanitize(link)
            if link.startswith("#") or is_non_fetchable(link) or not is_httpish(link):
                continue
            abs_url = urljoin(page_url, link)
            parsed = urlparse(abs_url)
            if not is_internal(abs_url, root_netloc):
                continue

            dest_path = to_local_path(parsed, root)
            # HTML?
            if parsed.path.endswith("/") or not Path(parsed.path).suffix:
                if abs_url not in seen_pages and abs_url not in list(
                    q_pages.queue
                ):  # type: ignore[arg-type]
                    q_pages.put(abs_url)
            else:
                download_q.put((abs_url, dest_path))

        # Save current page
        local_path = to_local_path(urlparse(page_url), root)
        create_dir(local_path.parent)
        rewrite_links(soup, page_url, root, local_path.parent)
        html = soup.prettify()
        final_path = safe_write_text(local_path, html, encoding="utf-8")
        log.debug("Saved page %s", final_path)

    download_q.join()
    elapsed = time.time() - start_time
    if seen_pages:
        log.info(
            "Crawl finished: %s pages in %.2fs (%.2fs avg)",
            len(seen_pages),
            elapsed,
            elapsed / len(seen_pages),
        )
    else:
        log.warning("Nothing downloaded – check URL or connectivity")


# ---------------------------------------------------------------------------
# Helper function for output folder
# ---------------------------------------------------------------------------


def make_root(url: str, custom: Optional[str]) -> Path:
    """Derive output folder from URL if custom not supplied."""
    return Path(custom) if custom else Path(urlparse(url).netloc.replace(".", "_"))


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(
        description="Recursively mirror a website for offline use.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    p.add_argument(
        "--url",
        required=True,
        help="Starting URL to crawl (e.g., https://example.com/).",
    )
    p.add_argument(
        "--destination",
        default=None,
        help="Output folder (defaults to a folder derived from the URL).",
    )
    p.add_argument(
        "--max-pages",
        type=int,
        default=50,
        help="Maximum number of HTML pages to crawl.",
    )
    p.add_argument(
        "--threads",
        type=int,
        default=6,
        help="Number of concurrent download workers.",
    )
    return p.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.max_pages < 1:
        log.error("--max-pages must be >= 1")
        sys.exit(2)
    if args.threads < 1:
        log.error("--threads must be >= 1")
        sys.exit(2)

    host = args.url
    root = make_root(args.url, args.destination)
    crawl_site(host, root, args.max_pages, args.threads)