first commit

commit 06a29f4640
4 changed files with 518 additions and 0 deletions

README.md · new file · 101 lines
@@ -0,0 +1,101 @@

# 🌐 Website Downloader CLI

[Python application CI](https://github.com/PKHarsimran/website-downloader/actions/workflows/python-app.yml)
[Lint](https://github.com/PKHarsimran/website-downloader/actions/workflows/lint.yml)
[Dependency graph auto-submission](https://github.com/PKHarsimran/website-downloader/actions/workflows/dependency-graph/auto-submission)
[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/)
[Code style: Black](https://github.com/psf/black)

Website Downloader CLI is a **tiny, pure-Python** site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:

* Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
* Downloads **all** assets (images, CSS, JS, …)
* Rewrites internal links so pages open flawlessly from your local disk
* Streams files concurrently with automatic retry / back-off
* Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …)
* Handles extremely long filenames safely via hashing and graceful fallbacks

> Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.

---

## 🚀 Quick Start

```bash
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader

# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt

# 3. Mirror a site – no prompts needed
python website-downloader.py \
  --url https://harsim.ca \
  --destination harsim_ca_backup \
  --max-pages 100 \
  --threads 8
```

---

## 🛠️ Libraries Used

| Library | Emoji | Purpose in this project |
|---------|-------|-------------------------|
| **requests** + **urllib3.Retry** | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts |
| **BeautifulSoup (bs4)** | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| **argparse** | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| **logging** | 📝 | Dual console / file logging with colour + crawl-time stats |
| **threading** & **queue** | ⚙️ | Lightweight thread-pool that streams images/CSS/JS concurrently |
| **pathlib** & **os** | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| **time** | ⏱️ | Measures per-page latency and total crawl duration |
| **urllib.parse** | 🔗 | Safely joins / analyses URLs and rewrites them to local relative paths |
| **sys** | 🖥️ | Directs log output to `stdout` and handles graceful interrupts (`Ctrl-C`) |

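The retry / back-off behaviour in the first row comes from pairing `requests` with `urllib3.util.Retry`. The sketch below mirrors the session setup used in `website-downloader.py` (same retry budget, back-off factor, and status codes), shown here in isolation:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

# Retry GET/HEAD requests up to 5 times with exponential back-off
# whenever the server answers with a transient error code.
retry = Retry(
    total=5,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))
session.mount("https://", HTTPAdapter(max_retries=retry))

# Flaky responses are retried transparently; only the final failure raises.
response = session.get("https://example.com/", timeout=15)
```
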
## 🗂️ Project Structure

| Path | What it is | Key features |
|------|------------|--------------|
| `website-downloader.py` | **Single-entry CLI** that performs the entire crawl *and* link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries<br>• Breadth-first crawl capped by `--max-pages` (default = 50)<br>• Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel<br>• Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`)<br>• Smart output folder naming (`example.com` → `example_com`)<br>• Colourised console + file logging with per-page latency and crawl summary |
| `requirements.txt` | Minimal dependency pin-list. Only **`requests`** and **`beautifulsoup4`** are third-party; everything else is Python ≥ 3.10 std-lib. | |
| `web_scraper.log` | Auto-generated run log (rotates/overwrites on each invocation). Useful for troubleshooting or audit trails. | |
| `README.md` | The document you’re reading – quick-start, flags, and architecture notes. | |
| *(output folder)* | Created at runtime (`example_com/ …`) – mirrors the remote directory tree with `index.html` stubs and all static assets. | |

> **Removed:** The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.

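The link rewriting described above reduces to a small URL-to-path mapping. The following is a simplified sketch of what `to_local_path()` in `website-downloader.py` does; the helper name `local_path_for` is illustrative only, and the real function additionally shortens over-long segments and hashes paths that exceed OS limits:

```python
from hashlib import sha256
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url: str, site_root: Path) -> Path:
    """Simplified version of the mapping performed by to_local_path()."""
    parsed = urlparse(url)
    rel = parsed.path.lstrip("/") or "index.html"
    if rel.endswith("/"):
        rel += "index.html"          # pretty-URL folder -> index.html
    elif not Path(rel).suffix:
        rel += ".html"               # extensionless page -> .html
    if parsed.query:                 # ?query -> short hash to avoid collisions
        p = Path(rel)
        qh = sha256(parsed.query.encode()).hexdigest()[:10]
        rel = str(p.with_name(f"{p.stem}-q{qh}{p.suffix}"))
    return site_root / rel


root = Path("example_com")
print(local_path_for("https://example.com/about/", root))   # example_com/about/index.html
print(local_path_for("https://example.com/pricing", root))  # example_com/pricing.html
```
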
## ✨ Recent Improvements
|
||||||
|
|
||||||
|
✅ Type Conversion Fix
|
||||||
|
Fixed a TypeError caused by int(..., 10) when non-string arguments were passed.
|
||||||
|
|
||||||
|
✅ Safer Path Handling
|
||||||
|
Added intelligent path shortening and hashing for long filenames to prevent
|
||||||
|
OSError: [Errno 36] File name too long errors.
|
||||||
|
|
||||||
|
✅ Improved CLI Experience
|
||||||
|
Rebuilt argument parsing with argparse for cleaner syntax and validation.
|
||||||
|
|
||||||
|
✅ Code Quality & Linting
|
||||||
|
Applied Black + Flake8 formatting; the project now passes all CI lint checks.
|
||||||
|
|
||||||
|
✅ Logging & Stability
|
||||||
|
Improved error handling, logging, and fallback mechanisms for failed writes.
|
||||||
|
|
||||||
|
✅ Skip Non-Fetchable Schemes
|
||||||
|
The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them.
|
||||||
|
This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in saved HTML.
|
||||||
|
|
||||||
|
|
||||||
|
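Two of the fixes above (safer path handling and scheme skipping) are easy to see in isolation. The sketch below is a condensed rendition of the `is_non_fetchable()` and `_shorten_segment()` helpers defined in `website-downloader.py`:

```python
from hashlib import sha256
from pathlib import Path
from urllib.parse import urlparse

# Schemes that are kept in the saved HTML but never requested.
NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}


def is_non_fetchable(url: str) -> bool:
    return urlparse(url).scheme in NON_FETCHABLE_SCHEMES


def _shorten_segment(segment: str, limit: int = 120) -> str:
    """Trim an over-long path segment, keeping the extension plus a short hash."""
    if len(segment) <= limit:
        return segment
    p = Path(segment)
    digest = sha256(segment.encode("utf-8")).hexdigest()[:12]
    keep = max(0, limit - len(p.suffix) - 13)  # room for '-', 12-char hash, extension
    return f"{p.stem[:keep]}-{digest}{p.suffix}"


print(is_non_fetchable("mailto:someone@example.com"))  # True  -> link left untouched
print(is_non_fetchable("/assets/site.css"))            # False -> queued for download
print(len(_shorten_segment("a" * 300 + ".png")))       # 120   -> safely under the OS limit
```
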
## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## 📜 License

This project is licensed under the MIT License.

## ❤️ Support This Project

[Donate via PayPal](https://www.paypal.com/donate/?business=MVEWG3QAX6UBC&no_recurring=1&item_name=Github+Project+-+Website+downloader&currency_code=CAD)

downloadsite.sh · new executable file · 7 lines
@@ -0,0 +1,7 @@

```bash
#!/bin/bash
source /usr/local/python/website-downloader/.venv/bin/activate
python /usr/local/python/website-downloader/website-downloader.py \
  --url "$1" \
  --destination "$2" \
  --max-pages 100 \
  --threads 8
```

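The wrapper forwards its two positional arguments as `--url` and `--destination`, so a typical invocation (the URL and folder name here are only examples) looks like:

```bash
./downloadsite.sh https://example.com example_com_backup
```
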
requirements.txt · new file · 4 lines
@@ -0,0 +1,4 @@

```text
requests~=2.32.4
beautifulsoup4~=4.13.4
wget~=3.2
urllib3~=2.5.0
```

website-downloader.py · new executable file · 406 lines
@@ -0,0 +1,406 @@

```python
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import logging
import os
import queue
import sys
import threading
import time
from hashlib import sha256
from pathlib import Path
from typing import Optional
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

# ---------------------------------------------------------------------------
# Config / constants
# ---------------------------------------------------------------------------

LOG_FMT = "%(asctime)s | %(levelname)-8s | %(threadName)s | %(message)s"

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
    "Gecko/20100101 Firefox/128.0"
}

TIMEOUT = 15  # seconds
CHUNK_SIZE = 8192  # bytes

# Conservative margins under common OS limits (~255–260 bytes)
MAX_PATH_LEN = 240
MAX_SEG_LEN = 120


# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------

logging.basicConfig(
    filename="web_scraper.log",
    level=logging.DEBUG,
    format=LOG_FMT,
    datefmt="%H:%M:%S",
    force=True,
)
_console = logging.StreamHandler(sys.stdout)
_console.setLevel(logging.INFO)
_console.setFormatter(logging.Formatter(LOG_FMT, datefmt="%H:%M:%S"))
logging.getLogger().addHandler(_console)
log = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# HTTP session (retry, timeouts, custom UA)
# ---------------------------------------------------------------------------

SESSION = requests.Session()
RETRY_STRAT = Retry(
    total=5,
    backoff_factor=0.5,
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "HEAD"],
)
SESSION.mount("http://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.mount("https://", HTTPAdapter(max_retries=RETRY_STRAT))
SESSION.headers.update(DEFAULT_HEADERS)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def create_dir(path: Path) -> None:
    """Create path (and parents) if it does not already exist."""
    if not path.exists():
        path.mkdir(parents=True, exist_ok=True)
        log.debug("Created directory %s", path)


def sanitize(url_fragment: str) -> str:
    """Strip back-references and Windows backslashes."""
    return url_fragment.replace("\\", "/").replace("..", "").strip()


NON_FETCHABLE_SCHEMES = {"mailto", "tel", "sms", "javascript", "data", "geo", "blob"}


def is_httpish(u: str) -> bool:
    """True iff the URL is http(s) or relative (no scheme)."""
    p = urlparse(u)
    return (p.scheme in ("http", "https")) or (p.scheme == "")


def is_non_fetchable(u: str) -> bool:
    """True iff the URL clearly shouldn't be fetched (mailto:, tel:, data:, ...)."""
    p = urlparse(u)
    return p.scheme in NON_FETCHABLE_SCHEMES


def is_internal(link: str, root_netloc: str) -> bool:
    """Return True if link belongs to root_netloc (or is protocol-relative)."""
    parsed = urlparse(link)
    return not parsed.netloc or parsed.netloc == root_netloc


def _shorten_segment(segment: str, limit: int = MAX_SEG_LEN) -> str:
    """
    Shorten a single path segment if over limit.
    Preserve extension; append a short hash to keep it unique.
    """
    if len(segment) <= limit:
        return segment
    p = Path(segment)
    stem, suffix = p.stem, p.suffix
    h = sha256(segment.encode("utf-8")).hexdigest()[:12]
    # leave room for '-' + hash + suffix
    keep = max(0, limit - len(suffix) - 13)
    return f"{stem[:keep]}-{h}{suffix}"


def to_local_path(parsed: urlparse, site_root: Path) -> Path:
    """
    Map an internal URL to a local file path under site_root.

    - Adds 'index.html' where appropriate.
    - Converts extensionless paths to '.html'.
    - Appends a short query-hash when ?query is present to avoid collisions.
    - Enforces per-segment and overall path length limits. If still too long,
      hashes the leaf name.
    """
    rel = parsed.path.lstrip("/")
    if not rel:
        rel = "index.html"
    elif rel.endswith("/"):
        rel += "index.html"
    elif not Path(rel).suffix:
        rel += ".html"

    if parsed.query:
        qh = sha256(parsed.query.encode("utf-8")).hexdigest()[:10]
        p = Path(rel)
        rel = str(p.with_name(f"{p.stem}-q{qh}{p.suffix}"))

    # Shorten individual segments
    parts = Path(rel).parts
    parts = tuple(_shorten_segment(seg, MAX_SEG_LEN) for seg in parts)
    local_path = site_root / Path(*parts)

    # If full path is still too long, hash the leaf
    if len(str(local_path)) > MAX_PATH_LEN:
        p = local_path
        h = sha256(parsed.geturl().encode("utf-8")).hexdigest()[:16]
        leaf = _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
        local_path = p.with_name(leaf)

    return local_path


def safe_write_text(path: Path, text: str, encoding: str = "utf-8") -> Path:
    """
    Write text to path, falling back to a hashed filename if OS rejects it
    (e.g., filename too long). Returns the final path used.
    """
    try:
        path.write_text(text, encoding=encoding)
        return path
    except OSError as exc:
        log.warning("Write failed for %s: %s. Falling back to hashed leaf.", path, exc)
        p = path
        h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
        fallback = p.with_name(_shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN))
        create_dir(fallback.parent)
        fallback.write_text(text, encoding=encoding)
        return fallback

# ---------------------------------------------------------------------------
# Fetchers
# ---------------------------------------------------------------------------


def fetch_html(url: str) -> Optional[BeautifulSoup]:
    """Download url and return a BeautifulSoup tree (or None on error)."""
    try:
        resp = SESSION.get(url, timeout=TIMEOUT)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "html.parser")
    except Exception as exc:  # noqa: BLE001
        log.warning("HTTP error for %s – %s", url, exc)
        return None


def fetch_binary(url: str, dest: Path) -> None:
    """Stream url to dest unless it already exists. Safe against long paths."""
    if dest.exists():
        return
    try:
        resp = SESSION.get(url, timeout=TIMEOUT, stream=True)
        resp.raise_for_status()
        create_dir(dest.parent)
        try:
            with dest.open("wb") as fh:
                for chunk in resp.iter_content(CHUNK_SIZE):
                    fh.write(chunk)
            log.debug("Saved resource -> %s", dest)
        except OSError as exc:
            # Fallback to hashed leaf if OS rejects path
            log.warning("Binary write failed for %s: %s. Using fallback.", dest, exc)
            p = dest
            h = sha256(str(p).encode("utf-8")).hexdigest()[:16]
            fallback = p.with_name(
                _shorten_segment(f"{p.stem}-{h}{p.suffix}", MAX_SEG_LEN)
            )
            create_dir(fallback.parent)
            with fallback.open("wb") as fh:
                for chunk in resp.iter_content(CHUNK_SIZE):
                    fh.write(chunk)
            log.debug("Saved resource (fallback) -> %s", fallback)
    except Exception as exc:  # noqa: BLE001
        log.error("Failed to save %s – %s", url, exc)


# ---------------------------------------------------------------------------
# Link rewriting
# ---------------------------------------------------------------------------


def rewrite_links(
    soup: BeautifulSoup, page_url: str, site_root: Path, page_dir: Path
) -> None:
    """Rewrite internal links to local relative paths under site_root."""
    root_netloc = urlparse(page_url).netloc
    for tag in soup.find_all(["a", "img", "script", "link"]):
        attr = "href" if tag.name in {"a", "link"} else "src"
        if not tag.has_attr(attr):
            continue
        original = sanitize(tag[attr])
        if (
            original.startswith("#")
            or is_non_fetchable(original)
            or not is_httpish(original)
        ):
            continue
        abs_url = urljoin(page_url, original)
        if not is_internal(abs_url, root_netloc):
            continue  # external – leave untouched
        local_path = to_local_path(urlparse(abs_url), site_root)
        try:
            tag[attr] = os.path.relpath(local_path, page_dir)
        except ValueError:
            # Different drives on Windows, etc.
            tag[attr] = str(local_path)

# ---------------------------------------------------------------------------
# Crawl coordinator
# ---------------------------------------------------------------------------


def crawl_site(start_url: str, root: Path, max_pages: int, threads: int) -> None:
    """Breadth-first crawl limited to max_pages. Downloads assets via workers."""
    q_pages: queue.Queue[str] = queue.Queue()
    q_pages.put(start_url)
    seen_pages: set[str] = set()
    download_q: queue.Queue[tuple[str, Path]] = queue.Queue()

    def worker() -> None:
        while True:
            try:
                url, dest = download_q.get(timeout=3)
            except queue.Empty:
                return
            if is_non_fetchable(url) or not is_httpish(url):
                log.debug("Skip non-fetchable: %s", url)
                download_q.task_done()
                continue
            fetch_binary(url, dest)
            download_q.task_done()

    workers: list[threading.Thread] = []
    for i in range(max(1, threads)):
        t = threading.Thread(target=worker, name=f"DL-{i+1}", daemon=True)
        t.start()
        workers.append(t)

    start_time = time.time()
    root_netloc = urlparse(start_url).netloc

    while not q_pages.empty() and len(seen_pages) < max_pages:
        page_url = q_pages.get()
        if page_url in seen_pages:
            continue
        seen_pages.add(page_url)
        log.info("[%s/%s] %s", len(seen_pages), max_pages, page_url)

        soup = fetch_html(page_url)
        if soup is None:
            continue

        # Gather links & assets
        for tag in soup.find_all(["img", "script", "link", "a"]):
            link = tag.get("src") or tag.get("href")
            if not link:
                continue
            link = sanitize(link)
            if link.startswith("#") or is_non_fetchable(link) or not is_httpish(link):
                continue
            abs_url = urljoin(page_url, link)
            parsed = urlparse(abs_url)
            if not is_internal(abs_url, root_netloc):
                continue

            dest_path = to_local_path(parsed, root)
            # HTML?
            if parsed.path.endswith("/") or not Path(parsed.path).suffix:
                if abs_url not in seen_pages and abs_url not in list(
                    q_pages.queue
                ):  # type: ignore[arg-type]
                    q_pages.put(abs_url)
            else:
                download_q.put((abs_url, dest_path))

        # Save current page
        local_path = to_local_path(urlparse(page_url), root)
        create_dir(local_path.parent)
        rewrite_links(soup, page_url, root, local_path.parent)
        html = soup.prettify()
        final_path = safe_write_text(local_path, html, encoding="utf-8")
        log.debug("Saved page %s", final_path)

    download_q.join()
    elapsed = time.time() - start_time
    if seen_pages:
        log.info(
            "Crawl finished: %s pages in %.2fs (%.2fs avg)",
            len(seen_pages),
            elapsed,
            elapsed / len(seen_pages),
        )
    else:
        log.warning("Nothing downloaded – check URL or connectivity")


# ---------------------------------------------------------------------------
# Helper function for output folder
# ---------------------------------------------------------------------------


def make_root(url: str, custom: Optional[str]) -> Path:
    """Derive output folder from URL if custom not supplied."""
    return Path(custom) if custom else Path(urlparse(url).netloc.replace(".", "_"))


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------


def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(
        description="Recursively mirror a website for offline use.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    p.add_argument(
        "--url",
        required=True,
        help="Starting URL to crawl (e.g., https://example.com/).",
    )
    p.add_argument(
        "--destination",
        default=None,
        help="Output folder (defaults to a folder derived from the URL).",
    )
    p.add_argument(
        "--max-pages",
        type=int,
        default=50,
        help="Maximum number of HTML pages to crawl.",
    )
    p.add_argument(
        "--threads",
        type=int,
        default=6,
        help="Number of concurrent download workers.",
    )
    return p.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.max_pages < 1:
        log.error("--max-pages must be >= 1")
        sys.exit(2)
    if args.threads < 1:
        log.error("--threads must be >= 1")
        sys.exit(2)

    host = args.url
    root = make_root(args.url, args.destination)
    crawl_site(host, root, args.max_pages, args.threads)
```