WebHarvest

Web asset extractor and link harvester for OSINT and reconnaissance

v1.0.1

Linux

Quick Start

Install

curl -fsSL https://cli.johlem.net/install.sh | bash -s -- webharvest

Uninstall

curl -fsSL https://cli.johlem.net/uninstall.sh | bash -s -- webharvest

NixOS / Nix

nix profile install "tarball+https://cli.johlem.net/releases/cli-johlem-net-latest.tar.gz#webharvest"

Features

Four subcommands: links, crawl, download, report
Link extraction: anchors, images, scripts, stylesheets, iframes, media
JS URL extraction — pull API endpoints out of inline `<script>` bodies
robots.txt + sitemap.xml auto-discovery on every run
HTML comment + meta-generator fingerprint (WordPress version, generator, framework)
Form discovery — action / method / field names / has-password-field flag
Recursive same-origin crawl by default; `--include-external` opt-in
Concurrent fetches with `--concurrency N` (tokio semaphore)
Asset download with SHA-256 hashing for diff-across-runs detection
Type-filtered downloads: images, documents, pdf, archives, media, scripts, config (plus literal extensions)
`--respect-robots` flag honours Disallow rules during crawl
`--tor` routing via SOCKS5 127.0.0.1:9050 (operator anonymity)
Four export formats: text, JSON, CSV, Markdown
Dry-run mode for download (preview without fetching)
Zero runtime dependencies — single static binary

Requirements

Runtime

Linux x86_64 (zero runtime deps)

Dependencies

Zero third-party dependencies — Python standard library only.

Usage

webharvest <command> [OPTIONS]

Subcommands

Command	Description
`links`	Extract and export all links from target
`download`	Download files matching type filters
`crawl`	Recursive discovery + optional download
`report`	Full site report (links, assets, externals, forms)

Type Groups

Group	Extensions
`images`	.jpg .jpeg .png .gif .bmp .svg .webp .ico .tiff .avif
`pdf`	.pdf
`office`	.doc .docx .xls .xlsx .ppt .pptx .odt .ods .odp .rtf
`archives`	.zip .tar .gz .bz2 .xz .7z .rar .tgz
`scripts`	.js .mjs .ts .py .sh .ps1 .bat .rb .php
`stylesheets`	.css .scss .less
`fonts`	.woff .woff2 .ttf .otf .eot
`videos`	.mp4 .webm .avi .mkv .mov .flv .wmv
`audio`	.mp3 .wav .ogg .flac .aac .m4a
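
How a `--types` value could be resolved

The examples below pass both group names (`images,pdf`) and literal extensions (`.pdf,.docx`) to `--types`. The following is a minimal sketch of how such a value might be expanded into a flat extension list. The group table is copied from the section above; the function name `expand_types` and its error handling are assumptions for illustration, not WebHarvest's actual code.

```python
# Hypothetical sketch: the parsing rules here are assumed, not taken from WebHarvest.
TYPE_GROUPS = {
    "images": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico", ".tiff", ".avif"],
    "pdf": [".pdf"],
    "office": [".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".odt", ".ods", ".odp", ".rtf"],
    "archives": [".zip", ".tar", ".gz", ".bz2", ".xz", ".7z", ".rar", ".tgz"],
    "scripts": [".js", ".mjs", ".ts", ".py", ".sh", ".ps1", ".bat", ".rb", ".php"],
    "stylesheets": [".css", ".scss", ".less"],
    "fonts": [".woff", ".woff2", ".ttf", ".otf", ".eot"],
    "videos": [".mp4", ".webm", ".avi", ".mkv", ".mov", ".flv", ".wmv"],
    "audio": [".mp3", ".wav", ".ogg", ".flac", ".aac", ".m4a"],
}


def expand_types(spec):
    """Expand a comma-separated --types value into a sorted list of extensions."""
    extensions = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if not token:
            continue  # tolerate stray commas
        if token in TYPE_GROUPS:
            extensions.update(TYPE_GROUPS[token])  # named group
        elif token.startswith("."):
            extensions.add(token)  # literal extension such as ".docx"
        else:
            raise ValueError(f"unknown type group: {token!r}")
    return sorted(extensions)


if __name__ == "__main__":
    print(expand_types(".pdf,.docx"))  # ['.docx', '.pdf']
```

Sorting the result keeps the list stable between runs, which makes log lines easier to compare.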

Global Options

Flag	Description
`--workers, -w`	Concurrent workers (default: 4)
`--delay`	Delay between requests in seconds
`--timeout`	Request timeout in seconds (default: 30)
`--max-size`	Max file size in bytes (0 = unlimited)
`--user-agent`	Custom User-Agent string
`--headers`	Extra headers (semicolon-separated)
`--cookies`	Cookies as name=value pairs
`--auth`	Basic auth as user:pass
`--no-verify-ssl`	Disable SSL verification
`--ignore-robots`	Ignore robots.txt
`--include`	Regex: only process matching URLs
`--exclude`	Regex: skip matching URLs
`--verbose, -v`	Verbose output
`--quiet, -q`	Suppress progress output
`--json`	JSON output to stdout
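
How `--include` and `--exclude` might combine

The table above describes `--include` and `--exclude` as regular expressions that keep or skip URLs. The sketch below shows one plausible way the two filters could interact. It assumes unanchored matching (`re.search`) and that `--exclude` wins when both match; neither detail is confirmed by this page, and `url_allowed` is a hypothetical name.

```python
import re


def url_allowed(url, include=None, exclude=None):
    """Return True if a URL passes the (assumed) include/exclude rules."""
    # Assumption: patterns match anywhere in the URL, not only at the start.
    if include and not re.search(include, url):
        return False
    # Assumption: an exclude match overrides an include match.
    if exclude and re.search(exclude, url):
        return False
    return True


if __name__ == "__main__":
    urls = [
        "https://target.com/docs/manual.pdf",
        "https://target.com/docs/logout",
        "https://target.com/blog/post",
    ]
    kept = [u for u in urls if url_allowed(u, include=r"/docs/", exclude=r"logout")]
    print(kept)  # ['https://target.com/docs/manual.pdf']
```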

Examples

Extract all links

$ webharvest links https://example.com --format json
[
  {"url": "https://example.com/about", "tag": "a", "attr": "href", "context": "About us"},
  {"url": "https://example.com/logo.png", "tag": "img", "attr": "src", "context": ""}
]
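
Post-processing the JSON output

The JSON above is easy to consume from a script. The sketch below groups harvested URLs by the tag they came from, using only the field names shown in the sample (`url`, `tag`, `attr`, `context`). The function name `group_by_tag` is hypothetical, and the sample string stands in for real output that would normally be read from stdin or a file.

```python
import json
from collections import defaultdict

# Sample copied from the example output above.
SAMPLE = '''[
  {"url": "https://example.com/about", "tag": "a", "attr": "href", "context": "About us"},
  {"url": "https://example.com/logo.png", "tag": "img", "attr": "src", "context": ""}
]'''


def group_by_tag(raw_json):
    """Map each HTML tag name to the list of URLs extracted from it."""
    grouped = defaultdict(list)
    for entry in json.loads(raw_json):
        grouped[entry["tag"]].append(entry["url"])
    return dict(grouped)


if __name__ == "__main__":
    for tag, urls in group_by_tag(SAMPLE).items():
        print(tag, len(urls))
```

To use it with live output, replace `SAMPLE` with `sys.stdin.read()` and pipe the `links` command into the script.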

Download all images and PDFs

$ webharvest download https://target.com --types images,pdf --output ./loot/
[webharvest] Crawling https://target.com (depth=1) for: .jpg, .jpeg, .pdf, .png ...
[webharvest] Found 14 files to download
[download] https://target.com/report.pdf
           -> ./loot/pdf/report.pdf (2.4 MB) [a1b2c3d4e5f6]
[download] https://target.com/img/logo.png
           -> ./loot/images/logo.png (45.2 KB) [f6e5d4c3b2a1]
...
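
Comparing downloads across runs

The feature list mentions SHA-256 hashing for detecting changes between runs, and the bracketed values in the output above look like shortened digests (an assumption; the page does not say how they are derived). The sketch below shows one way to do the same comparison independently on two output directories. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Hash a file in chunks so large downloads do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(old, new):
    """Report files added, removed, or changed between two manifests."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old) & set(new) if old[k] != new[k]),
    }
```

A typical use would be `diff_manifests(build_manifest("./loot-monday"), build_manifest("./loot-friday"))`, where both directory names are placeholders.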

Recursive crawl with depth

$ webharvest crawl https://target.com --depth 3 --types .pdf,.docx --dry-run
[webharvest] Crawling https://target.com (depth=3)
[crawl] https://target.com
[crawl]   https://target.com/docs/
[crawl]     https://target.com/docs/guide/
[dry-run] Would download: https://target.com/docs/manual.pdf
[dry-run] Would download: https://target.com/docs/guide/setup.docx
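
What a depth-limited, same-origin crawl looks like

The run above visits pages level by level and stays on the target host. As an illustration only, the sketch below performs a breadth-first traversal over an in-memory link map with a depth limit and a same-origin check. It does no network I/O; the `links` dictionary stands in for fetched pages. How WebHarvest counts depth and defines "same origin" is not documented here, so both are assumptions, and the names `same_origin` and `crawl_order` are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a, b):
    """Assumed definition: scheme, host, and port must all match."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.hostname, pa.port) == (pb.scheme, pb.hostname, pb.port)


def crawl_order(start, links, max_depth):
    """Return (url, depth) pairs in breadth-first order.

    links maps a URL to the URLs found on that page (a stand-in for fetching).
    Assumption: the start page is depth 0 and pages at max_depth are not expanded.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue
        for nxt in links.get(url, []):
            if nxt not in seen and same_origin(start, nxt):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


if __name__ == "__main__":
    site = {
        "https://target.com": ["https://target.com/docs/", "https://other.example/x"],
        "https://target.com/docs/": ["https://target.com/docs/guide/"],
    }
    for url, depth in crawl_order("https://target.com", site, max_depth=2):
        print("  " * depth + url)
```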

Full site report

$ webharvest report https://target.com --format markdown -o report.md
[webharvest] Generating report for https://target.com (depth=2)
[webharvest] Writing report to report.md

With authentication and rate limiting

$ webharvest download https://intranet.corp.local \
    --auth admin:secret \
    --types office,pdf \
    --delay 1.0 \
    --ignore-robots \
    --no-verify-ssl

Install Layout

# Standard layout for all cli.johlem.net tools
~/.local/lib/webharvest/                       # Tool source files
~/.local/bin/webharvest                        # Executable wrapper
~/.local/log/cli.johlem.net/webharvest_*.log   # Install/uninstall logs

After installing, run `webharvest --help` (requires `~/.local/bin` in your PATH).

Integrity

Verify your download against these SHA256 checksums:

File	SHA256

This tool is provided as-is with no warranty. Use it at your own risk, and always review scripts before running them. The author is not responsible for any damage or data loss. It is intended for authorized security testing, research, and educational purposes only.