WebHarvest
Web asset extractor and link harvester for OSINT and reconnaissance
v1.0.1
Linux
Quick Start
Install
curl -fsSL https://cli.johlem.net/install.sh | bash -s -- webharvest
Uninstall
curl -fsSL https://cli.johlem.net/uninstall.sh | bash -s -- webharvest
NixOS / Nix
nix profile install "tarball+https://cli.johlem.net/releases/cli-johlem-net-latest.tar.gz#webharvest"
Features
- Four subcommands: links, crawl, download, report
- Link extraction: anchors, images, scripts, stylesheets, iframes, media
- JS URL extraction — pull API endpoints out of inline `<script>` bodies
- robots.txt + sitemap.xml auto-discovery on every run
- HTML comment + meta-generator fingerprint (WordPress version, generator, framework)
- Form discovery — action / method / field names / has-password-field flag
- Recursive same-origin crawl by default; `--include-external` opt-in
- Concurrent fetches with `--concurrency N` (tokio semaphore)
- Asset download with SHA-256 hashing for diff-across-runs detection
- Type-filtered downloads: images, documents, pdf, archives, media, scripts, config (plus literal extensions)
- `--respect-robots` flag honours Disallow rules during crawl
- `--tor` routing via SOCKS5 127.0.0.1:9050 (operator anonymity)
- Four export formats: text, JSON, CSV, Markdown
- Dry-run mode for download (preview without fetching)
- Zero runtime dependencies — single static binary
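The JS URL extraction feature listed above can be illustrated with a small sketch. This is not WebHarvest's own implementation. It is a hypothetical approach that collects the bodies of inline `<script>` elements and applies an assumed regular expression for quoted absolute URLs and root-relative paths. The names `InlineScriptCollector`, `extract_js_urls`, and `URL_PATTERN` are illustrative only.

```python
import re
from html.parser import HTMLParser

# Assumed pattern (not taken from WebHarvest): a quoted absolute URL
# or a quoted root-relative path such as "/api/v1/users".
URL_PATTERN = re.compile(r"""["'](https?://[^"'\s]+|/[A-Za-z0-9_\-./]+)["']""")


class InlineScriptCollector(HTMLParser):
    """Collect the text of <script> elements that have no src attribute."""

    def __init__(self) -> None:
        super().__init__()
        self._in_inline_script = False
        self.bodies: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Only scripts without a src attribute carry an inline body.
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline_script = True
            self.bodies.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        # The parser may deliver a script body in several chunks.
        if self._in_inline_script:
            self.bodies[-1] += data


def extract_js_urls(html: str) -> list[str]:
    """Return the sorted, de-duplicated URL candidates found in inline scripts."""
    collector = InlineScriptCollector()
    collector.feed(html)
    found = set()
    for body in collector.bodies:
        found.update(URL_PATTERN.findall(body))
    return sorted(found)
```

A regex-based pass like this will miss URLs that are built by string concatenation at runtime, so the results are best treated as candidates for manual review.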
Requirements
Runtime
Linux x86_64 (zero runtime deps)
Dependencies
Zero third-party dependencies — Python standard library only.
Usage
webharvest <command> [OPTIONS]
Subcommands
| Command | Description |
|---|---|
| `links` | Extract and export all links from target |
| `download` | Download files matching type filters |
| `crawl` | Recursive discovery + optional download |
| `report` | Full site report (links, assets, externals, forms) |
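For scripted use, the JSON produced by `links` can be post-processed. The sketch below assumes that `webharvest links <target> --format json` writes only a JSON array to stdout, with the `url` and `tag` fields shown in the `links` example in this document. That stdout assumption is not confirmed by the tool's documentation. `run_links` and `group_by_tag` are hypothetical helper names.

```python
import json
import subprocess
from collections import defaultdict


def run_links(target: str) -> str:
    """Run `webharvest links <target> --format json` and return its stdout.

    Assumes the binary is on PATH and that stdout holds only the JSON array.
    """
    completed = subprocess.run(
        ["webharvest", "links", target, "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def group_by_tag(json_text: str) -> dict[str, list[str]]:
    """Group link URLs by the HTML tag they were found in."""
    grouped = defaultdict(list)
    for record in json.loads(json_text):
        grouped[record["tag"]].append(record["url"])
    return dict(grouped)
```

A typical call would be `group_by_tag(run_links("https://example.com"))`, which separates anchors from images, scripts, and other tags.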
Type Groups
| Group | Extensions |
|---|---|
| images | .jpg .jpeg .png .gif .bmp .svg .webp .ico .tiff .avif |
| pdf | .pdf |
| office | .doc .docx .xls .xlsx .ppt .pptx .odt .ods .odp .rtf |
| archives | .zip .tar .gz .bz2 .xz .7z .rar .tgz |
| scripts | .js .mjs .ts .py .sh .ps1 .bat .rb .php |
| stylesheets | .css .scss .less |
| fonts | .woff .woff2 .ttf .otf .eot |
| videos | .mp4 .webm .avi .mkv .mov .flv .wmv |
| audio | .mp3 .wav .ogg .flac .aac .m4a |
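A `--types` value can mix group names with literal extensions, as in `images,pdf` or `.pdf,.docx`. The sketch below shows one way such a value could be expanded into a set of extensions. It is an illustration, not the tool's code: `TYPE_GROUPS` is an abbreviated copy of the table, `expand_types` is a hypothetical helper, and the `.pdf` entry for the `pdf` group is inferred from the download example's output rather than stated in the table.

```python
# Abbreviated copy of the Type Groups table (assumption: pdf maps to .pdf).
TYPE_GROUPS = {
    "images": {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg",
               ".webp", ".ico", ".tiff", ".avif"},
    "pdf": {".pdf"},
    "archives": {".zip", ".tar", ".gz", ".bz2", ".xz", ".7z", ".rar", ".tgz"},
    "fonts": {".woff", ".woff2", ".ttf", ".otf", ".eot"},
}


def expand_types(spec: str) -> list[str]:
    """Expand a comma-separated list of groups and literal extensions."""
    extensions: set[str] = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if not token:
            continue
        if token in TYPE_GROUPS:
            # Known group name: add every extension in the group.
            extensions |= TYPE_GROUPS[token]
        elif token.startswith("."):
            # Literal extension such as ".docx".
            extensions.add(token)
        else:
            raise ValueError(f"unknown type group: {token!r}")
    return sorted(extensions)
```

Rejecting unknown names early, instead of silently ignoring them, is a design choice in this sketch. The real tool may behave differently.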
Global Options
| Flag | Description |
|---|---|
| --workers, -w | Concurrent workers (default: 4) |
| --delay | Delay between requests in seconds |
| --timeout | Request timeout in seconds (default: 30) |
| --max-size | Max file size in bytes (0 = unlimited) |
| --user-agent | Custom User-Agent string |
| --headers | Extra headers (semicolon-separated) |
| --cookies | Cookies as name=value pairs |
| --auth | Basic auth as user:pass |
| --no-verify-ssl | Disable SSL verification |
| --ignore-robots | Ignore robots.txt |
| --include | Regex: only process matching URLs |
| --exclude | Regex: skip matching URLs |
| --verbose, -v | Verbose output |
| --quiet, -q | Suppress progress output |
| --json | JSON output to stdout |
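The interaction of `--include` and `--exclude` can be pictured with a short sketch. Two points here are assumptions, since the table does not specify them: patterns are applied as unanchored searches, and an exclude match overrides an include match. `filter_urls` is a hypothetical helper, not part of the tool.

```python
import re
from typing import Iterable, Optional


def filter_urls(
    urls: Iterable[str],
    include: Optional[str] = None,
    exclude: Optional[str] = None,
) -> list[str]:
    """Keep URLs that match `include` (if given) and do not match `exclude`."""
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    kept = []
    for url in urls:
        # Assumed semantics: unanchored search anywhere in the URL.
        if include_re and not include_re.search(url):
            continue
        # Assumed precedence: exclude wins over include.
        if exclude_re and exclude_re.search(url):
            continue
        kept.append(url)
    return kept
```

With `include=r"\.pdf$"` and `exclude=r"/draft/"`, only PDF URLs outside a `/draft/` path would remain.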
Examples
Extract all links
$ webharvest links https://example.com --format json
[
{"url": "https://example.com/about", "tag": "a", "attr": "href", "context": "About us"},
{"url": "https://example.com/logo.png", "tag": "img", "attr": "src", "context": ""}
]
Download all images and PDFs
$ webharvest download https://target.com --types images,pdf --output ./loot/
[webharvest] Crawling https://target.com (depth=1) for: .jpg, .jpeg, .pdf, .png ...
[webharvest] Found 14 files to download
[download] https://target.com/report.pdf
-> ./loot/pdf/report.pdf (2.4 MB) [a1b2c3d4e5f6]
[download] https://target.com/img/logo.png
-> ./loot/images/logo.png (45.2 KB) [f6e5d4c3b2a1]
...
Recursive crawl with depth
$ webharvest crawl https://target.com --depth 3 --types .pdf,.docx --dry-run
[webharvest] Crawling https://target.com (depth=3)
[crawl] https://target.com
[crawl] https://target.com/docs/
[crawl] https://target.com/docs/guide/
[dry-run] Would download: https://target.com/docs/manual.pdf
[dry-run] Would download: https://target.com/docs/guide/setup.docx
Full site report
$ webharvest report https://target.com --format markdown -o report.md
[webharvest] Generating report for https://target.com (depth=2)
[webharvest] Writing report to report.md
With authentication and rate limiting
$ webharvest download https://intranet.corp.local \
--auth admin:secret \
--types office,pdf \
--delay 1.0 \
--ignore-robots \
--no-verify-ssl
Install Layout
# Standard layout for all cli.johlem.net tools
~/.local/lib/webharvest/ # Tool source files
~/.local/bin/webharvest # Executable wrapper
~/.local/log/cli.johlem.net/webharvest_*.log # Install/uninstall logs
After install, run `webharvest --help` (requires `~/.local/bin` in your `PATH`).
Integrity
Verify your download against these SHA256 checksums:
| File | SHA256 |
|---|---|
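Checking a downloaded file against a published SHA256 value can be done with Python's standard `hashlib`. The sketch below is a generic illustration. `verify_sha256` is a hypothetical helper, and both the file path and the expected digest must come from you, since no checksum values are listed here.

```python
import hashlib


def verify_sha256(path: str, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Return True if the file's SHA-256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # Tolerate surrounding whitespace and uppercase hex in the published value.
    return digest.hexdigest() == expected_hex.strip().lower()
```

On Linux, running `sha256sum <file>` and comparing the output by eye achieves the same result.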
This tool is provided as-is with no warranty. Use at your own risk. Always review scripts before running them. The author is not responsible for any damage or data loss. Intended for authorized security testing, research, and educational purposes only.