WebHarvest

Web asset extractor and link harvester for OSINT and reconnaissance

v1.0.1
Linux

Quick Start

Install

curl -fsSL https://cli.johlem.net/install.sh | bash -s -- webharvest

Uninstall

curl -fsSL https://cli.johlem.net/uninstall.sh | bash -s -- webharvest

NixOS / Nix

nix profile install "tarball+https://cli.johlem.net/releases/cli-johlem-net-latest.tar.gz#webharvest"

Features

Requirements

Runtime

Linux x86_64 (zero runtime deps)

Dependencies

Zero third-party dependencies: only the Python standard library is used.

Usage

webharvest <command> [OPTIONS]

Subcommands

Command     Description
links       Extract and export all links from target
download    Download files matching type filters
crawl       Recursive discovery + optional download
report      Full site report (links, assets, externals, forms)

Type Groups

Group          Extensions
images         .jpg .jpeg .png .gif .bmp .svg .webp .ico .tiff .avif
pdf            .pdf
office         .doc .docx .xls .xlsx .ppt .pptx .odt .ods .odp .rtf
archives       .zip .tar .gz .bz2 .xz .7z .rar .tgz
scripts        .js .mjs .ts .py .sh .ps1 .bat .rb .php
stylesheets    .css .scss .less
fonts          .woff .woff2 .ttf .otf .eot
videos         .mp4 .webm .avi .mkv .mov .flv .wmv
audio          .mp3 .wav .ogg .flac .aac .m4a
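
The group table can be read as a simple lookup from group name to extension set. The following is a minimal sketch of how a --types value such as images,pdf or .pdf,.docx (both forms appear in the examples below) could be expanded and how a URL could be matched to a group. The function names expand_types and group_for_url are hypothetical and are not taken from the WebHarvest source; the tool's actual matching rules may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension sets copied from the Type Groups table.
TYPE_GROUPS = {
    "images": {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".svg", ".webp", ".ico", ".tiff", ".avif"},
    "pdf": {".pdf"},
    "office": {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".odt", ".ods", ".odp", ".rtf"},
    "archives": {".zip", ".tar", ".gz", ".bz2", ".xz", ".7z", ".rar", ".tgz"},
    "scripts": {".js", ".mjs", ".ts", ".py", ".sh", ".ps1", ".bat", ".rb", ".php"},
    "stylesheets": {".css", ".scss", ".less"},
    "fonts": {".woff", ".woff2", ".ttf", ".otf", ".eot"},
    "videos": {".mp4", ".webm", ".avi", ".mkv", ".mov", ".flv", ".wmv"},
    "audio": {".mp3", ".wav", ".ogg", ".flac", ".aac", ".m4a"},
}


def expand_types(spec):
    """Turn 'images,pdf' or '.pdf,.docx' into a set of lowercase extensions.

    Hypothetical helper: group names are expanded via TYPE_GROUPS,
    tokens starting with a dot are taken as literal extensions.
    """
    extensions = set()
    for token in spec.split(","):
        token = token.strip().lower()
        if not token:
            continue
        if token in TYPE_GROUPS:
            extensions |= TYPE_GROUPS[token]
        elif token.startswith("."):
            extensions.add(token)
        else:
            raise ValueError(f"unknown type group: {token!r}")
    return extensions


def group_for_url(url):
    """Return the group whose extensions match the URL path, or None."""
    # Only the path is inspected, so query strings do not affect the match.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    for group, extensions in TYPE_GROUPS.items():
        if suffix in extensions:
            return group
    return None
```

Note that this sketch looks only at the last suffix, so a name ending in .tar.gz is classified through .gz.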

Global Options

Flag               Description
--workers, -w      Concurrent workers (default: 4)
--delay            Delay between requests in seconds
--timeout          Request timeout in seconds (default: 30)
--max-size         Max file size in bytes (0 = unlimited)
--user-agent       Custom User-Agent string
--headers          Extra headers (semicolon-separated)
--cookies          Cookies as name=value pairs
--auth             Basic auth as user:pass
--no-verify-ssl    Disable SSL verification
--ignore-robots    Ignore robots.txt
--include          Regex: only process matching URLs
--exclude          Regex: skip matching URLs
--verbose, -v      Verbose output
--quiet, -q        Suppress progress output
--json             JSON output to stdout
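
To make the interaction of --include and --exclude concrete, here is a minimal sketch of a URL filter built from two regular expressions. It assumes that patterns are matched anywhere in the URL (re.search) and that an exclude match wins over an include match. Neither detail is stated above, so treat both as assumptions; make_url_filter is a hypothetical name.

```python
import re


def make_url_filter(include=None, exclude=None):
    """Build a predicate that says whether a URL should be processed.

    Assumptions (not confirmed by the tool's documentation):
    - patterns match anywhere in the URL (re.search, not re.fullmatch)
    - --exclude takes precedence when both patterns match
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None

    def keep(url):
        if exclude_re and exclude_re.search(url):
            return False
        if include_re and not include_re.search(url):
            return False
        return True

    return keep


if __name__ == "__main__":
    keep = make_url_filter(include=r"/docs/", exclude=r"\.zip$")
    for url in ("https://target.com/docs/manual.pdf",
                "https://target.com/docs/bundle.zip",
                "https://target.com/blog/post"):
        print(url, "->", "keep" if keep(url) else "skip")
```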

Examples

Extract all links

$ webharvest links https://example.com --format json
[
  {"url": "https://example.com/about", "tag": "a", "attr": "href", "context": "About us"},
  {"url": "https://example.com/logo.png", "tag": "img", "attr": "src", "context": ""}
]
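
For readers who want to see how records of this shape can be produced, the following sketch extracts links with the standard library's html.parser and emits the same four fields (url, tag, attr, context). It is an illustration only, not WebHarvest's implementation: the LinkCollector class, the extract_links function, and the set of tag/attribute pairs are assumptions, and using the anchor text as "context" is inferred from the sample output above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed tag/attribute pairs; the real tool may cover more.
LINK_ATTRS = {"a": "href", "img": "src", "script": "src", "link": "href"}


class LinkCollector(HTMLParser):
    """Collect link records shaped like the JSON output shown above."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.records = []
        self._open_anchor = None  # record of the <a> element being read

    def handle_starttag(self, tag, attrs):
        attr = LINK_ATTRS.get(tag)
        value = dict(attrs).get(attr) if attr else None
        if not value:
            return
        record = {
            "url": urljoin(self.base_url, value),  # resolve relative links
            "tag": tag,
            "attr": attr,
            "context": "",
        }
        self.records.append(record)
        if tag == "a":
            self._open_anchor = record

    def handle_data(self, data):
        # Text inside an open <a> becomes that link's context.
        if self._open_anchor is not None:
            self._open_anchor["context"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open_anchor is not None:
            self._open_anchor["context"] = self._open_anchor["context"].strip()
            self._open_anchor = None


def extract_links(html, base_url):
    """Parse an HTML string and return a list of link records."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.records
```

Calling extract_links('<a href="/about">About us</a>', "https://example.com/") yields one record whose url is https://example.com/about and whose context is "About us".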

Download all images and PDFs

$ webharvest download https://target.com --types images,pdf --output ./loot/
[webharvest] Crawling https://target.com (depth=1) for: .jpg, .jpeg, .pdf, .png ...
[webharvest] Found 14 files to download
[download] https://target.com/report.pdf
           -> ./loot/pdf/report.pdf (2.4 MB) [a1b2c3d4e5f6]
[download] https://target.com/img/logo.png
           -> ./loot/images/logo.png (45.2 KB) [f6e5d4c3b2a1]
...

Recursive crawl with depth

$ webharvest crawl https://target.com --depth 3 --types .pdf,.docx --dry-run
[webharvest] Crawling https://target.com (depth=3)
[crawl] https://target.com
[crawl]   https://target.com/docs/
[crawl]     https://target.com/docs/guide/
[dry-run] Would download: https://target.com/docs/manual.pdf
[dry-run] Would download: https://target.com/docs/guide/setup.docx
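
The depth limit can be pictured as a breadth-first walk that stops expanding pages once the limit is reached. The sketch below runs over an in-memory link graph so it needs no network access. It assumes the start page counts as depth 1, which is consistent with the three levels shown in the depth=3 output above but is not confirmed by the documentation. The crawl function and the SITE data are hypothetical.

```python
from collections import deque


def crawl(start, get_links, max_depth):
    """Breadth-first walk; returns [(url, depth)] in visit order.

    Assumption: the start URL is depth 1, and pages at max_depth are
    visited but their links are not followed.
    """
    seen = {start}
    queue = deque([(start, 1)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:  # avoid revisiting pages and looping on cycles
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site structure used in place of real HTTP requests.
SITE = {
    "https://target.com": ["https://target.com/docs/"],
    "https://target.com/docs/": ["https://target.com/docs/guide/", "https://target.com"],
    "https://target.com/docs/guide/": ["https://target.com/docs/guide/deep/"],
}

if __name__ == "__main__":
    for url, depth in crawl("https://target.com", lambda u: SITE.get(u, []), 3):
        print("  " * (depth - 1) + url)
```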

Full site report

$ webharvest report https://target.com --format markdown -o report.md
[webharvest] Generating report for https://target.com (depth=2)
[webharvest] Writing report to report.md

With authentication and rate limiting

$ webharvest download https://intranet.corp.local \
    --auth admin:secret \
    --types office,pdf \
    --delay 1.0 \
    --ignore-robots \
    --no-verify-ssl

Install Layout

# Standard layout for all cli.johlem.net tools
~/.local/lib/webharvest/                       # Tool source files
~/.local/bin/webharvest                        # Executable wrapper
~/.local/log/cli.johlem.net/webharvest_*.log   # Install/uninstall logs

After install, run webharvest --help (requires ~/.local/bin in your PATH).
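
If webharvest is not found after installation, the PATH requirement is the first thing to check. A small sketch of that check follows, using only the standard library; dir_on_path is a hypothetical helper name.

```python
import os
import shutil


def dir_on_path(directory, path_value):
    """Return True if directory is one of the entries in a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


if __name__ == "__main__":
    on_path = dir_on_path("~/.local/bin", os.environ.get("PATH", ""))
    print("~/.local/bin on PATH:", on_path)
    # shutil.which returns the resolved executable path, or None if not found.
    print("webharvest resolves to:", shutil.which("webharvest"))
```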

Integrity

Verify your download against these SHA256 checksums:

File    SHA256
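
A checksum comparison can be done with the standard library's hashlib. The sketch below hashes a file in chunks and compares the result to an expected hex digest. The function names are hypothetical, and the expected value must come from the published checksum list.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA256 against an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of the size of the downloaded archive.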

This tool is provided as-is with no warranty. Use at your own risk. Always review scripts before running them. Not responsible for any damage or data loss. Intended for authorized security testing, research, and educational purposes only.