lexharvest

Ranked wordlist generator with CeWL-style mutations + pentest boost

v1.0.1

Linux

Quick Start

Install via jcli

jcli install lexharvest

Build a wordlist

# TF-IDF scored wordlist from a directory tree
lexharvest docs/ --top 200 -o wordlist.txt

# BM25 + JSON
lexharvest *.pdf -m bm25 -f json -o report.json

# Scrape a staff page over Tor and apply CeWL-style mutations
lexharvest --tor --mutate https://example.com/about --top 50 -o spray.txt

# Pentest-boosted single-doc with stats
lexharvest --pentest-boost --stats notes.txt --top 30

# Stdin
cat *.txt | lexharvest - -m yake --top 20

# Hashcat wordlist + precomputed sha256
lexharvest -f hashcat docs/ -o wl.txt
lexharvest hash-keywords --in wl.txt --algo sha256 -o wl.sha256

What it does

Generates ranked dictionaries and wordlists from target content (staff pages, leak dumps, docs, blog posts). Every scoring option of the Python original is preserved: TF-IDF / BM25 / YAKE! scoring, FP-Growth phrase mining, Porter stemming, and stopword filtering. The Rust port adds:

CeWL-style mutation engine via --mutate: case variants, leetspeak (a→@, e→3, i→1, o→0, s→$), digit append (0–9 + common two-digit pairs), ±2 years around the current year, and symbol suffixes (!, @, #, 123, …). ~30 mutations per base token, deduped.
Pentest seed boost via --pentest-boost: ~80 admin / auth / cred / network tokens get a score multiplier so security-relevant words float to the top regardless of raw frequency.
CommonMark Markdown parsing: the Python original treated .md as plain text; pulldown-cmark strips the syntax and leaves clean prose.
Concurrent URL fetching on Tokio with --workers: the Python build was constrained by the GIL; Rust async overlaps the I/O.
Tor SOCKS5 routing with --tor for URL fetches.
Hashcat-compatible output via -f hashcat — one word per line, ready to pipe into hashcat -a 0.
Markdown report output for engagement deliverables.
hash-keywords subcommand emits md5 / sha256 of a wordlist (one hash:word per line) for rainbow-table matching or hashcat --show workflows.
Stdin input (-) for pipelining without temp files.
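
As a rough illustration of the --mutate expansion described above, here is a minimal Python sketch. It is not the tool's Rust code: the function name `mutate`, the contents of `DIGIT_PAIRS` and `SYMBOLS`, and the choice to attach suffixes only to the unmodified token are assumptions made for the example. Only the leetspeak map, the 0–9 digits, and the ±2 year range come from the description.

```python
from datetime import date

# Leetspeak substitutions as listed in the feature description.
LEET = str.maketrans({"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"})

# Hypothetical sets: the README names only a few examples of each.
DIGIT_PAIRS = ["01", "12", "23", "69", "99"]
SYMBOLS = ["!", "@", "#", "123"]


def mutate(token, year=None):
    """Return deduplicated mutations of one base token, in a stable order."""
    year = date.today().year if year is None else year

    # Case variants plus one leetspeak form of the lowercased token.
    variants = [token, token.lower(), token.upper(), token.capitalize()]
    variants.append(token.lower().translate(LEET))

    # Suffixes: single digits, assumed digit pairs, current year +/- 2, symbols.
    suffixes = [str(d) for d in range(10)] + DIGIT_PAIRS
    suffixes += [str(y) for y in range(year - 2, year + 3)]
    suffixes += SYMBOLS

    # Assumption: suffixes are attached to the unmodified token only.
    candidates = variants + [token + s for s in suffixes]

    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for word in mutate("admin", year=2024):
        print(word)
```

Passing `year` explicitly keeps the output reproducible. With the assumed suffix sets, a lowercase token such as `admin` expands to 28 candidates, which is in line with the "~30 per base token" figure.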

Scoring

All three methods are ported byte-for-byte from the Python implementation:

TF-IDF: term frequency over the corpus × log((N+1)/(df+1)) + 1, where N is the number of documents and df is the number of documents containing the term. Output column: score.
BM25: classic Robertson/Spärck Jones with k1=1.5, b=0.75. Term-frequency saturation is built in.
YAKE!: single-document unsupervised extraction. Lower score is better.
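
To make the TF-IDF weighting and the BM25 saturation concrete, here is a small Python sketch. It rests on two assumptions: the TF-IDF formula is read as tf × (log((N+1)/(df+1)) + 1) with a natural logarithm, and term frequency is the raw count across the whole corpus. The BM25 helper shows only the saturating term-frequency component, because the IDF variant is not stated here. The function names are illustrative and not part of lexharvest.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Score terms over a corpus. `docs` is a list of token lists."""
    n_docs = len(docs)
    # Assumed reading: term frequency is the raw count over the whole corpus.
    tf = Counter(tok for doc in docs for tok in doc)
    # Document frequency: number of documents that contain the term.
    df = Counter(tok for doc in docs for tok in set(doc))
    return {
        term: tf[term] * (math.log((n_docs + 1) / (df[term] + 1)) + 1.0)
        for term in tf
    }


def bm25_tf(tf, doc_len, avg_len, k1=1.5, b=0.75):
    """Saturating term-frequency component of BM25 (IDF factor omitted)."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


if __name__ == "__main__":
    corpus = [["admin", "login", "admin"], ["login", "vpn"]]
    ranked = sorted(tfidf_scores(corpus).items(), key=lambda kv: -kv[1])
    for term, score in ranked:
        print(f"{term}\t{score:.4f}")
```

In this reading a term that occurs in every document still gets an IDF factor of 1 rather than 0, so it is down-weighted but not dropped. The BM25 component never exceeds k1 + 1 however often a term repeats, which is the saturation mentioned above.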

Phrase mining: --phrases turns on FP-Growth over a sliding window (default size 4) with a minimum support of 2. Phrases are emitted after scoring so they don't pollute the term list.
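
The windowing and support threshold can be illustrated with a simplified Python sketch. It does not implement FP-Growth: it counts two-word itemsets by brute force, which gives the same frequent pairs FP-Growth would find over the same transactions but does not scale to longer patterns. Treating each window position as one transaction is an assumption about how the tool builds its input, and `frequent_pairs` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(tokens, window=4, min_support=2):
    """Count word pairs that co-occur inside a sliding window.

    Brute-force stand-in for FP-Growth, limited to itemsets of size 2.
    """
    support = Counter()
    # Assumption: every window position forms one transaction.
    last_start = max(len(tokens) - window + 1, 1)
    for start in range(last_start):
        items = sorted(set(tokens[start:start + window]))
        support.update(combinations(items, 2))
    # Keep only pairs that reach the minimum support.
    return {pair: n for pair, n in support.items() if n >= min_support}


if __name__ == "__main__":
    text = "reset admin password then reset admin password now"
    pairs = frequent_pairs(text.split())
    for pair, n in sorted(pairs.items(), key=lambda kv: -kv[1]):
        print(n, " ".join(pair))
```

Because windows overlap, a pair of adjacent words is counted once per window that contains both, so support here measures windowed co-occurrence rather than the number of distinct occurrences.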

CLI

| Flag | What it does |
| --- | --- |
| `-m, --method <tfidf\|bm25\|yake>` | Scoring method (default: tfidf) |
| `--top <N>` | Limit to top N terms (0 = all) |
| `--pentest-boost` | Boost ~80 security-relevant tokens |
| `--mutate` | Apply CeWL-style mutation expansion |
| `--stem` | Porter stemming |
| `--phrases` | FP-Growth multi-word patterns |
| `--min-length` / `--max-length` | Token length bounds (default: 2..64) |
| `--no-numbers` / `--keep-stopwords` | Filter toggles |
| `-f, --format <txt\|csv\|json\|markdown\|hashcat>` | Output format |
| `-o, --output <FILE>` | Output path (default: stdout) |
| `-w, --workers <N>` | Concurrent URL workers (default: 4) |
| `--tor` | SOCKS5 127.0.0.1:9050 for URL fetches |
| `--timeout <SECS>` | Per-URL timeout (default: 30) |
| `--stats` | Pretty-print top-20 to stderr |
| `-q, --quiet` | Suppress progress messages |
| `hash-keywords --in <F> --algo md5\|sha256` | Hash a wordlist (one hash:word per line) |
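
For reference, the hash:word layout produced by hash-keywords can be approximated in a few lines of Python. This is a sketch of the output format only. It assumes lowercase hex digests, UTF-8 encoding, and hashing of the word without its trailing newline; none of these is confirmed here, and `hash_lines` is a hypothetical helper.

```python
import hashlib


def hash_lines(words, algo="sha256"):
    """Yield one 'hash:word' line per input word."""
    if algo not in ("md5", "sha256"):
        raise ValueError(f"unsupported algorithm: {algo}")
    for word in words:
        # Assumption: hash the UTF-8 bytes of the word, newline excluded.
        digest = hashlib.new(algo, word.encode("utf-8")).hexdigest()
        yield f"{digest}:{word}"


if __name__ == "__main__":
    for line in hash_lines(["password", "admin2024!"], algo="md5"):
        print(line)
```

Comparing a handful of lines from this sketch against the tool's own output is a quick way to check the encoding and newline handling before relying on a precomputed list.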