lexharvest
Ranked wordlist generator with CeWL-style mutations + pentest boost
v1.0.1
Linux
Quick Start
Install via jcli
jcli install lexharvest
Build a wordlist
# TF-IDF scored wordlist from a directory tree
lexharvest docs/ --top 200 -o wordlist.txt
# BM25 + JSON
lexharvest *.pdf -m bm25 -f json -o report.json
# Scrape a staff page over Tor and apply CeWL-style mutations
lexharvest --tor --mutate https://example.com/about --top 50 -o spray.txt
# Pentest-boosted single-doc with stats
lexharvest --pentest-boost --stats notes.txt --top 30
# Stdin
cat *.txt | lexharvest - -m yake --top 20
# Hashcat wordlist + precomputed sha256
lexharvest -f hashcat docs/ -o wl.txt
lexharvest hash-keywords --in wl.txt --algo sha256 -o wl.sha256
What it does
Generates ranked dictionaries and wordlists from target content (staff pages, leak dumps, docs, blog posts). Every scoring option from the Python original is preserved (TF-IDF / BM25 / YAKE!, FP-Growth phrase mining, Porter stemming, stopword filtering). The Rust port adds:
- CeWL-style mutation engine via `--mutate`: case variants, leetspeak (a→@, e→3, i→1, o→0, s→$), digit append (0–9 + common two-digit pairs), ±2 years around the current year, and symbol suffixes (!, @, #, 123, …). ~30 mutations per base token, deduped.
- Pentest seed boost via `--pentest-boost`: ~80 admin / auth / cred / network tokens get a score multiplier so security-relevant words float to the top regardless of raw frequency.
- Markdown CommonMark parser: the Python original treated `.md` as plain text; `pulldown-cmark` strips syntax and gives clean prose.
- Tokio concurrent URL fetching with `--workers`: the Python build was constrained by the GIL; Rust async overlaps the I/O.
- Tor SOCKS5 routing with `--tor` for URL fetches.
- Hashcat-compatible output via `-f hashcat`: one word per line, ready to pipe into `hashcat -a 0`.
- Markdown report output for engagement deliverables.
- `hash-keywords` subcommand emits md5 / sha256 of a wordlist (one `hash:word` per line) for rainbow-table matching or hashcat `--show` workflows.
- Stdin input (`-`) for pipelining without temp files.
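
To make the `--mutate` and `--pentest-boost` behaviour described above more concrete, here is a minimal Python sketch. It is not the tool's Rust code. The suffix lists, the seed set, the multiplier, and the function names `mutate` and `boost` are all assumptions chosen for illustration, and the sketch combines every base form with every suffix, so it produces more variants than the ~30 per token quoted above.

```python
from datetime import date

# Leetspeak map taken from the substitutions listed in the feature description.
LEET = str.maketrans({"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"})

# Hypothetical suffix sets; the tool's real lists are not documented here.
SYMBOL_SUFFIXES = ["!", "@", "#", "123"]
DIGIT_SUFFIXES = [str(d) for d in range(10)]

# Hypothetical seed tokens; the tool is described as shipping about 80.
PENTEST_SEEDS = {"admin", "password", "login", "vpn", "root"}


def mutate(token, year=None):
    """Return deduplicated mutations of one token, unsuffixed base first."""
    year = year if year is not None else date.today().year
    years = [str(y) for y in range(year - 2, year + 3)]  # +/- 2 years

    # Case variants plus a leetspeak form of the lowercase token.
    lower = token.lower()
    bases = [lower, lower.capitalize(), lower.upper(), lower.translate(LEET)]

    out = []
    for base in bases:
        out.append(base)
        for suffix in DIGIT_SUFFIXES + years + SYMBOL_SUFFIXES:
            out.append(base + suffix)

    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(out))


def boost(scores, multiplier=2.0):
    """Multiply seed-token scores, then rank by score, highest first.

    Assumes a higher-is-better score such as TF-IDF or BM25.
    """
    boosted = {
        term: score * multiplier if term.lower() in PENTEST_SEEDS else score
        for term, score in scores.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)
```

With these assumed lists, `mutate("admin", year=2024)` includes variants such as `Admin2024` and `@dm1n!`, and `boost({"coffee": 3.0, "vpn": 2.0})` ranks `vpn` first even though its raw score is lower.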
Scoring
All three methods ported byte-for-byte from the Python implementation:
- TF-IDF: term frequency over corpus × log((N+1)/(df+1)) + 1. Output column `score`.
- BM25: classic Robertson/Spärck-Jones with k1=1.5, b=0.75. Term-frequency saturation built in.
- YAKE!: single-document unsupervised extraction. Lower score is better.
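
As a rough illustration of the first two formulas, the Python sketch below scores terms over a list of tokenized documents. Several details are assumptions rather than documented behaviour: the `+ 1` is read as part of the IDF factor (the smoothed form used by scikit-learn), BM25 uses the non-negative IDF variant, and per-document BM25 contributions are summed to get one score per term. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Corpus-wide term count times a smoothed IDF: log((N+1)/(df+1)) + 1."""
    n = len(docs)
    tf = Counter(t for doc in docs for t in doc)       # total occurrences
    df = Counter(t for doc in docs for t in set(doc))  # documents containing t
    return {t: tf[t] * (math.log((n + 1) / (df[t] + 1)) + 1) for t in tf}


def bm25(docs, k1=1.5, b=0.75):
    """Sum each term's BM25 contribution over all documents."""
    if not docs:
        return {}
    n = len(docs)
    # Fall back to 1.0 so a corpus of empty documents does not divide by zero.
    avgdl = sum(len(doc) for doc in docs) / n or 1.0
    df = Counter(t for doc in docs for t in set(doc))
    scores = Counter()
    for doc in docs:
        # Length normalisation: longer-than-average documents are penalised.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        for term, freq in Counter(doc).items():
            # Non-negative IDF variant; an assumption, not confirmed by the tool.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # freq / (freq + norm) saturates as freq grows.
            scores[term] += idf * freq * (k1 + 1) / (freq + norm)
    return dict(scores)
```

For example, with `docs = [["vpn", "login", "vpn"], ["login", "portal"]]`, both functions rank `vpn` highest. They disagree on `login` versus `portal`, because BM25's IDF penalises a term that appears in every document more heavily than the smoothed TF-IDF form does.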
Phrase mining: `--phrases` turns on FP-Growth over a sliding window (default size 4) with minimum support 2. Phrases are emitted after term scoring, so they don't pollute the term list.
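
The effect of the window size and minimum support can be sketched without a full FP-Growth implementation. The Python example below is a deliberate simplification: it counts contiguous n-grams of 2 up to `window` tokens and keeps those seen at least `min_support` times. It is a stand-in for illustration, not FP-Growth and not the tool's algorithm, and `frequent_phrases` is a hypothetical name.

```python
from collections import Counter


def frequent_phrases(tokens, window=4, min_support=2):
    """Count contiguous n-grams (2..window tokens) and keep the frequent ones."""
    counts = Counter()
    for n in range(2, window + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1

    kept = [(phrase, c) for phrase, c in counts.items() if c >= min_support]
    # Most frequent first; among equal counts, longer phrases come first.
    return sorted(kept, key=lambda pc: (-pc[1], -len(pc[0].split())))
```

On the token stream `reset your password then reset your password now`, this keeps `reset your password`, `reset your`, and `your password`, each with support 2, and drops every phrase seen only once.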
CLI
| Flag | What it does |
|---|---|
| `-m, --method <tfidf\|bm25\|yake>` | Scoring method (default: tfidf) |
| `--top <N>` | Limit to top N terms (0 = all) |
| `--pentest-boost` | Boost ~80 security-relevant tokens |
| `--mutate` | Apply CeWL-style mutation expansion |
| `--stem` | Porter stemming |
| `--phrases` | FP-Growth multi-word patterns |
| `--min-length` / `--max-length` | Token length bounds (2..64 default) |
| `--no-numbers` / `--keep-stopwords` | Filter toggles |
| `-f, --format <txt\|csv\|json\|markdown\|hashcat>` | Output format |
| `-o, --output <FILE>` | Output path (default: stdout) |
| `-w, --workers <N>` | Concurrent URL workers (default: 4) |
| `--tor` | SOCKS5 127.0.0.1:9050 for URL fetches |
| `--timeout <SECS>` | Per-URL timeout (default: 30) |
| `--stats` | Pretty-print top-20 to stderr |
| `-q, --quiet` | Suppress progress messages |
| `hash-keywords --in <F> --algo md5\|sha256` | Hash a wordlist (one `hash:word` per line) |
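
The `hash:word` format produced by `hash-keywords` is straightforward to reproduce or verify with Python's standard `hashlib`. The sketch below assumes each word is hashed as UTF-8 with no trailing newline and that the digest is lowercase hex. Those details are not stated above, so check them against real output before relying on them. `hash_keywords` is a hypothetical helper name.

```python
import hashlib


def hash_keywords(words, algo="sha256"):
    """Return one 'hash:word' line per input word."""
    if algo not in ("md5", "sha256"):
        raise ValueError(f"unsupported algorithm: {algo}")
    lines = []
    for word in words:
        # Assumption: UTF-8 bytes of the word, without a trailing newline.
        digest = hashlib.new(algo, word.encode("utf-8")).hexdigest()
        lines.append(f"{digest}:{word}")
    return lines


if __name__ == "__main__":
    for line in hash_keywords(["password", "admin"], algo="md5"):
        print(line)
```

Splitting each line on the first `:` recovers the digest and the candidate word, which is enough to look up a captured hash in the precomputed list.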