lexharvest

Ranked wordlist generator with CeWL-style mutations + pentest boost

v1.0.1
Linux

Quick Start

Install via jcli

jcli install lexharvest

Build a wordlist

# TF-IDF scored wordlist from a directory tree
lexharvest docs/ --top 200 -o wordlist.txt

# BM25 + JSON
lexharvest *.pdf -m bm25 -f json -o report.json

# Scrape a staff page over Tor and apply CeWL-style mutations
lexharvest --tor --mutate https://example.com/about --top 50 -o spray.txt

# Pentest-boosted single-doc with stats
lexharvest --pentest-boost --stats notes.txt --top 30

# Stdin
cat *.txt | lexharvest - -m yake --top 20

# Hashcat wordlist + precomputed sha256
lexharvest -f hashcat docs/ -o wl.txt
lexharvest hash-keywords --in wl.txt --algo sha256 -o wl.sha256

What it does

Generates ranked dictionaries and wordlists from target content (staff pages, leak dumps, docs, blog posts). Every scoring knob from the Python original is preserved (TF-IDF / BM25 / YAKE!, FP-Growth phrase mining, Porter stemming, stopword filtering). The Rust port adds:

Scoring

All three methods ported byte-for-byte from the Python implementation:

Phrase mining: --phrases turns on FP-Growth over a sliding window (default size 4) with a minimum support of 2. Phrases are emitted after scoring so they don't pollute the term list.
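
The windowed support counting behind phrase mining can be sketched roughly as below. This is not lexharvest's code and it is not FP-Growth: it is a brute-force counter that treats each sliding window as one transaction and counts unordered token pairs, which is enough to show what "minimum support 2" means. The function name `mine_pairs` and the sample tokens are hypothetical; only the window size of 4 and the support threshold of 2 come from the defaults stated above.

```python
from collections import Counter
from itertools import combinations


def mine_pairs(tokens, window=4, min_support=2):
    """Count unordered token pairs per sliding window and keep the frequent ones.

    Hypothetical helper for illustration; a real FP-Growth implementation
    builds a prefix tree and also finds itemsets larger than pairs.
    """
    support = Counter()
    # Short inputs still yield one (partial) window.
    n_windows = max(len(tokens) - window + 1, 1)
    for start in range(n_windows):
        # Deduplicate and sort so each pair has one canonical form per window.
        transaction = sorted(set(tokens[start:start + window]))
        for pair in combinations(transaction, 2):
            support[pair] += 1
    # Keep only pairs seen in at least `min_support` windows.
    return {pair: n for pair, n in support.items() if n >= min_support}


tokens = "acme vpn portal login acme vpn portal reset".split()
frequent = mine_pairs(tokens)
for pair, n in sorted(frequent.items(), key=lambda item: (-item[1], item[0])):
    print(n, " ".join(pair))
```

Pairs involving "reset" appear in only one window here, so they fall below the support threshold and are dropped.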

CLI

Flag                                          What it does
-m, --method <tfidf|bm25|yake>                Scoring method (default: tfidf)
--top <N>                                     Limit to top N terms (0 = all)
--pentest-boost                               Boost ~80 security-relevant tokens
--mutate                                      Apply CeWL-style mutation expansion
--stem                                        Porter stemming
--phrases                                     FP-Growth multi-word patterns
--min-length / --max-length                   Token length bounds (2..64 default)
--no-numbers / --keep-stopwords               Filter toggles
-f, --format <txt|csv|json|markdown|hashcat>  Output format
-o, --output <FILE>                           Output path (default: stdout)
-w, --workers <N>                             Concurrent URL workers (default: 4)
--tor                                         SOCKS5 127.0.0.1:9050 for URL fetches
--timeout <SECS>                              Per-URL timeout (default: 30)
--stats                                       Pretty-print top-20 to stderr
-q, --quiet                                   Suppress progress messages
hash-keywords --in <F> --algo md5|sha256      Hash a wordlist (one hash:word per line)
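
For readers who want to check or post-process hash-keywords output, the "one hash:word per line" layout can be reproduced with a few lines of Python. This is a sketch under assumptions, not the tool's implementation: the helper name `hash_wordlist` is hypothetical, and UTF-8 encoding, lowercase hex digests, whitespace stripping, and skipping blank lines are all guesses about behavior that the table above does not specify.

```python
import hashlib


def hash_wordlist(words, algo="sha256"):
    """Return one 'hash:word' string per non-empty word.

    Hypothetical helper; encoding and blank-line handling are assumptions.
    """
    lines = []
    for word in words:
        word = word.strip()  # assumption: surrounding whitespace is ignored
        if not word:
            continue  # assumption: blank entries are skipped
        digest = hashlib.new(algo, word.encode("utf-8")).hexdigest()
        lines.append(f"{digest}:{word}")
    return lines


lines = hash_wordlist(["password", "", "admin\n"])
for line in lines:
    print(line)
```

Comparing a few lines of this output against a real wl.sha256 file is a quick way to confirm which encoding and normalization the tool actually applies.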