TextHarvest
Batch text extraction toolkit
TextHarvest is a CLI toolkit for batch text extraction from documents and source code on Linux and macOS. I built it to convert documentation into formats suitable for LLM consumption across my other projects.
Features
- Direct text extraction from PDFs via Poppler
- OCR support for scanned documents using Tesseract
- Multi-language OCR capabilities
- Consolidated source code listings from project directories
- Parallel processing with configurable job limits
- Interactive file selection mode
- Dry-run preview before execution
- Progress tracking and real-time status updates
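As a rough illustration of how the Poppler extraction, job limit, and dry-run features could fit together, here is a minimal shell sketch. The function name `extract_batch` and its positional arguments are assumptions made for illustration, not TextHarvest's actual interface; only `find`, `xargs`, and `pdftotext` (from Poppler) are real tools.

```shell
# Hypothetical helper, not TextHarvest's actual code.
# Usage: extract_batch DIR [JOBS] [DRY_RUN]
extract_batch() {
  local dir="$1"
  local jobs="${2:-4}"
  local dry_run="${3:-0}"

  # Nothing to do if the tree holds no PDFs.
  if [ -z "$(find "$dir" -type f -name '*.pdf' -print | head -n 1)" ]; then
    echo "no PDFs found under $dir" >&2
    return 0
  fi

  if [ "$dry_run" -eq 1 ]; then
    # Preview only: print each command instead of running it.
    find "$dir" -type f -name '*.pdf' \
      -exec printf 'would run: pdftotext -layout %s\n' {} +
    return 0
  fi

  # NUL-delimited paths survive spaces; -P caps concurrent processes.
  # Given only a PDF path, pdftotext writes FILE.txt next to FILE.pdf.
  find "$dir" -type f -name '*.pdf' -print0 |
    xargs -0 -n 1 -P "$jobs" pdftotext -layout
}
```

Under these assumptions, `extract_batch ./docs 4 1` would only list the planned commands, while `extract_batch ./docs 4` would run up to four `pdftotext` processes at a time.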
Tech Stack
Bash, Poppler, Tesseract OCR, ocrmypdf
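Since the toolkit relies on external programs, a quick check that they are installed can save a failed run. The sketch below is an assumption about how such a check might look; `missing_tools` is a hypothetical helper, and the binary names are the ones these packages normally install (`pdftotext` ships with Poppler).

```shell
# Hypothetical helper: prints the name of each given tool not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the binaries behind the stack listed above.
missing=$(missing_tools pdftotext tesseract ocrmypdf)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names join onto one line.
  echo "missing tools:" $missing >&2
fi
```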