TextHarvest
Batch text extraction toolkit
TextHarvest is a command-line toolkit that automates batch text extraction from documents and source code on Linux and macOS. It brings PDF extraction, OCR, and source code listing together under a single command-line interface.
Features
- Direct text extraction from PDFs via Poppler
- OCR support for scanned documents using Tesseract
- Multi-language OCR capabilities
- Consolidated source code listings generated from project directories
- Parallel processing with configurable job limits
- Interactive file selection mode
- Dry-run preview before execution
- Progress tracking and real-time status updates
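To make the features above concrete, here is a minimal sketch of how direct extraction, OCR fallback, a job limit, and a dry-run preview could fit together in Bash. It is not TextHarvest's actual code or interface: the function names (`run`, `needs_ocr`, `ocr_pdf`, `list_pdfs`, `harvest_dir`), the variables (`DRY_RUN`, `JOBS`, `OCR_LANG`), and the `.ocr.pdf` output naming are all assumptions made for illustration. Only the `pdftotext`, `ocrmypdf`, `find`, and `xargs` options are real.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: every function and variable name here is
# hypothetical and not part of TextHarvest's real interface.

DRY_RUN="${DRY_RUN:-0}"     # 1 = print commands instead of running them
JOBS="${JOBS:-4}"           # maximum number of parallel pdftotext processes
OCR_LANG="${OCR_LANG:-eng}" # Tesseract language codes, e.g. "eng+deu"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Succeed when a text file has no visible characters (or is missing),
# which suggests a scanned PDF without a text layer.
needs_ocr() {
  ! grep -q '[^[:space:]]' "$1" 2>/dev/null
}

# OCR one PDF with ocrmypdf and write the recognised text to a sidecar file.
ocr_pdf() {
  run ocrmypdf -l "$OCR_LANG" --sidecar "${1%.pdf}.txt" "$1" "${1%.pdf}.ocr.pdf"
}

# List input PDFs, skipping OCR outputs from earlier runs.
# File names containing newlines are not handled in this sketch.
list_pdfs() {
  find "$1" -type f -name '*.pdf' ! -name '*.ocr.pdf' | sort
}

# Extract text from every PDF under a directory.
harvest_dir() {
  dir="$1"
  if [ "$DRY_RUN" = "1" ]; then
    # Preview the first pass without touching any file.
    list_pdfs "$dir" | while IFS= read -r pdf; do
      run pdftotext -layout "$pdf"
    done
    return 0
  fi
  # Pass 1: direct extraction, up to $JOBS files at a time.
  # With a single file argument, pdftotext writes file.txt next to file.pdf.
  # With GNU xargs, adding -r avoids a run when no PDFs are found.
  list_pdfs "$dir" | tr '\n' '\0' |
    xargs -0 -P "$JOBS" -n 1 pdftotext -layout
  # Pass 2: OCR any PDF whose extracted text came out empty (sequential, for brevity).
  list_pdfs "$dir" | while IFS= read -r pdf; do
    if needs_ocr "${pdf%.pdf}.txt"; then
      ocr_pdf "$pdf"
    fi
  done
}
```

After sourcing the sketch, something like `DRY_RUN=1 harvest_dir ./docs` would print the planned `pdftotext` commands, and `JOBS=8 OCR_LANG=eng+deu harvest_dir ./docs` would run both passes. Splitting the work into a parallel extraction pass and a separate OCR pass keeps the example short; a real tool might interleave them.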
Tech Stack
Bash, Poppler, Tesseract OCR, ocrmypdf
Use Cases
TextHarvest is particularly useful for AI training data preparation, research workflows, archive digitisation, CI/CD integration, and content indexing tasks. I built it to support my other projects that require converting documentation into formats suitable for LLM consumption.
Future Ideas
Add support for Microsoft Office documents (Word, Excel, PowerPoint) and other common formats to make TextHarvest more versatile for enterprise document processing workflows.