TextHarvest

TextHarvest is a modernised CLI toolkit for automating document and source code processing on Linux and macOS. It streamlines batch text extraction workflows through a unified command-line interface.

Features

Direct text extraction from PDFs via Poppler
OCR support for scanned documents using Tesseract
Multi-language OCR capabilities
Consolidated source code listings generated from project directories
Parallel processing with configurable job limits
Interactive file selection mode
Dry-run preview before execution
Progress tracking and real-time status updates
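
As a rough illustration of how a dry-run preview and a configurable job limit can fit together, here is a minimal Python sketch. It is not TextHarvest's actual implementation: the function names (build_command, extract_all) and the jobs and dry_run parameters are hypothetical, and it assumes Poppler's pdftotext binary is available on PATH when run for real.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build a Poppler `pdftotext` command that writes <name>.txt into out_dir."""
    # -layout asks pdftotext to preserve the physical page layout where it can.
    return ["pdftotext", "-layout", str(pdf), str(out_dir / (pdf.stem + ".txt"))]


def extract_all(pdfs, out_dir, jobs=4, dry_run=False):
    """Plan one command per PDF; run them in parallel unless dry_run is set.

    Hypothetical helper: always returns the list of planned commands.
    """
    out_dir = Path(out_dir)
    commands = [build_command(Path(p), out_dir) for p in pdfs]
    if dry_run:
        # Dry run: report what would be executed without touching the disk.
        return commands
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    # max_workers caps how many pdftotext processes run at the same time.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits for every job and re-raises the first failure, if any.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands


if __name__ == "__main__":
    # Preview only: print the commands that a real run would execute.
    for cmd in extract_all(["report.pdf", "notes.pdf"], "out", jobs=2, dry_run=True):
        print(" ".join(cmd))
```

Threads are enough here because each worker only waits on an external process; the job limit bounds how many of those processes exist at once.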

Use Cases

TextHarvest is particularly useful for AI training data preparation, research workflows, archive digitisation, CI/CD integration, and content indexing tasks. I built it to support my other projects that require converting documentation into formats suitable for LLM consumption.

Future Ideas

Add support for Microsoft Office documents (Word, Excel, PowerPoint) and other common formats to make TextHarvest more versatile for enterprise document processing workflows.
