TextHarvest is a modernised CLI toolkit for automating document and source code processing on Linux and macOS. It streamlines batch text extraction workflows through a unified command-line interface.

Features

  • Direct text extraction from PDFs via Poppler
  • OCR support for scanned documents using Tesseract
  • Multi-language OCR capabilities
  • Consolidated source code listings generated from project directories
  • Parallel processing with configurable job limits
  • Interactive file selection mode
  • Dry-run preview before execution
  • Progress tracking and real-time status updates
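
As a rough illustration of how several of the features above could fit together, here is a minimal Bash sketch. It is not TextHarvest's actual code or command-line interface: the function names (run, extract_pdf, ocr_pdf, batch_extract, list_sources), the DRY_RUN variable, and the output naming scheme are all hypothetical. It assumes Poppler's pdftotext and ocrmypdf (with the Tesseract language packs you need) are installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; names and layout are assumptions, not TextHarvest's API.

# Dry-run helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Direct extraction from a text-based PDF with Poppler's pdftotext.
extract_pdf() {
  local pdf="$1"
  run pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
}

# OCR for a scanned PDF. ocrmypdf drives Tesseract; several languages can be
# combined with "+", and --sidecar writes the recognised text to a .txt file.
ocr_pdf() {
  local pdf="$1" langs="${2:-eng}"
  run ocrmypdf -l "$langs" --sidecar "${pdf%.pdf}.txt" "$pdf" "${pdf%.pdf}.ocr.pdf"
}

# Parallel extraction over a directory, at most $jobs processes at a time.
# With a single argument, pdftotext writes file.txt next to file.pdf.
batch_extract() {
  local dir="$1" jobs="${2:-4}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    find "$dir" -type f -name '*.pdf' -exec printf 'pdftotext -layout %s\n' {} \;
  else
    find "$dir" -type f -name '*.pdf' -print0 |
      xargs -0 -n 1 -P "$jobs" pdftotext -layout
  fi
}

# Consolidated source listing: every matching file, preceded by a header line.
list_sources() {
  local dir="$1" ext="${2:-sh}"
  find "$dir" -type f -name "*.${ext}" | sort | while IFS= read -r f; do
    printf '===== %s =====\n' "$f"
    cat "$f"
    printf '\n'
  done
}
```

After sourcing the sketch, setting DRY_RUN=1 and calling ocr_pdf scan.pdf eng+deu would print the ocrmypdf command instead of running it, which is one simple way to preview a batch before committing to it.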

Tech Stack

Bash, Poppler, Tesseract OCR, ocrmypdf

Use Cases

TextHarvest is particularly useful for AI training data preparation, research workflows, archive digitisation, CI/CD integration, and content indexing tasks. I built it to support my other projects that require converting documentation into formats suitable for LLM consumption.

Future Ideas

Add support for Microsoft Office documents (Word, Excel, PowerPoint) and other common formats to make TextHarvest more versatile for enterprise document processing workflows.