TextHarvest is a CLI toolkit for batch text extraction from documents and source code on Linux and macOS. I built it to convert documentation into formats suitable for LLM consumption across my other projects.

Features

  • Direct text extraction from PDFs via Poppler
  • OCR support for scanned documents using Tesseract
  • Multi-language OCR capabilities
  • Consolidated source code listings from project directories
  • Parallel processing with configurable job limits
  • Interactive file selection mode
  • Dry-run preview before execution
  • Progress tracking and real-time status updates
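
As a rough illustration of how the Poppler extraction, job limit, and dry-run features could fit together, here is a minimal sketch. It is not TextHarvest's actual code: the function names harvest and out_path, the EXTRACTOR variable, and the argument order are all hypothetical. The only external commands it relies on are xargs (with -0, -n, and -P) and Poppler's pdftotext with its -layout option.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not TextHarvest's real implementation.

# Command used to turn one PDF into one text file (assumed name).
# Called as: $EXTRACTOR <input.pdf> <output.txt>
EXTRACTOR="${EXTRACTOR:-pdftotext -layout}"

# out_path PDF OUTDIR: print the text file path for a given PDF.
out_path() {
  local pdf="$1" outdir="$2" base
  base="$(basename "$pdf")"
  printf '%s/%s.txt\n' "$outdir" "${base%.*}"
}

# harvest DRY_RUN JOBS OUTDIR PDF...
#   DRY_RUN=1 only prints the plan; DRY_RUN=0 runs the extraction
#   with at most JOBS extractor processes at a time.
harvest() {
  local dry_run="$1" jobs="$2" outdir="$3" pdf
  shift 3
  [ "$#" -gt 0 ] || return 0   # nothing to do

  if [ "$dry_run" = "1" ]; then
    for pdf in "$@"; do
      printf 'would extract: %s -> %s\n' "$pdf" "$(out_path "$pdf" "$outdir")"
    done
    return 0
  fi

  mkdir -p "$outdir"
  # Emit NUL-separated (input, output) pairs; xargs takes two
  # arguments per invocation and caps concurrency with -P.
  # $EXTRACTOR is left unquoted on purpose so its options split into words.
  for pdf in "$@"; do
    printf '%s\0%s\0' "$pdf" "$(out_path "$pdf" "$outdir")"
  done | xargs -0 -n 2 -P "$jobs" $EXTRACTOR
}
```

With these definitions loaded, harvest 1 4 out docs/*.pdf would print the planned conversions without touching the disk, and harvest 0 4 out docs/*.pdf would run up to four pdftotext processes at once. Passing NUL-separated pairs to xargs keeps file names with spaces intact.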

Tech Stack

  • Bash
  • Poppler
  • Tesseract OCR
  • ocrmypdf