TextHarvest | Matthew Deaves

TextHarvest is a CLI toolkit for batch text extraction from documents and source code on Linux and macOS. I built it to convert documentation into formats suitable for LLM consumption across my other projects.

# Features

- Direct text extraction from PDFs via Poppler
- OCR support for scanned documents using Tesseract
- Multi-language OCR capabilities
- Consolidated source code listings from project directories
- Parallel processing with configurable job limits
- Interactive file selection mode
- Dry-run preview before execution
- Progress tracking and real-time status updates
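
To make a few of these features concrete, here is a minimal Bash sketch of how direct extraction, an OCR fallback, a dry-run preview, and a job limit could fit together. It is an illustration under stated assumptions, not TextHarvest's actual code: the function names `extract_one` and `harvest_dir` and the `DRY_RUN` and `OCR_LANGS` variables are hypothetical. Only the `pdftotext -layout`, `ocrmypdf -l ... --sidecar`, and `xargs -0 -P` invocations are real tool interfaces.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; names and variables are assumptions, not TextHarvest's interface.

# Extract text from one PDF into a sibling .txt file.
# With DRY_RUN=1, only print the command that would run.
extract_one() {
  local pdf="$1"
  local out="${pdf%.pdf}.txt"

  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf 'pdftotext -layout %q %q\n' "$pdf" "$out"
    return 0
  fi

  # Direct extraction via Poppler.
  pdftotext -layout "$pdf" "$out"

  # If the result holds only whitespace, assume a scanned document and
  # fall back to OCR. OCR_LANGS uses Tesseract syntax, e.g. "eng+deu".
  if ! grep -q '[^[:space:]]' "$out"; then
    ocrmypdf -l "${OCR_LANGS:-eng}" --sidecar "$out" "$pdf" "${pdf%.pdf}.ocr.pdf"
  fi
}

# Process every PDF under a directory, running at most $2 jobs at once.
harvest_dir() {
  local dir="$1" jobs="${2:-4}"
  export -f extract_one
  find "$dir" -type f -name '*.pdf' -print0 |
    xargs -0 -n 1 -P "$jobs" bash -c 'extract_one "$1"' _
}
```

Under these assumptions, `DRY_RUN=1 extract_one report.pdf` would preview the extraction command for a single file, and `harvest_dir ./docs 8` would process a directory with up to eight parallel jobs.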

# Tech Stack

Bash, Poppler, Tesseract OCR, ocrmypdf