feat: brain PDF/image text extraction — pymupdf + tesseract OCR + vision API

- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable
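
The fallback order in the bullets above could be sketched roughly as follows. This is an illustration under assumptions, not the code from this commit: the function names, the `MIN_TEXT_CHARS` cutoff, and the injectable `vision` callable are all hypothetical, and the OpenAI vision call is left as a caller-supplied function rather than guessed at. The third-party imports are done lazily so the dispatch logic can be exercised without them installed.

```python
import io
from typing import Callable, Optional

MIN_TEXT_CHARS = 20  # hypothetical cutoff: below this, treat the result as "no usable text"


def pdf_selectable_text(data: bytes) -> str:
    """Return the embedded text layer of a PDF (typically empty for pure scans)."""
    import pymupdf  # lazy import; older PyMuPDF releases expose the same API as `fitz`

    with pymupdf.open(stream=data, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def pdf_ocr_text(data: bytes, dpi: int = 200) -> str:
    """Rasterize each page and run Tesseract on the rendered image."""
    import pymupdf
    import pytesseract
    from PIL import Image

    chunks = []
    with pymupdf.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            png = page.get_pixmap(dpi=dpi).tobytes("png")
            chunks.append(pytesseract.image_to_string(Image.open(io.BytesIO(png))))
    return "\n".join(chunks)


def image_ocr_text(data: bytes) -> str:
    """Run Tesseract directly on an uploaded image."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(data)))


def extract_text(
    data: bytes,
    content_type: str,
    *,
    pdf_text: Callable[[bytes], str] = pdf_selectable_text,
    pdf_ocr: Callable[[bytes], str] = pdf_ocr_text,
    image_ocr: Callable[[bytes], str] = image_ocr_text,
    vision: Optional[Callable[[bytes], str]] = None,  # hypothetical vision-API hook
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Pick an extraction strategy by content type, falling back when text is too short."""
    if content_type == "application/pdf":
        text = pdf_text(data).strip()
        # Scanned PDFs have little or no text layer, so fall back to OCR.
        return text if len(text) >= min_chars else pdf_ocr(data).strip()
    if content_type.startswith("image/"):
        text = image_ocr(data).strip()
        # Photos often yield almost nothing from OCR; try the vision hook if one is given.
        if len(text) < min_chars and vision is not None:
            text = vision(data).strip()
        return text
    if content_type.startswith("text/"):
        return data.decode("utf-8", errors="replace")
    raise ValueError(f"unsupported content type: {content_type}")
```

The return value is what would be written to `extracted_text`. Passing the extractors in as keyword arguments keeps the fallback decision separate from the libraries that do the work, so the ordering can be checked with stubs.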

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles
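
Because pytesseract shells out to the `tesseract` executable, the system package matters as much as the Python deps. A startup check along these lines (a hypothetical helper, not part of this commit) would surface a missing binary before the first OCR call:

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None
```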

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Yusuf Suleman
2026-04-01 18:49:04 -05:00
parent 2c3f0d263b
commit b179386a57
6 changed files with 286 additions and 2 deletions

@@ -0,0 +1,19 @@
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj
3 0 obj<</Type/Page/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>endobj
4 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj
5 0 obj<</Length 200>>
stream
BT /F1 18 Tf 72 700 Td (State Farm Insurance Policy) Tj ET
BT /F1 12 Tf 72 670 Td (Policy Number: SF-2024-881234) Tj ET
BT /F1 12 Tf 72 650 Td (Policyholder: Yusuf Suleman) Tj ET
BT /F1 12 Tf 72 630 Td (Deductible: 500 dollars) Tj ET
endstream
endobj
xref
0 6
trailer<</Size 6/Root 1 0 R>>
startxref
0
%%EOF