feat: brain PDF/image text extraction — pymupdf + tesseract OCR + vision API

- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
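
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function names `first_nonempty` and `extract_text`, and the four callables `pdf_text`, `pdf_ocr`, `image_ocr`, and `vision`, are hypothetical stand-ins for the pymupdf, Tesseract, and vision API wrappers. It also assumes that "scanned" is detected by the text layer coming back empty, which the commit message does not state.

```python
from typing import Callable, Iterable

# An extractor takes raw file bytes and returns whatever text it could find.
Extractor = Callable[[bytes], str]


def first_nonempty(data: bytes, extractors: Iterable[Extractor]) -> str:
    """Try each extractor in order; return the first non-blank result."""
    for extract in extractors:
        try:
            text = extract(data).strip()
        except Exception:
            # Assumption: a failing extractor should not abort the upload,
            # so we move on to the next fallback.
            continue
        if text:
            return text
    return ""


def extract_text(
    data: bytes,
    content_type: str,
    *,
    pdf_text: Extractor,   # hypothetical: selectable text via pymupdf
    pdf_ocr: Extractor,    # hypothetical: render pages, then Tesseract
    image_ocr: Extractor,  # hypothetical: Tesseract on the image
    vision: Extractor,     # hypothetical: vision API description
) -> str:
    """Pick the extraction chain by content type (value for extracted_text)."""
    if content_type == "application/pdf":
        return first_nonempty(data, [pdf_text, pdf_ocr])
    if content_type.startswith("image/"):
        return first_nonempty(data, [image_ocr, vision])
    if content_type.startswith("text/"):
        # Plain text: direct decode, tolerating invalid bytes.
        return data.decode("utf-8", errors="replace")
    return ""
```

Passing the extractors in as callables keeps the ordering logic testable without Tesseract or network access; the real wrappers would be supplied at the call site.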
2026-04-01 18:49:04 -05:00
parent 2c3f0d263b
commit b179386a57
6 changed files with 286 additions and 2 deletions
--- a/services/brain/storage/a66d0767-a59e-4ede-878d-f74441e42830/original_upload/test-insurance.pdf
+++ b/services/brain/storage/a66d0767-a59e-4ede-878d-f74441e42830/original_upload/test-insurance.pdf
@@ -0,0 +1,19 @@
+%PDF-1.0
+1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
+2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj
+3 0 obj<</Type/Page/MediaBox[0 0 612 792]/Parent 2 0 R/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>endobj
+4 0 obj<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>endobj
+5 0 obj<</Length 200>>
+stream
+BT /F1 18 Tf 72 700 Td (State Farm Insurance Policy) Tj ET
+BT /F1 12 Tf 72 670 Td (Policy Number: SF-2024-881234) Tj ET
+BT /F1 12 Tf 72 650 Td (Policyholder: Yusuf Suleman) Tj ET
+BT /F1 12 Tf 72 630 Td (Deductible: 500 dollars) Tj ET
+endstream
+endobj
+xref
+0 6
+trailer<</Size 6/Root 1 0 R>>
+startxref
+0
+%%EOF