platform/services/brain/requirements.txt at 5098545580fbfb93c7f3bc8e428dd1ae368ef598

yusiboyz/platform

Yusuf Suleman b179386a57 feat: brain PDF/image text extraction — pymupdf + tesseract OCR + vision API

- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable
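The fallback chain described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function names, the `MIN_CHARS` threshold, the 200 dpi render setting, and the rule "call the vision fallback only when OCR returns nothing" are all assumptions. The vision step is left as an injected callable because the commit does not show how the OpenAI API is called.

```python
import io
from typing import Callable, Optional

# Hypothetical threshold: below this many characters, treat the PDF as scanned.
MIN_CHARS = 20


def pdf_selectable_text(data: bytes) -> str:
    """Join the selectable text of every page using pymupdf."""
    import pymupdf  # imported lazily so the module loads without it

    with pymupdf.open(stream=data, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def image_ocr_text(data: bytes) -> str:
    """Run Tesseract OCR on encoded image bytes (PNG, JPEG, ...)."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(data)))


def pdf_ocr_text(data: bytes) -> str:
    """Render each page to PNG and OCR it; intended for scanned PDFs."""
    import pymupdf

    parts = []
    with pymupdf.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            png = page.get_pixmap(dpi=200).tobytes("png")  # dpi is an assumption
            parts.append(image_ocr_text(png))
    return "\n".join(parts)


def extract_text(
    data: bytes,
    content_type: str,
    *,
    pdf_text: Callable[[bytes], str] = pdf_selectable_text,
    pdf_ocr: Callable[[bytes], str] = pdf_ocr_text,
    image_ocr: Callable[[bytes], str] = image_ocr_text,
    vision: Optional[Callable[[bytes], str]] = None,  # hypothetical vision API hook
) -> str:
    """Return text for the extracted_text field, or "" for unsupported types."""
    if content_type == "application/pdf":
        text = pdf_text(data).strip()
        # Too little selectable text: assume a scanned document and OCR it.
        return text if len(text) >= MIN_CHARS else pdf_ocr(data).strip()
    if content_type.startswith("image/"):
        text = image_ocr(data).strip()
        if not text and vision is not None:
            text = vision(data).strip()
        return text
    if content_type.startswith("text/"):
        return data.decode("utf-8", errors="replace")
    return ""
```

Passing the extractors as keyword arguments keeps the dispatch logic testable without Tesseract installed, for example `extract_text(b"hello", "text/plain")` exercises only the direct-decode path.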

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles
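Because pytesseract only wraps the `tesseract` binary, the pip packages alone are not enough; the system package has to be present in the image. A startup check along these lines could surface a missing binary early. The function names are hypothetical and this check is not part of the commit.

```python
import shutil
from typing import Callable, Optional


def tesseract_available(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True if a `tesseract` executable is found on PATH."""
    return which("tesseract") is not None


def require_tesseract(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> None:
    """Raise early if the tesseract-ocr system package appears to be missing."""
    if not tesseract_available(which):
        raise RuntimeError(
            "tesseract binary not found on PATH; install the tesseract-ocr package"
        )
```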

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-01 18:49:04 -05:00

15 lines

257 B

Plaintext

fastapi==0.115.6
uvicorn[standard]==0.34.0
sqlalchemy[asyncio]==2.0.36
asyncpg==0.30.0
pgvector==0.3.6
psycopg2-binary==2.9.10
redis==5.2.1
rq==2.1.0
httpx==0.28.1
pydantic==2.10.4
python-multipart==0.0.20
pymupdf==1.25.3
pytesseract==0.3.13
Pillow==11.1.0