- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15 lines
257 B
Plaintext
fastapi==0.115.6
uvicorn[standard]==0.34.0
sqlalchemy[asyncio]==2.0.36
asyncpg==0.30.0
pgvector==0.3.6
psycopg2-binary==2.9.10
redis==5.2.1
rq==2.1.0
httpx==0.28.1
pydantic==2.10.4
python-multipart==0.0.20
pymupdf==1.25.3
pytesseract==0.3.13
Pillow==11.1.0
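
The PDF path described in the commit message (selectable text via pymupdf first, Tesseract OCR for scanned documents) could be sketched roughly as below. This is an illustration, not the code from this commit: the function names, the `min_chars` threshold used to decide that a PDF is "scanned", and the 200 DPI render setting are all assumptions made for the example.

```python
import io
from typing import Callable

# Hypothetical threshold: below this many characters we assume the PDF
# has no usable text layer and treat it as a scanned document.
MIN_CHARS = 20


def extract_selectable_text(pdf_bytes: bytes) -> str:
    """Read the embedded text layer of every page with pymupdf."""
    import pymupdf  # imported lazily so the module loads without it

    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def ocr_pdf(pdf_bytes: bytes, dpi: int = 200) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    import pymupdf
    import pytesseract  # needs the tesseract-ocr system package
    from PIL import Image

    parts = []
    with pymupdf.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            parts.append(pytesseract.image_to_string(img))
    return "\n".join(parts)


def text_with_fallback(
    data: bytes,
    primary: Callable[[bytes], str] = extract_selectable_text,
    fallback: Callable[[bytes], str] = ocr_pdf,
    min_chars: int = MIN_CHARS,
) -> str:
    """Return the primary extractor's text, or the fallback's if too short."""
    text = primary(data).strip()
    if len(text) >= min_chars:
        return text
    return fallback(data).strip()
```

Passing the extractors in as parameters keeps the fallback decision testable without pymupdf or Tesseract installed; the returned string is what would be written to the `extracted_text` field.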