Files
platform/services/brain/Dockerfile.worker
Yusuf Suleman b179386a57 feat: brain PDF/image text extraction — pymupdf + tesseract OCR + vision API
- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 18:49:04 -05:00

20 lines
570 B
Docker

FROM python:3.12-slim
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends libpq-dev tesseract-ocr tesseract-ocr-eng && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN adduser --disabled-password --no-create-home appuser
RUN mkdir -p /app/storage && chown -R appuser /app/storage
COPY --chown=appuser app/ app/
ENV PYTHONUNBUFFERED=1
USER appuser
CMD ["rq", "worker", "brain", "--url", "redis://brain-redis:6379/0", "--path", "/app"]