- PDF: extracts selectable text via pymupdf, falls back to Tesseract OCR for scanned docs
- PDF: renders first page as screenshot thumbnail
- Images: Tesseract OCR for text extraction, OpenAI vision API fallback for photos
- Plain text files: direct decode
- All extracted text stored in extracted_text field for search/embedding
- Tested: PDF upload → text extracted → AI classified → searchable

New deps: pymupdf, pytesseract, Pillow
System dep: tesseract-ocr added to both Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
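
The extraction order described in the commit message above can be sketched roughly as follows. This is an illustrative sketch, not the code from this commit: the function names (`with_fallback`, `pdf_text_layer`, `pdf_ocr`, `image_ocr`, `extract_text`), the `MIN_CHARS` cutoff, and the extension lists are all assumptions made for the example. The OpenAI vision fallback for photos is left out, and the third-party imports are done lazily so the dispatch logic can be exercised without them installed.

```python
import io
from pathlib import Path
from typing import Callable

MIN_CHARS = 20  # hypothetical cutoff for "the text layer is effectively empty"


def with_fallback(primary: Callable[[bytes], str],
                  fallback: Callable[[bytes], str],
                  data: bytes,
                  min_chars: int = MIN_CHARS) -> str:
    """Return primary's text unless it is shorter than min_chars, then try fallback."""
    def safe(fn: Callable[[bytes], str]) -> str:
        try:
            return fn(data).strip()
        except Exception:
            return ""  # treat any extractor failure as "no text"

    text = safe(primary)
    if len(text) >= min_chars:
        return text
    alt = safe(fallback)
    # Keep whichever attempt produced more text.
    return alt if len(alt) > len(text) else text


def pdf_text_layer(data: bytes) -> str:
    """Read the selectable text layer of a PDF with PyMuPDF."""
    import pymupdf  # lazy import; older releases expose this module as `fitz`
    with pymupdf.open(stream=data, filetype="pdf") as doc:
        return "\n".join(page.get_text() for page in doc)


def pdf_ocr(data: bytes) -> str:
    """Rasterize each PDF page and run Tesseract on the image."""
    import pymupdf
    import pytesseract
    from PIL import Image
    parts = []
    with pymupdf.open(stream=data, filetype="pdf") as doc:
        for page in doc:
            png = page.get_pixmap(dpi=200).tobytes("png")
            parts.append(pytesseract.image_to_string(Image.open(io.BytesIO(png))))
    return "\n".join(parts)


def image_ocr(data: bytes) -> str:
    """Run Tesseract directly on an uploaded image."""
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(io.BytesIO(data)))


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor from the file extension (assumed dispatch rule)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return with_fallback(pdf_text_layer, pdf_ocr, data)
    if suffix in {".png", ".jpg", ".jpeg"}:
        return image_ocr(data).strip()  # vision-API fallback omitted here
    if suffix in {".txt", ".md"}:
        return data.decode("utf-8", errors="replace")
    return ""
```

Passing the extractors in as callables keeps the fallback rule separate from pymupdf and Tesseract, so a scanned PDF (empty text layer) and a born-digital PDF go through the same code path and differ only in which extractor ends up supplying the text.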
24 lines
743 B
Docker
FROM python:3.12-slim

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends libpq-dev tesseract-ocr tesseract-ocr-eng && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir --upgrade pip

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

RUN adduser --disabled-password --no-create-home appuser
RUN mkdir -p /app/storage && chown -R appuser /app/storage

COPY --chown=appuser app/ app/

EXPOSE 8200
ENV PYTHONUNBUFFERED=1

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8200/api/health', timeout=3)" || exit 1

USER appuser
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8200"]