<project_specification>
  <project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>

  <overview>
    Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
    strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
    improve code maintainability, enable static type checking with mypy, and provide clear documentation
    for all functions, classes, and modules.

    The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
    semantic chunking, and ingestion into a Weaviate vector database. It includes a Flask web interface
    for document upload, processing, and semantic search.
  </overview>

  <technology_stack>
    <backend>
      <runtime>Python 3.10+</runtime>
      <web_framework>Flask 3.0</web_framework>
      <vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
      <ocr>Mistral OCR API</ocr>
      <llm>Ollama (local) or Mistral API</llm>
      <type_checking>mypy with strict configuration</type_checking>
    </backend>
    <infrastructure>
      <containerization>Docker Compose (Weaviate + transformers)</containerization>
      <dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
    </infrastructure>
  </technology_stack>

  <current_state>
    <project_structure>
      - flask_app.py: Main Flask application (640 lines)
      - schema.py: Weaviate schema definition (383 lines)
      - utils/: 16+ modules for PDF processing pipeline
        - pdf_pipeline.py: Main orchestration (879 lines)
        - mistral_client.py: OCR API client
        - ocr_processor.py: OCR processing
        - markdown_builder.py: Markdown generation
        - llm_metadata.py: Metadata extraction via LLM
        - llm_toc.py: Table of contents extraction
        - llm_classifier.py: Section classification
        - llm_chunker.py: Semantic chunking
        - llm_cleaner.py: Chunk cleaning
        - llm_validator.py: Document validation
        - weaviate_ingest.py: Database ingestion
        - hierarchy_parser.py: Document hierarchy parsing
        - image_extractor.py: Image extraction from PDFs
        - toc_extractor*.py: Various TOC extraction methods
      - templates/: Jinja2 templates for Flask UI
      - tests/utils2/: Minimal test coverage (3 test files)
    </project_structure>

    <issues>
      - Inconsistent type annotations across modules (some have partial types, many have none)
      - Missing or incomplete docstrings (no Google-style format)
      - No mypy configuration for strict type checking
      - Type hints missing on function parameters and return values
      - Dict[str, Any] used extensively without proper typing
      - No shared type definitions (TypedDict, Protocol) for complex nested structures
    </issues>
  </current_state>

  <core_features>
    <type_annotations>
      <strict_typing>
        - Add complete type annotations to ALL functions and methods
        - Use proper generic types (List, Dict, Optional, Union) from typing module
        - Add TypedDict for complex dictionary structures
        - Add Protocol types for duck-typed interfaces
        - Use Literal types for string constants
        - Add ParamSpec and TypeVar where appropriate
        - Type all class attributes and instance variables
        - Type lambda functions by annotating the variable or parameter they are assigned to (lambda syntax itself cannot carry annotations)
      </strict_typing>

      <mypy_configuration>
        - Create mypy.ini with strict configuration
        - Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
        - Enable: disallow_untyped_calls, disallow_untyped_decorators
        - Enable: warn_return_any, warn_redundant_casts
        - Enable: strict_equality, strict_optional
        - Set python_version to 3.10
        - Configure per-module overrides if needed for gradual migration
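
        As an illustration, the flags above could be rendered into a mypy.ini with the standard
        library. This is a sketch, not the project's actual configuration: the helper name
        build_mypy_ini is hypothetical, and the per-module override for utils.llm_structurer is
        only an assumed example of gradual migration for a legacy module.

```python
import configparser
import io
from typing import List

# Strict flags named in the list above; each is a documented mypy option.
STRICT_FLAGS: List[str] = [
    "check_untyped_defs",
    "disallow_untyped_defs",
    "disallow_incomplete_defs",
    "disallow_untyped_calls",
    "disallow_untyped_decorators",
    "warn_return_any",
    "warn_redundant_casts",
    "strict_equality",
    "strict_optional",
]


def build_mypy_ini() -> str:
    """Render the text of a mypy.ini with the strict flags enabled."""
    config = configparser.ConfigParser()
    config["mypy"] = {"python_version": "3.10"}
    for flag in STRICT_FLAGS:
        config["mypy"][flag] = "True"
    # Assumed example: relax one check on a legacy module during migration.
    config["mypy-utils.llm_structurer"] = {"disallow_untyped_defs": "False"}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_mypy_ini())
```

        The printed text can be saved as mypy.ini at the repository root. Per-module sections
        use the form [mypy-package.module].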
      </mypy_configuration>

      <type_stubs>
        - Create TypedDict definitions for common data structures:
          - OCR response structures
          - Metadata dictionaries
          - TOC entries
          - Chunk objects
          - Weaviate objects
          - Pipeline results
        - Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
        - Create Protocol types for callback functions
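
        A possible shape for two of these definitions, shown as a sketch. The field names and
        the make_chunk helper are illustrative assumptions; the real structures should be
        derived from what the pipeline actually produces.

```python
from typing import List, NewType, TypedDict

# NewType makes these distinct from plain str for the type checker only;
# at runtime they are ordinary strings.
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TOCEntry(TypedDict):
    """One table-of-contents entry (field names are illustrative)."""

    title: str
    level: int
    page: int


class ChunkData(TypedDict):
    """One text chunk ready for ingestion (field names are illustrative)."""

    chunk_id: ChunkId
    document: DocumentName
    text: str
    section_path: List[str]


def make_chunk(document: DocumentName, index: int, text: str) -> ChunkData:
    """Build a chunk whose ID is derived from the document name and index."""
    return {
        "chunk_id": ChunkId(f"{document}-{index:04d}"),
        "document": document,
        "text": text,
        "section_path": [],
    }
```

        Passing a bare str where a DocumentName is expected is rejected by mypy, which is the
        semantic safety NewType is meant to provide.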
      </type_stubs>

      <specific_improvements>
        - pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
        - flask_app.py: Type all route handlers, request/response types
        - schema.py: Type Weaviate configuration objects
        - llm_*.py: Type LLM request/response structures
        - mistral_client.py: Type API client methods and responses
        - weaviate_ingest.py: Type ingestion functions and batch operations
      </specific_improvements>
    </type_annotations>

    <documentation>
      <google_style_docstrings>
        - Add comprehensive Google-style docstrings to ALL:
          - Module-level docstrings explaining purpose and usage
          - Class docstrings with Attributes section
          - Function/method docstrings with Args, Returns, Raises sections
          - Complex algorithm explanations with Examples section
        - Include code examples for public APIs
        - Document all exceptions that can be raised
        - Add Notes section for important implementation details
        - Add See Also section for related functions
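
        The examples later in this specification cover module and function docstrings. For the
        class case, a sketch with an Attributes section might look like the following. The
        class OcrCostEstimator is hypothetical, and its default rates are taken from the
        approximate per-page figures quoted in this specification, not from official pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrCostEstimator:
    """Estimate OCR cost from a page count.

    The per-page rates are approximate defaults, not authoritative pricing.

    Attributes:
        rate_per_page: Cost in euros for one page of standard OCR.
        annotated_rate_per_page: Cost in euros for one page with annotations.
    """

    rate_per_page: float = 0.003
    annotated_rate_per_page: float = 0.009

    def estimate(self, pages: int, *, annotations: bool = False) -> float:
        """Return the estimated OCR cost in euros.

        Args:
            pages: Number of pages to process.
            annotations: Whether OCR annotations are requested.

        Returns:
            The estimated cost in euros, rounded to four decimals.

        Raises:
            ValueError: If pages is negative.

        Examples:
            >>> OcrCostEstimator().estimate(9)
            0.027
        """
        if 0 > pages:
            raise ValueError("pages must not be negative")
        rate = self.annotated_rate_per_page if annotations else self.rate_per_page
        return round(pages * rate, 4)
```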
      </google_style_docstrings>

      <module_documentation>
        <utils_modules>
          - pdf_pipeline.py: Document the 10-step pipeline, each step's purpose
          - mistral_client.py: Document OCR API usage, cost calculation
          - llm_metadata.py: Document metadata extraction logic
          - llm_toc.py: Document TOC extraction strategies
          - llm_classifier.py: Document section classification types
          - llm_chunker.py: Document semantic vs basic chunking
          - llm_cleaner.py: Document cleaning rules and validation
          - llm_validator.py: Document validation criteria
          - weaviate_ingest.py: Document ingestion process, nested objects
          - hierarchy_parser.py: Document hierarchy building algorithm
        </utils_modules>

        <flask_app>
          - Document all routes with request/response examples
          - Document SSE (Server-Sent Events) implementation
          - Document Weaviate query patterns
          - Document upload processing workflow
          - Document background job management
        </flask_app>

        <schema>
          - Document Weaviate schema design decisions
          - Document each collection's purpose and relationships
          - Document nested object structure
          - Document vectorization strategy
        </schema>
      </module_documentation>

      <inline_comments>
        - Add inline comments for complex logic only (don't over-comment)
        - Explain WHY not WHAT (code should be self-documenting)
        - Document performance considerations
        - Document cost implications (OCR, LLM API calls)
        - Document error handling strategies
      </inline_comments>
    </documentation>

    <validation>
      <type_checking>
        - All modules must pass mypy --strict
        - No # type: ignore comments without justification
        - CI/CD should run mypy checks
        - Type coverage should be 100%
      </type_checking>

      <documentation_quality>
        - All public functions must have docstrings
        - All docstrings must follow Google style
        - Examples should be executable and tested
        - Documentation should be clear and concise
      </documentation_quality>
    </validation>
  </core_features>

  <implementation_priority>
    <critical_modules>
      Priority 1 (Most used, most complex):
      1. utils/pdf_pipeline.py - Main orchestration
      2. flask_app.py - Web application entry point
      3. utils/weaviate_ingest.py - Database operations
      4. schema.py - Schema definition

      Priority 2 (Core LLM modules):
      5. utils/llm_metadata.py
      6. utils/llm_toc.py
      7. utils/llm_classifier.py
      8. utils/llm_chunker.py
      9. utils/llm_cleaner.py
      10. utils/llm_validator.py

      Priority 3 (OCR and parsing):
      11. utils/mistral_client.py
      12. utils/ocr_processor.py
      13. utils/markdown_builder.py
      14. utils/hierarchy_parser.py
      15. utils/image_extractor.py

      Priority 4 (Supporting modules):
      16. utils/toc_extractor.py
      17. utils/toc_extractor_markdown.py
      18. utils/toc_extractor_visual.py
      19. utils/llm_structurer.py (legacy)
    </critical_modules>
  </implementation_priority>

  <implementation_steps>
    <feature_1>
      <title>Setup Type Checking Infrastructure</title>
      <description>
        Configure mypy with strict settings and create foundational type definitions
      </description>
      <tasks>
        - Create mypy.ini configuration file with strict settings
        - Add mypy to requirements.txt or dev dependencies
        - Create utils/types.py module for common TypedDict definitions
        - Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
        - Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
        - Create Protocol types for callbacks (ProgressCallback, etc.)
        - Document type definitions in utils/types.py module docstring
        - Test mypy configuration on a single module to verify settings
      </tasks>
      <acceptance_criteria>
        - mypy.ini exists with strict configuration
        - utils/types.py contains all foundational types with docstrings
        - mypy runs without errors on utils/types.py
        - Type definitions are comprehensive and reusable
      </acceptance_criteria>
    </feature_1>

    <feature_2>
      <title>Add Types to PDF Pipeline Orchestration</title>
      <description>
        Add complete type annotations to pdf_pipeline.py (879 lines, most complex module)
      </description>
      <tasks>
        - Add type annotations to all function signatures in pdf_pipeline.py
        - Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Enrich, Validate, Weaviate
        - Type progress_callback parameter with Protocol or Callable
        - Add TypedDict for pipeline options dictionary
        - Add TypedDict for pipeline result dictionary structure
        - Type all helper functions (extract_document_metadata_legacy, etc.)
        - Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
        - Fix any mypy errors that arise
        - Verify mypy --strict passes on pdf_pipeline.py
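
        One way the callback and options typing could be sketched. The (step_id, status, detail)
        call shape follows the docstring example in this specification; the status strings, the
        PipelineOptions keys shown, and the run_step helper are assumptions for illustration.

```python
from typing import Optional, Protocol, TypedDict


class ProgressCallback(Protocol):
    """Callable invoked as (step_id, status, detail) for each pipeline step."""

    def __call__(self, step_id: str, status: str, detail: str) -> None: ...


class PipelineOptions(TypedDict, total=False):
    """Optional pipeline switches; total=False lets every key be omitted."""

    use_llm: bool
    llm_provider: str
    skip_ocr: bool
    ingest_to_weaviate: bool


def run_step(step_id: str, callback: Optional[ProgressCallback] = None) -> None:
    """Hypothetical helper that reports progress around a unit of work."""
    if callback is not None:
        callback(step_id, "active", "started")
    # ... the actual step would run here ...
    if callback is not None:
        callback(step_id, "completed", "done")
```

        A Protocol with named parameters documents the call shape more precisely than
        Callable[[str, str, str], None], at the cost of a few extra lines.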
      </tasks>
      <acceptance_criteria>
        - All functions in pdf_pipeline.py have complete type annotations
        - progress_callback is properly typed with a Protocol (or an equivalent Callable alias)
        - All Dict[str, Any] replaced with TypedDict where appropriate
        - mypy --strict pdf_pipeline.py passes with zero errors
        - No # type: ignore comments (or justified if absolutely necessary)
      </acceptance_criteria>
    </feature_2>

    <feature_3>
      <title>Add Types to Flask Application</title>
      <description>
        Add complete type annotations to flask_app.py and type all routes
      </description>
      <tasks>
        - Add type annotations to all Flask route handlers
        - Type request.args, request.form, request.files usage
        - Type jsonify() return values
        - Type get_weaviate_client context manager
        - Type get_collection_stats, get_all_chunks, search_chunks functions
        - Add TypedDict for Weaviate query results
        - Type background job processing functions (run_processing_job)
        - Type SSE generator function (upload_progress)
        - Add type hints for template rendering
        - Verify mypy --strict passes on flask_app.py
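
        For the SSE generator, a framework-free sketch of the typing is shown below. The
        ProgressEvent fields and the function names are illustrative assumptions; the actual
        upload_progress route may carry different data.

```python
import json
from typing import Iterator, List, TypedDict


class ProgressEvent(TypedDict):
    """Payload of one progress event (field names are illustrative)."""

    step_id: str
    status: str
    detail: str


def format_sse(event: ProgressEvent) -> str:
    """Serialize one event in the SSE wire format: a data line plus a blank line."""
    return f"data: {json.dumps(event)}\n\n"


def stream_progress(events: List[ProgressEvent]) -> Iterator[str]:
    """Yield one SSE frame per event."""
    for event in events:
        yield format_sse(event)
```

        In a Flask view, a generator like this would typically be wrapped as
        Response(stream_progress(events), mimetype="text/event-stream").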
      </tasks>
      <acceptance_criteria>
        - All Flask routes have complete type annotations
        - Request/response types are clear and documented
        - Weaviate query functions are properly typed
        - SSE generator is correctly typed
        - mypy --strict flask_app.py passes with zero errors
      </acceptance_criteria>
    </feature_3>

    <feature_4>
      <title>Add Types to Core LLM Modules</title>
      <description>
        Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
      </description>
      <tasks>
        - llm_metadata.py: Type extract_metadata function, return structure
        - llm_toc.py: Type extract_toc function, TOC hierarchy structure
        - llm_classifier.py: Type classify_sections, section types (Literal), validation functions
        - llm_chunker.py: Type chunk_section_with_llm, chunk objects
        - llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
        - llm_validator.py: Type validate_document, validation result structure
        - Add TypedDict for LLM request/response structures
        - Type provider selection ("ollama" | "mistral" as Literal)
        - Type model names with Literal or constants
        - Verify mypy --strict passes on all llm_*.py modules
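
        Provider typing might be sketched as follows. The default model names are the ones
        quoted in this specification's docstring example; the helper names parse_provider and
        resolve_model are hypothetical.

```python
from typing import Dict, Literal, Optional, get_args

LLMProvider = Literal["ollama", "mistral"]

# Defaults mirror those named in the process_pdf_v2 docstring example.
DEFAULT_MODELS: Dict[LLMProvider, str] = {
    "ollama": "qwen2.5:7b",
    "mistral": "mistral-small-latest",
}


def parse_provider(value: str) -> LLMProvider:
    """Validate an untyped string (for example a form field) as an LLMProvider."""
    if value == "ollama":
        return "ollama"
    if value == "mistral":
        return "mistral"
    allowed = ", ".join(get_args(LLMProvider))
    raise ValueError(f"unknown provider {value!r}; expected one of: {allowed}")


def resolve_model(provider: LLMProvider, model: Optional[str] = None) -> str:
    """Return the explicit model name, or the provider's default."""
    return model if model is not None else DEFAULT_MODELS[provider]
```

        Validating at the boundary keeps the rest of the code free of plain str providers, so
        mypy can flag typos such as "olama" wherever a literal is passed directly.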
      </tasks>
      <acceptance_criteria>
        - All LLM modules have complete type annotations
        - Section types use Literal for type safety
        - Provider and model parameters are strongly typed
        - LLM request/response structures use TypedDict
        - mypy --strict passes on all llm_*.py modules with zero errors
      </acceptance_criteria>
    </feature_4>

    <feature_5>
      <title>Add Types to Weaviate and Database Modules</title>
      <description>
        Add complete type annotations to schema.py and weaviate_ingest.py
      </description>
      <tasks>
        - schema.py: Type Weaviate configuration objects
        - schema.py: Type collection property definitions
        - weaviate_ingest.py: Type ingest_document function signature
        - weaviate_ingest.py: Type delete_document_chunks function
        - weaviate_ingest.py: Add TypedDict for Weaviate object structure
        - Type batch insertion operations
        - Type nested object references (work, document)
        - Add proper error types for Weaviate exceptions
        - Verify mypy --strict passes on both modules
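
        A sketch of typed nested objects and batching that deliberately avoids any Weaviate
        client calls, since those depend on the installed weaviate-client version. All field
        names and the iter_batches helper are assumptions for illustration.

```python
from typing import Iterator, List, Sequence, TypedDict


class WorkRef(TypedDict):
    """Nested reference to the work (illustrative fields)."""

    title: str
    author: str


class DocumentRef(TypedDict):
    """Nested reference to the source document (illustrative fields)."""

    name: str


class WeaviateChunk(TypedDict):
    """Properties of one chunk object, including its nested references."""

    text: str
    work: WorkRef
    document: DocumentRef


def iter_batches(items: Sequence[WeaviateChunk], size: int) -> Iterator[List[WeaviateChunk]]:
    """Yield successive batches containing at most `size` objects each."""
    if not size > 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```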
      </tasks>
      <acceptance_criteria>
        - schema.py has complete type annotations for Weaviate config
        - weaviate_ingest.py functions are fully typed
        - Nested object structures use TypedDict
        - Weaviate client operations are properly typed
        - mypy --strict passes on both modules with zero errors
      </acceptance_criteria>
    </feature_5>

    <feature_6>
      <title>Add Types to OCR and Parsing Modules</title>
      <description>
        Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py, and image_extractor.py
      </description>
      <tasks>
        - mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
        - mistral_client.py: Add TypedDict for Mistral API response structures
        - ocr_processor.py: Type serialize_ocr_response, OCR object structures
        - markdown_builder.py: Type build_markdown, image_writer parameter
        - hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
        - hierarchy_parser.py: Add TypedDict for hierarchy node structure
        - image_extractor.py: Type create_image_writer, image handling
        - Verify mypy --strict passes on all modules
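
        For the hierarchy node structure, a recursive TypedDict is one option. The sketch below
        assumes illustrative field names, and flatten_to_paths is a hypothetical helper, not the
        signature of the existing flatten_hierarchy function.

```python
from __future__ import annotations

from typing import List, TypedDict


class HierarchyNode(TypedDict):
    """A section in the document tree (field names are illustrative)."""

    title: str
    level: int
    children: List[HierarchyNode]


def flatten_to_paths(nodes: List[HierarchyNode], prefix: str = "") -> List[str]:
    """Return one 'Parent > Child' path per node, in depth-first order."""
    paths: List[str] = []
    for node in nodes:
        path = f"{prefix} > {node['title']}" if prefix else node["title"]
        paths.append(path)
        # Recurse with the current path as the prefix for all descendants.
        paths.extend(flatten_to_paths(node["children"], path))
    return paths
```

        The postponed-annotations import lets HierarchyNode refer to itself without quoting
        the name.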
      </tasks>
      <acceptance_criteria>
        - All OCR/parsing modules have complete type annotations
        - Mistral API structures use TypedDict
        - Hierarchy nodes are properly typed
        - Image handling functions are typed
        - mypy --strict passes on all modules with zero errors
      </acceptance_criteria>
    </feature_6>

    <feature_7>
      <title>Add Google-Style Docstrings to Core Modules</title>
      <description>
        Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
      </description>
      <tasks>
        - pdf_pipeline.py: Add module docstring explaining the V2 pipeline
        - pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
        - pdf_pipeline.py: Document each of the 10 pipeline steps in comments
        - pdf_pipeline.py: Add Examples section showing typical usage
        - flask_app.py: Add module docstring explaining Flask application
        - flask_app.py: Document all routes with request/response examples
        - flask_app.py: Document Weaviate connection management
        - schema.py: Add module docstring explaining schema design
        - schema.py: Document each collection's purpose and relationships
        - weaviate_ingest.py: Document ingestion process with examples
        - All docstrings must follow Google style format exactly
      </tasks>
      <acceptance_criteria>
        - All core modules have comprehensive module-level docstrings
        - All public functions have Google-style docstrings
        - Args, Returns, Raises sections are complete and accurate
        - Examples are provided for complex functions
        - Docstrings explain WHY, not just WHAT
      </acceptance_criteria>
    </feature_7>

    <feature_8>
      <title>Add Google-Style Docstrings to LLM Modules</title>
      <description>
        Add comprehensive Google-style docstrings to all LLM processing modules
      </description>
      <tasks>
        - llm_metadata.py: Document metadata extraction logic with examples
        - llm_toc.py: Document TOC extraction strategies and fallbacks
        - llm_classifier.py: Document section types and classification criteria
        - llm_chunker.py: Document semantic vs basic chunking approaches
        - llm_cleaner.py: Document cleaning rules and validation logic
        - llm_validator.py: Document validation criteria and corrections
        - Add Examples sections showing input/output for each function
        - Document LLM provider differences (Ollama vs Mistral)
        - Document cost implications in Notes sections
        - All docstrings must follow Google style format exactly
      </tasks>
      <acceptance_criteria>
        - All LLM modules have comprehensive docstrings
        - Each function has Args, Returns, Raises sections
        - Examples show realistic input/output
        - Provider differences are documented
        - Cost implications are noted where relevant
      </acceptance_criteria>
    </feature_8>

    <feature_9>
      <title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
      <description>
        Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
      </description>
      <tasks>
        - mistral_client.py: Document OCR API usage, cost calculation
        - ocr_processor.py: Document OCR response processing
        - markdown_builder.py: Document markdown generation strategy
        - hierarchy_parser.py: Document hierarchy building algorithm
        - image_extractor.py: Document image extraction process
        - toc_extractor*.py: Document various TOC extraction methods
        - Add Examples sections for complex algorithms
        - Document edge cases and error handling
        - All docstrings must follow Google style format exactly
      </tasks>
      <acceptance_criteria>
        - All OCR/parsing modules have comprehensive docstrings
        - Complex algorithms are well explained
        - Edge cases are documented
        - Error handling is documented
        - Examples demonstrate typical usage
      </acceptance_criteria>
    </feature_9>

    <feature_10>
      <title>Final Validation and CI Integration</title>
      <description>
        Verify all type annotations and docstrings, integrate mypy into CI/CD
      </description>
      <tasks>
        - Run mypy --strict on entire codebase, verify 100% pass rate
        - Verify all public functions have docstrings
        - Check docstring formatting with pydocstyle or similar tool
        - Create GitHub Actions workflow to run mypy on every commit
        - Update README.md with type checking instructions
        - Update CLAUDE.md with documentation standards
        - Create CONTRIBUTING.md with type annotation and docstring guidelines
        - Generate API documentation with Sphinx or pdoc
        - Fix any remaining mypy errors or missing docstrings
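
        The docstring check could be automated with a small standard-library script, sketched
        below. The function name find_undocumented is hypothetical, and treating a leading
        underscore as "private" is an assumption about the project's naming convention.

```python
import ast
from typing import List


def find_undocumented(source: str) -> List[str]:
    """Return names of public functions and classes that lack a docstring."""
    missing: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Names starting with an underscore are treated as private and skipped.
            if not node.name.startswith("_") and ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing
```

        Running it over each module's source, for example with Path("flask_app.py").read_text(),
        gives a list that should be empty before the feature is considered complete.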
      </tasks>
      <acceptance_criteria>
        - mypy --strict passes on entire codebase with zero errors
        - All public functions have Google-style docstrings
        - CI/CD runs mypy checks automatically
        - Documentation is generated and accessible
        - Contributing guidelines document type/docstring requirements
      </acceptance_criteria>
    </feature_10>
  </implementation_steps>

  <success_criteria>
    <type_safety>
      - 100% type coverage across all modules
      - mypy --strict passes with zero errors
      - No # type: ignore comments without justification
      - All Dict[str, Any] replaced with TypedDict where appropriate
      - Proper use of generics, protocols, and type variables
      - NewType used for semantic type safety
    </type_safety>

    <documentation_quality>
      - All modules have comprehensive module-level docstrings
      - All public functions/classes have Google-style docstrings
      - All docstrings include Args, Returns, Raises sections
      - Complex functions include Examples sections
      - Cost implications documented in Notes sections
      - Error handling clearly documented
      - Provider differences (Ollama vs Mistral) documented
    </documentation_quality>

    <code_quality>
      - Code is self-documenting with clear variable names
      - Inline comments explain WHY, not WHAT
      - Complex algorithms are well explained
      - Performance considerations documented
      - Security considerations documented
    </code_quality>

    <developer_experience>
      - IDE autocomplete and inline type information work for all public APIs
      - Type errors caught at development time, not runtime
      - Documentation is easily accessible in IDE
      - API examples are executable and tested
      - Contributing guidelines are clear and comprehensive
    </developer_experience>

    <maintainability>
      - Refactoring is safer with type checking
      - Function signatures are self-documenting
      - API contracts are explicit and enforced
      - Breaking changes are caught by type checker
      - New developers can understand code quickly
    </maintainability>
  </success_criteria>

  <constraints>
    <compatibility>
      - Must maintain backward compatibility with existing code
      - Cannot break existing Flask routes or API contracts
      - Weaviate schema must remain unchanged
      - Existing tests must continue to pass
    </compatibility>

    <gradual_migration>
      - Can use per-module mypy configuration for gradual migration
      - Can temporarily disable strict checks on legacy modules
      - Priority modules must be completed first
      - Low-priority modules can be deferred
    </gradual_migration>

    <standards>
      - All type annotations must use Python 3.10+ syntax
      - Docstrings must follow Google style exactly (not NumPy or reStructuredText)
      - Use typing module generics (List, Dict, Optional) until Python 3.9 support is dropped
      - Use from __future__ import annotations if needed for forward references
    </standards>
  </constraints>

  <testing_strategy>
    <type_checking>
      - Run mypy --strict on each module after adding types
      - Use mypy daemon (dmypy) for faster incremental checking
      - Add mypy to pre-commit hooks
      - CI/CD must run mypy and fail on type errors
    </type_checking>

    <documentation_validation>
      - Use pydocstyle (with --convention=google) to validate Google-style format
      - Use sphinx-build to generate docs and catch errors
      - Manual review of docstring examples
      - Verify examples are executable and correct
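
      Checking that examples are executable can be done with the standard doctest module. The
      sketch below uses a hypothetical clean_title function as the unit under test; in the
      project, the same runner pattern would be pointed at real functions or whole modules.

```python
import doctest


def clean_title(title: str) -> str:
    """Collapse whitespace and lowercase a section title.

    Examples:
        >>> clean_title("  The   Meno ")
        'the meno'
    """
    return " ".join(title.split()).lower()


def run_examples() -> int:
    """Run the docstring example of clean_title and return the failure count."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Pass globs explicitly so the example can resolve clean_title by name.
    for test in finder.find(clean_title, name="clean_title", globs={"clean_title": clean_title}):
        runner.run(test)
    return runner.failures
```

      Examples that call paid APIs or need a running Weaviate instance are not suitable for
      this kind of check unless they are skipped or backed by test doubles.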
    </documentation_validation>

    <integration_testing>
      - Verify existing tests still pass after type additions
      - Add new tests for complex typed structures
      - Test mypy configuration on sample code
      - Verify IDE autocomplete works correctly
    </integration_testing>
  </testing_strategy>

  <documentation_examples>
    <module_docstring>
      ```python
      """
      PDF Pipeline V2 - Intelligent document processing with LLM enhancement.

      This module orchestrates a 10-step pipeline for processing PDF documents:
      1. OCR via Mistral API
      2. Markdown construction with images
      3. Metadata extraction via LLM
      4. Table of contents (TOC) extraction
      5. Section classification
      6. Semantic chunking
      7. Chunk cleaning and validation
      8. Enrichment with concepts
      9. Validation and corrections
      10. Ingestion into Weaviate vector database

      The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
      various processing modes (skip OCR, semantic chunking, OCR annotations).

      Typical usage:
          >>> from pathlib import Path
          >>> from utils.pdf_pipeline import process_pdf
          >>>
          >>> result = process_pdf(
          ...     Path("document.pdf"),
          ...     use_llm=True,
          ...     llm_provider="ollama",
          ...     ingest_to_weaviate=True,
          ... )
          >>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")

      See Also:
          mistral_client: OCR API client
          llm_metadata: Metadata extraction
          weaviate_ingest: Database ingestion
      """
      ```
    </module_docstring>

    <function_docstring>
      ```python
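      from pathlib import Path
      from typing import Literal, Optional

      # PipelineResult and ProgressCallback are the types planned for utils/types.py.
      from utils.types import PipelineResult, ProgressCallback


      def process_pdf_v2(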
          pdf_path: Path,
          output_dir: Path = Path("output"),
          *,
          use_llm: bool = True,
          llm_provider: Literal["ollama", "mistral"] = "ollama",
          llm_model: Optional[str] = None,
          skip_ocr: bool = False,
          ingest_to_weaviate: bool = True,
          progress_callback: Optional[ProgressCallback] = None,
      ) -> PipelineResult:
          """
          Process a PDF through the complete V2 pipeline with LLM enhancement.

          This function orchestrates all 10 steps of the intelligent document processing
          pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
          cloud (Mistral API) LLM providers, with optional caching via skip_ocr.

          Args:
              pdf_path: Absolute path to the PDF file to process.
              output_dir: Base directory for output files. Defaults to "./output".
              use_llm: Enable LLM-based processing (metadata, TOC, chunking).
                  If False, uses basic heuristic processing.
              llm_provider: LLM provider to use. "ollama" for local (free but slow),
                  "mistral" for API (fast but paid).
              llm_model: Specific model name. If None, auto-detects based on provider
                  (qwen2.5:7b for ollama, mistral-small-latest for mistral).
              skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
                  Requires output_dir/<doc_name>/<doc_name>.md to exist.
              ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
              progress_callback: Optional callback for real-time progress updates.
                  Called with (step_id, status, detail) for each pipeline step.

          Returns:
              Dictionary containing processing results with the following keys:
                  - success (bool): True if processing completed without errors
                  - document_name (str): Name of the processed document
                  - pages (int): Number of pages in the PDF
                  - chunks_count (int): Number of chunks generated
                  - cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
                  - cost_llm (float): LLM API cost in euros (0 if provider=ollama)
                  - cost_total (float): Total cost (ocr + llm)
                  - metadata (dict): Extracted metadata (title, author, etc.)
                  - toc (list): Hierarchical table of contents
                  - files (dict): Paths to generated files (markdown, chunks, etc.)

          Raises:
              FileNotFoundError: If pdf_path does not exist.
              ValueError: If skip_ocr=True but the markdown file is not found.
              RuntimeError: If Weaviate connection fails during ingestion.

          Examples:
              Basic usage with Ollama (free):
              >>> result = process_pdf_v2(
              ...     Path("platon_menon.pdf"),
              ...     llm_provider="ollama"
              ... )
              >>> print(f"Cost: {result['cost_total']:.4f}€")  # OCR only, LLM is free
              Cost: 0.0270€

              With Mistral API (faster):
              >>> result = process_pdf_v2(
              ...     Path("platon_menon.pdf"),
              ...     llm_provider="mistral",
              ...     llm_model="mistral-small-latest"
              ... )

              Skip OCR to avoid cost:
              >>> result = process_pdf_v2(
              ...     Path("platon_menon.pdf"),
              ...     skip_ocr=True,  # Reuses existing markdown
              ...     ingest_to_weaviate=False
              ... )

          Notes:
              - OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
              - LLM cost: Free with Ollama, variable with Mistral API
              - Processing time: ~30s/page with Ollama, ~5s/page with Mistral
              - Weaviate must be running (docker-compose up -d) before ingestion
          """
      ```
    </function_docstring>
  </documentation_examples>
</project_specification>
