Document Processing
Eneo’s document processing capabilities allow you to create AI-powered knowledge bases from your documents and websites. This guide covers uploading documents, web crawling, and optimizing retrieval.
Overview
Eneo can process and extract knowledge from:
- Documents: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .csv)
- Websites: Automated crawling and content extraction
- Text files: Plain text, Markdown
All content is:
- Extracted from source format
- Chunked into optimal segments
- Embedded as vectors for semantic search
- Stored in PostgreSQL with pgvector
- Retrieved when relevant to queries
Uploading Documents
Step 1: Create a Space
Documents belong to collaborative spaces in Eneo:
- Log in to your Eneo instance
- Click Spaces in the sidebar
- Click Create Space
- Fill in:
- Name: Your space name
- Description: What the space is for
- Visibility: Private or Team
- Click Create
Step 2: Navigate to Knowledge Base
- Open your space
- Click the Knowledge tab
- Click Add Documents
Step 3: Upload Files
- Click Upload Files or drag and drop
- Select one or multiple files
- Click Upload
The worker service will process files in the background. You’ll see:
- ✓ Uploaded: File received
- ⏳ Processing: Content extraction in progress
- ✓ Completed: Ready for queries
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text-based PDFs work best; OCR not yet supported |
| Word | .docx | Modern Word documents |
| PowerPoint | .pptx | Extracts text from slides |
| Excel | .xlsx, .csv | Extracts data as text |
| Text | .txt, .md | Plain text and Markdown |
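Before a bulk upload, you can pre-filter files against the extensions in the table above. This is a client-side convenience sketch, not an Eneo API:

```python
from pathlib import Path

# Extension list taken from the supported-formats table above
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".txt", ".md"}

def is_supported(filename: str) -> bool:
    # Compare the lowercased extension against the supported set
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("report.PDF"))  # True
print(is_supported("scan.jpeg"))   # False
```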
File Size Limits
Default maximum file size: 50MB per file
To increase limits, add to env_backend.env:
```bash
MAX_UPLOAD_SIZE_MB=100
```

Then restart:

```bash
docker compose restart backend
```

Web Crawling
Automatically extract content from websites.
Step 1: Add a Web Source
- In your space, go to Knowledge tab
- Click Add Web Source
- Enter the URL to crawl
- Configure crawling options:
- Max Depth: How many levels to crawl (default: 2)
- Max Pages: Maximum pages to process (default: 50)
- Follow Links: Crawl linked pages
- Click Start Crawling
Step 2: Monitor Progress
The crawler will:
- Fetch the initial page
- Extract links (if configured)
- Process each page in the background
- Update progress in real-time
Crawling Configuration
Shallow crawl (single page):
```text
URL: https://example.com/documentation
Max Depth: 0
Max Pages: 1
```

Deep crawl (entire section):

```text
URL: https://example.com/docs
Max Depth: 3
Max Pages: 100
Follow Links: Yes
```

Web Crawling Best Practices
- Start small: Test with single pages first
- Respect limits: Don’t crawl too many pages at once
- Check robots.txt: Ensure crawling is allowed
- Use specific URLs: Target documentation sections, not entire websites
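Python's standard library can check robots.txt rules before you start a crawl. Here the rules are parsed from literal lines for illustration; for a live site you would use `set_url()` and `read()` instead:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse sample rules directly; for a real site:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Allow: /docs/",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/docs/intro"))    # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```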
Handling Authentication
For authenticated websites, use the API:
```python
import requests

response = requests.post(
    "https://your-eneo-instance.com/api/spaces/123/knowledge/web",
    json={
        "url": "https://example.com/docs",
        "max_depth": 2,
        "headers": {"Authorization": "Bearer your-token"},
    },
    headers={"Authorization": "Bearer your-eneo-api-key"},
)
```

How Document Processing Works
1. Extraction
Content is extracted from source formats:
- PDF: Text extraction using pdfplumber
- Word: XML parsing of .docx structure
- PowerPoint: Slide text extraction
- Excel: Cell data extraction
- Web: HTML parsing with Scrapy
2. Chunking
Text is split into chunks for optimal retrieval:
Default chunk size: 1000 characters with 200 character overlap
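The sliding-window scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not Eneo's actual implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Step forward by (size - overlap) so consecutive chunks share `overlap` chars
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2600)
print(len(chunks))     # 4 (windows start at 0, 800, 1600, 2400)
print(len(chunks[0]))  # 1000
```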
Overlap ensures context isn’t lost at boundaries:
```text
Chunk 1: [..................] (chars 0-1000)
Chunk 2: [..................] (chars 800-1800)
Chunk 3: [..................] (chars 1600-2600)
```

3. Embedding
Each chunk is converted to a vector embedding:
- Model: Configurable embedding model (default: OpenAI text-embedding-3-small)
- Dimensions: 1536 dimensions (OpenAI) or model-specific
- Storage: Stored in PostgreSQL with pgvector extension
4. Retrieval
When users query:
- Query is converted to an embedding
- Vector similarity search finds relevant chunks
- Top-k most relevant chunks are retrieved (default: k=5)
- Chunks are provided as context to the AI model
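Conceptually, retrieval is a cosine-similarity search over the stored chunk vectors with a top-k cutoff and a similarity threshold. A minimal sketch of that logic (in practice Eneo does this inside PostgreSQL via pgvector, not in Python):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, k=5, threshold=0.7):
    # Score every stored chunk, drop those below the threshold, return top-k
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored = [(s, i) for s, i in scored if s >= threshold]
    scored.sort(reverse=True)
    return scored[:k]

chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
top = retrieve([1.0, 0.0], chunk_vecs, k=2)
print([i for _, i in top])  # [0, 1] — the orthogonal chunk is filtered out
```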
Optimizing Retrieval
Chunk Size Configuration
Adjust chunk size based on your content:
Large chunks (1500-2000 chars):
- ✓ Better for long-form content
- ✓ More context per chunk
- ✗ Less precise retrieval
Small chunks (500-800 chars):
- ✓ More precise retrieval
- ✓ Better for factual Q&A
- ✗ May lose context
To configure, add to env_backend.env:
```bash
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

Embedding Models
Choose the right embedding model:
OpenAI text-embedding-3-small (default):
- Fast and cost-effective
- 1536 dimensions
- Good for general purpose
OpenAI text-embedding-3-large:
- Higher quality
- 3072 dimensions
- Better for complex queries
Configure in env_backend.env:
```bash
EMBEDDING_MODEL=text-embedding-3-small
```

Retrieval Parameters
Adjust how many chunks are retrieved:
```bash
# Number of chunks to retrieve
RETRIEVAL_K=5

# Minimum similarity threshold (0-1)
RETRIEVAL_THRESHOLD=0.7
```

- Higher k = more context, but potentially less relevant chunks
- Higher threshold = more precise, but may miss relevant content
Managing Your Knowledge Base
View Documents
- Go to your space
- Click Knowledge tab
- See all documents and their status
Search Documents
Use the search bar to find specific documents:
```text
Search: "quarterly report"
```

Update Documents
To update a document:
- Delete the old version
- Upload the new version
Or use versioning:
- Keep both versions
- Mark old version as archived
Delete Documents
- Click the document
- Click Delete
- Confirm deletion
Deletion removes:
- Original document
- All chunks
- All embeddings
Performance Optimization
Processing Speed
Worker configuration in docker-compose.yml:
```yaml
worker:
  deploy:
    replicas: 2  # Run 2 worker instances
    resources:
      limits:
        cpus: '2'
        memory: 2G
```

Database Performance
Create indexes for faster queries:
```sql
-- Already included in migrations, but if needed:
CREATE INDEX idx_document_chunks_embedding
ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

Optimize PostgreSQL in env_db.env:

```bash
# Increase shared buffers
POSTGRES_SHARED_BUFFERS=256MB

# Increase work memory
POSTGRES_WORK_MEM=16MB
```

Caching
Enable Redis caching for embeddings:
```bash
# In env_backend.env
CACHE_EMBEDDINGS=true
CACHE_TTL_SECONDS=3600
```

Troubleshooting
Files Not Processing
Check worker status:
```bash
docker compose ps worker
docker compose logs worker
```

Common issues:
- Worker not running
- Insufficient memory
- Unsupported file format
- Corrupted file
Solution:
```bash
docker compose restart worker
```

Web Crawling Fails
Check error messages:
```bash
docker compose logs worker | grep crawl
```

Common issues:
- Website blocks crawlers (check robots.txt)
- Authentication required
- Rate limiting
- Network issues
Solution:
- Use specific URLs instead of root domains
- Reduce max_pages and max_depth
- Add delays between requests
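If a site rate-limits the crawler, spreading requests out helps. A simple exponential-backoff helper you could wrap around your own fetch calls; this is illustrative, not an Eneo setting:

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    # Delays double on each retry (1s, 2s, 4s, ...), capped at `cap` seconds
    return [min(base * (2 ** i), cap) for i in range(attempts)]

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Between retries, call `time.sleep(delay)` with each value before re-issuing the request.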
Poor Retrieval Quality
Symptoms:
- AI doesn’t use uploaded documents
- Irrelevant chunks retrieved
- Missing important information
Solutions:
- Check chunk size:
  - Too large = less precise
  - Too small = loses context

- Adjust retrieval parameters:

  ```bash
  RETRIEVAL_K=10           # Retrieve more chunks
  RETRIEVAL_THRESHOLD=0.6  # Lower threshold
  ```

- Use better embeddings:

  ```bash
  EMBEDDING_MODEL=text-embedding-3-large
  ```

- Improve document quality:
  - Use text-based PDFs (not scanned images)
  - Remove headers/footers
  - Clean up formatting
Out of Memory Errors
Reduce worker memory usage:
```yaml
# docker-compose.yml
worker:
  deploy:
    resources:
      limits:
        memory: 1G  # Reduce from 2G
```

Process fewer documents simultaneously:

```bash
# env_backend.env
MAX_CONCURRENT_TASKS=2  # Reduce from default
```

Advanced Use Cases
Batch Upload via API
Upload multiple files programmatically:
```python
import requests

files = [
    ('files', open('doc1.pdf', 'rb')),
    ('files', open('doc2.pdf', 'rb')),
    ('files', open('doc3.pdf', 'rb')),
]

response = requests.post(
    'https://your-eneo-instance.com/api/spaces/123/knowledge/upload',
    files=files,
    headers={'Authorization': 'Bearer your-api-key'},
)
```

Custom Chunking Strategies
For specialized content, implement custom chunking:
```python
# In your fork of Eneo backend
from app.services.chunking import ChunkingStrategy

class CustomChunker(ChunkingStrategy):
    def chunk(self, text: str) -> list[str]:
        # Your custom logic, e.g. one chunk per paragraph
        chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        return chunks
```

Multi-language Support
Process documents in multiple languages:
```bash
# env_backend.env
EMBEDDING_MODEL=multilingual-e5-large  # Multi-language model
```

Best Practices
- Organize by topic: Create separate spaces for different topics
- Use descriptive names: Name documents clearly
- Keep documents updated: Regularly refresh content
- Monitor storage: Check disk usage periodically
- Test retrieval: Verify documents are being used in responses
- Start small: Test with a few documents before bulk upload
- Clean data: Remove unnecessary content before upload
Need Help?
- Processing issues: Check troubleshooting docs
- API reference: API documentation
- GitHub issues: Report bugs
- Email support: digitalisering@sundsvall.se (public sector organizations)