Document Processing
Eneo’s document processing capabilities allow you to create AI-powered knowledge bases from your documents and websites. This guide covers uploading documents, web crawling, and optimizing retrieval.
Overview
Eneo can process and extract knowledge from:
- Documents: PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx, .csv)
- Websites: Automated crawling and content extraction
- Text files: Plain text, Markdown
All content is:
- Extracted from source format
- Chunked into optimal segments
- Embedded as vectors for semantic search
- Stored in PostgreSQL with pgvector
- Retrieved when relevant to queries
Uploading Documents
Step 1: Create a Space
Documents belong to collaborative spaces in Eneo:
- Log in to your Eneo instance
- Click Spaces in the sidebar
- Click Create Space
- Fill in:
- Name: Your space name
- Description: What the space is for
- Visibility: Private or Team
- Click Create
Step 2: Navigate to Knowledge Base
- Open your space
- Click the Knowledge tab
- Click Add Documents
Step 3: Upload Files
- Click Upload Files or drag and drop
- Select one or multiple files
- Click Upload
The worker service will process files in the background. You’ll see:
- ✓ Uploaded: File received
- ⏳ Processing: Content extraction in progress
- ✓ Completed: Ready for queries
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Text-based PDFs work best; OCR not yet supported |
| Word | .docx | Modern Word documents |
| PowerPoint | .pptx | Extracts text from slides |
| Excel | .xlsx, .csv | Extracts data as text |
| Text | .txt, .md | Plain text and Markdown |
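Before a bulk upload, you can pre-filter files against the extensions in the table above. This is a client-side convenience sketch, not an Eneo API:

```python
from pathlib import Path

# Extension list taken from the supported-formats table above
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".txt", ".md"}

def is_supported(filename: str) -> bool:
    # Compare the lowercased extension against the supported set
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("report.PDF"))  # True
print(is_supported("scan.jpeg"))   # False
```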
File Size Limits
Default maximum file size: 50MB per file
To increase limits, add to env_backend.env:
```bash
MAX_UPLOAD_SIZE_MB=100
```

Then restart:

```bash
docker compose restart backend
```

Web Crawling
Automatically extract content from websites.
Step 1: Add a Web Source
- In your space, go to Knowledge tab
- Click Add Web Source
- Enter the URL to crawl
- Configure crawling options:
- Max Depth: How many levels to crawl (default: 2)
- Max Pages: Maximum pages to process (default: 50)
- Follow Links: Crawl linked pages
- Click Start Crawling
Step 2: Monitor Progress
The crawler will:
- Fetch the initial page
- Extract links (if configured)
- Process each page in the background
- Update progress in real-time
Crawling Configuration
Shallow crawl (single page):
```text
URL: https://example.com/documentation
Max Depth: 0
Max Pages: 1
```

Deep crawl (entire section):

```text
URL: https://example.com/docs
Max Depth: 3
Max Pages: 100
Follow Links: Yes
```

Web Crawling Best Practices
- Start small: Test with single pages first
- Respect limits: Don’t crawl too many pages at once
- Check robots.txt: Ensure crawling is allowed
- Use specific URLs: Target documentation sections, not entire websites
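Python's standard library can check robots.txt rules before you start a crawl. Here the rules are parsed from literal lines for illustration; for a live site you would use `set_url()` and `read()` instead:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse sample rules directly; for a real site:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Allow: /docs/",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/docs/intro"))    # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```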
Handling Authentication
For authenticated websites, use the API:
```python
import requests

response = requests.post(
    "https://your-eneo-instance.com/api/spaces/123/knowledge/web",
    json={
        "url": "https://example.com/docs",
        "max_depth": 2,
        "headers": {"Authorization": "Bearer your-token"},
    },
    headers={"Authorization": "Bearer your-eneo-api-key"},
)
```

How Document Processing Works
1. Extraction
Content is extracted from source formats:
- PDF: Text extraction using pdfplumber
- Word: XML parsing of .docx structure
- PowerPoint: Slide text extraction
- Excel: Cell data extraction
- Web: HTML parsing with Scrapy
2. Chunking
Text is split into chunks for optimal retrieval:
Default chunk size: 1000 characters with 200 character overlap
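The sliding-window scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not Eneo's actual implementation:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Step forward by (size - overlap) so consecutive chunks share `overlap` chars
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2600)
print(len(chunks))     # 4 (windows start at 0, 800, 1600, 2400)
print(len(chunks[0]))  # 1000
```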
Overlap ensures context isn’t lost at boundaries:
```text
Chunk 1: [..................] (chars 0-1000)
Chunk 2: [..................] (chars 800-1800)
Chunk 3: [..................] (chars 1600-2600)
```

3. Embedding
Each chunk is converted to a vector embedding:
- Model: Configurable embedding model (default: OpenAI text-embedding-3-small)
- Dimensions: 1536 dimensions (OpenAI) or model-specific
- Storage: Stored in PostgreSQL with pgvector extension
4. Retrieval
When users query:
- Query is converted to an embedding
- Vector similarity search finds relevant chunks
- Top-k most relevant chunks are retrieved (default: k=5)
- Chunks are provided as context to the AI model
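Conceptually, retrieval is a cosine-similarity search over the stored chunk vectors with a top-k cutoff and a similarity threshold. A minimal sketch of that logic (in practice Eneo does this inside PostgreSQL via pgvector, not in Python):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, k=5, threshold=0.7):
    # Score every stored chunk, drop those below the threshold, return top-k
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored = [(s, i) for s, i in scored if s >= threshold]
    scored.sort(reverse=True)
    return scored[:k]

chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
top = retrieve([1.0, 0.0], chunk_vecs, k=2)
print([i for _, i in top])  # [0, 1] — the orthogonal chunk is filtered out
```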
Optimizing Retrieval
Chunk Size Configuration
Adjust chunk size based on your content:
Large chunks (1500-2000 chars):
- ✓ Better for long-form content
- ✓ More context per chunk
- ✗ Less precise retrieval
Small chunks (500-800 chars):
- ✓ More precise retrieval
- ✓ Better for factual Q&A
- ✗ May lose context
To configure, add to env_backend.env:
```bash
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```

Embedding Models
Choose the right embedding model:
OpenAI text-embedding-3-small (default):
- Fast and cost-effective
- 1536 dimensions
- Good for general purpose
OpenAI text-embedding-3-large:
- Higher quality
- 3072 dimensions
- Better for complex queries
Configure in env_backend.env:
```bash
EMBEDDING_MODEL=text-embedding-3-small
```

Retrieval Parameters
Adjust how many chunks are retrieved:
```bash
# Number of chunks to retrieve
RETRIEVAL_K=5

# Minimum similarity threshold (0-1)
RETRIEVAL_THRESHOLD=0.7
```

- Higher k = more context, but potentially less relevant chunks
- Higher threshold = more precise, but may miss relevant content
Managing Your Knowledge Base
View Documents
- Go to your space
- Click Knowledge tab
- See all documents and their status
Search Documents
Use the search bar to find specific documents:
```text
Search: "quarterly report"
```

Update Documents
To update a document:
- Delete the old version
- Upload the new version
Or use versioning:
- Keep both versions
- Mark old version as archived
Delete Documents
- Click the document
- Click Delete
- Confirm deletion
Deletion removes:
- Original document
- All chunks
- All embeddings
Performance Optimization
Processing Speed
Worker configuration in docker-compose.yml:
```yaml
worker:
  deploy:
    replicas: 2  # Run 2 worker instances
    resources:
      limits:
        cpus: '2'
        memory: 2G
```

Database Performance
Create indexes for faster queries:
```sql
-- Already included in migrations, but if needed:
CREATE INDEX idx_document_chunks_embedding
ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

Optimize PostgreSQL in env_db.env:

```bash
# Increase shared buffers
POSTGRES_SHARED_BUFFERS=256MB

# Increase work memory
POSTGRES_WORK_MEM=16MB
```

Caching
Enable Redis caching for embeddings:
```bash
# In env_backend.env
CACHE_EMBEDDINGS=true
CACHE_TTL_SECONDS=3600
```

Troubleshooting
Files Not Processing
Check worker status:
```bash
docker compose ps worker
docker compose logs worker
```

Common issues:
- Worker not running
- Insufficient memory
- Unsupported file format
- Corrupted file
Solution:
```bash
docker compose restart worker
```

Web Crawling Fails
Check error messages:
```bash
docker compose logs worker | grep crawl
```

Common issues:
- Website blocks crawlers (check robots.txt)
- Authentication required
- Rate limiting
- Network issues
Solution:
- Use specific URLs instead of root domains
- Reduce max_pages and max_depth
- Add delays between requests
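If a site rate-limits the crawler, spreading requests out helps. A simple exponential-backoff helper you could wrap around your own fetch calls; this is illustrative, not an Eneo setting:

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    # Delays double on each retry (1s, 2s, 4s, ...), capped at `cap` seconds
    return [min(base * (2 ** i), cap) for i in range(attempts)]

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Between retries, call `time.sleep(delay)` with each value before re-issuing the request.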
Poor Retrieval Quality
Symptoms:
- AI doesn’t use uploaded documents
- Irrelevant chunks retrieved
- Missing important information
Solutions:
- Check chunk size:
  - Too large = less precise
  - Too small = loses context

- Adjust retrieval parameters:

  ```bash
  RETRIEVAL_K=10           # Retrieve more chunks
  RETRIEVAL_THRESHOLD=0.6  # Lower threshold
  ```

- Use better embeddings:

  ```bash
  EMBEDDING_MODEL=text-embedding-3-large
  ```

- Improve document quality:
  - Use text-based PDFs (not scanned images)
  - Remove headers/footers
  - Clean up formatting
Out of Memory Errors
Reduce worker memory usage:
```yaml
# docker-compose.yml
worker:
  deploy:
    resources:
      limits:
        memory: 1G  # Reduce from 2G
```

Process fewer documents simultaneously:

```bash
# env_backend.env
MAX_CONCURRENT_TASKS=2  # Reduce from default
```

Advanced Use Cases
Batch Upload via API
Upload multiple files programmatically:
```python
import requests

files = [
    ('files', open('doc1.pdf', 'rb')),
    ('files', open('doc2.pdf', 'rb')),
    ('files', open('doc3.pdf', 'rb')),
]

response = requests.post(
    'https://your-eneo-instance.com/api/spaces/123/knowledge/upload',
    files=files,
    headers={'Authorization': 'Bearer your-api-key'},
)
```

Custom Chunking Strategies
For specialized content, implement custom chunking:
```python
# In your fork of Eneo backend
from app.services.chunking import ChunkingStrategy

class CustomChunker(ChunkingStrategy):
    def chunk(self, text: str) -> list[str]:
        # Your custom logic, e.g. one chunk per paragraph
        chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
        return chunks
```

Multi-language Support
Process documents in multiple languages:
```bash
# env_backend.env
EMBEDDING_MODEL=multilingual-e5-large  # Multi-language model
```

Best Practices
- Organize by topic: Create separate spaces for different topics
- Use descriptive names: Name documents clearly
- Keep documents updated: Regularly refresh content
- Monitor storage: Check disk usage periodically
- Test retrieval: Verify documents are being used in responses
- Start small: Test with a few documents before bulk upload
- Clean data: Remove unnecessary content before upload
Need Help?
- Processing issues: Check troubleshooting docs
- API reference: API documentation
- GitHub issues: Report bugs
- Email support: digitalisering@sundsvall.se (public sector organizations)