Evals First, Code Later: A Practical Guide to Evaluations, Rerankers & Caches
Many retrieval-augmented generation (RAG) and code-search pipelines rely on ad-hoc checks and break when deployed at scale.
This talk presents an evaluation-first development workflow applied to a production code-search engine built with Python, PostgreSQL (pgvector), and OpenAI rerankers. Building an automated evaluation suite before optimising guided changes that cut average query latency from 20 minutes to 30 seconds (a 40× speed-up) and raised relevance by roughly 30%. We will cover the following topics, each illustrated with a short sketch below:
- Constructing task-specific evaluation datasets and metrics
- Hybrid retrieval combining lexical and approximate nearest-neighbour (ANN) search
- Cross-encoder reranking for precision boosts
- Semantic caching strategies that keep indexes fresh and queries fast
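To make the first bullet concrete, here is a minimal sketch of the kind of evaluation harness the talk advocates: a small hand-labelled set of (query, relevant files) pairs scored with recall@k and mean reciprocal rank. The function and dataset names are illustrative, not taken from the reference implementation.

```python
# Evaluation-suite sketch: recall@k and mean reciprocal rank (MRR) over a small
# hand-labelled dataset of (query, relevant files) pairs. Names are illustrative.
def recall_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    return len(set(results[:k]) & relevant) / max(len(relevant), 1)

def mrr(results: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(search_fn, dataset: list[tuple[str, set[str]]]) -> dict[str, float]:
    """Run every labelled query through the search function and average the metrics."""
    recalls, mrrs = [], []
    for query, relevant in dataset:
        results = search_fn(query)
        recalls.append(recall_at_k(results, relevant))
        mrrs.append(mrr(results, relevant))
    n = len(dataset)
    return {"recall@10": sum(recalls) / n, "mrr": sum(mrrs) / n}
```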
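A hybrid retrieval pass can be expressed as two PostgreSQL queries whose results are fused. The sketch below assumes a hypothetical `code_chunks` table with a `tsv` full-text column and a pgvector `embedding` column, and merges the two result lists with reciprocal rank fusion; the actual schema in the talk's pipeline may differ.

```python
# Hybrid-retrieval sketch: lexical full-text search plus pgvector ANN search,
# merged with reciprocal rank fusion. Table and column names are illustrative.
import psycopg

LEXICAL_SQL = """
    SELECT id, path FROM code_chunks
    WHERE tsv @@ plainto_tsquery('english', %s)
    ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC
    LIMIT 50;
"""
ANN_SQL = """
    SELECT id, path FROM code_chunks
    ORDER BY embedding <=> %s::vector  -- pgvector cosine-distance operator
    LIMIT 50;
"""

def hybrid_search(conn: psycopg.Connection, query: str,
                  query_embedding: list[float], k: int = 10) -> list:
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(LEXICAL_SQL, (query, query))
        lexical = cur.fetchall()
        cur.execute(ANN_SQL, (vec_literal,))
        ann = cur.fetchall()

    # Reciprocal rank fusion: documents ranking well in either list score highest.
    scores: dict = {}
    for results in (lexical, ann):
        for rank, (doc_id, _path) in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```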
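The fused candidates are then rescored by a cross-encoder that reads the query and each candidate together. The talk's pipeline uses OpenAI-based reranking; the sketch below substitutes an open-source sentence-transformers cross-encoder so the example stays self-contained.

```python
# Reranking sketch: score (query, candidate) pairs with a cross-encoder and keep
# the top results. The model name here is a stand-in, not the talk's reranker.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _score in ranked[:top_k]]
```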
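Finally, a semantic cache short-circuits the whole pipeline when a new query's embedding is close enough to one that has already been answered. The threshold and in-memory storage below are illustrative; a production cache would also need invalidation whenever the underlying index is re-embedded, which is part of keeping indexes fresh.

```python
# Semantic-cache sketch: reuse a previous answer when a new query embedding is
# sufficiently similar to a cached one. Threshold and layout are illustrative.
import numpy as np

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def get(self, query_embedding: np.ndarray) -> str | None:
        q = query_embedding / np.linalg.norm(query_embedding)
        for emb, answer in self.entries:
            if float(q @ emb) >= self.threshold:
                return answer  # cache hit: skip retrieval and reranking entirely
        return None            # cache miss: caller runs the full pipeline

    def put(self, query_embedding: np.ndarray, answer: str) -> None:
        emb = query_embedding / np.linalg.norm(query_embedding)
        self.entries.append((emb, answer))
```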
The session includes benchmark results, a live demonstration, and an MIT-licensed reference implementation that attendees can clone and extend.