03 Case Study

Enterprise RAG Knowledge Assistant

RAG-style retrieval over structured technical documents — embeddings, vector search, and prompt design that surfaces source citations on every answer.

  • Python
  • JavaScript
  • Embeddings
  • Vector Search
  • REST APIs
  • RAG

/ Outcomes

  • Reduced manual document lookup time by 35%
  • Embedding-based retrieval over a structured technical-doc corpus
  • Every answer cites the source document and page — no ungrounded responses

Overview

Engineers were losing real hours every week digging through a deep technical document corpus to answer questions a prompt could resolve in seconds — if it had access to the right context. The trap is the obvious one: a model with no grounding will confidently make things up, and once an engineer is burned by one bad answer they stop trusting the tool entirely.

I built a RAG (Retrieval Augmented Generation) assistant that retrieves passages from the corpus first, then asks the model to answer only from what it retrieved, citing the source document on every response.

Approach

The architecture is three layers, each independently testable:

  • Ingestion. Parse the technical docs, split them into semantically meaningful chunks (not by character count — by section / paragraph), embed each chunk with a sentence-embedding model, and store the vectors alongside their source metadata.
  • Retrieval. At query time, embed the question, run vector search against the chunk index, return the top-k most relevant chunks with their source pointers.
  • Generation. Compose a prompt that gives the model the question and the retrieved chunks with instructions: answer from these chunks only, cite the source on each claim, say “I don’t know” when the chunks don’t cover the question. A sketch of this query-time flow follows the list.
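
A minimal sketch of that query-time path (retrieval, then grounded generation). The `sentence-transformers` model, the in-memory cosine search, and the `generate` callable are illustrative stand-ins for the production embedding model, vector store, and LLM client, which the sketch does not specify:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

SYSTEM_PROMPT = (
    "Answer only from the context chunks below. Cite the source document and "
    "page for every claim. If the chunks do not cover the question, reply: I don't know."
)

def retrieve(question, chunk_texts, chunk_vectors, chunk_meta, top_k=5):
    """Embed the question and return the top-k chunks by cosine similarity."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # vectors are pre-normalized, so dot product = cosine
    best = np.argsort(scores)[::-1][:top_k]
    return [(chunk_texts[i], chunk_meta[i]) for i in best]

def build_prompt(question, retrieved):
    """Compose the grounded prompt: instructions, retrieved chunks with sources, question."""
    context = "\n\n".join(
        f"[{meta['source']}, p. {meta['page']}]\n{text}" for text, meta in retrieved
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"

def answer(question, index, generate):
    """`index` is (texts, vectors, meta); `generate` is whatever LLM call the deployment uses."""
    retrieved = retrieve(question, *index)
    if not retrieved:
        return {"answer": "I don't know", "citations": []}
    return {
        "answer": generate(build_prompt(question, retrieved)),
        "citations": [meta for _, meta in retrieved],
    }
```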

What I built

  • Document ingestion pipeline in Python that handles the corpus formats, splits intelligently, embeds, and writes to the vector store (sketched after this list)
  • REST API in JavaScript exposing a /query endpoint that takes a natural-language question and returns the answer + citations
  • Prompt engineering focused on grounding — system prompts that explicitly forbid ungrounded answers, examples that show the model what citation format to use, fallback messaging when retrieval returns nothing
  • Quality measurement loop — track which queries return “I don’t know” so we can identify gaps in the corpus rather than accepting low confidence as an answer (also sketched below)
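
For the ingestion side, a minimal sketch of paragraph-level chunking with source and page metadata, producing an index in the shape the query-time sketch above expects. The page-oriented input, the metadata fields, and the `sentence-transformers` model are illustrative assumptions; the real pipeline handles the corpus's actual formats and writes to a persistent vector store rather than an in-memory tuple:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def chunk_pages(pages, source):
    """Split each page of one document into paragraph chunks, keeping (source, page)."""
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for paragraph in (p.strip() for p in page_text.split("\n\n")):
            if paragraph:
                chunks.append({"text": paragraph, "source": source, "page": page_number})
    return chunks

def build_index(documents):
    """documents: {source_name: [page_text, ...]}. Returns (texts, vectors, meta)."""
    chunks = [
        chunk
        for source, pages in documents.items()
        for chunk in chunk_pages(pages, source)
    ]
    texts = [c["text"] for c in chunks]
    vectors = model.encode(texts, normalize_embeddings=True)
    meta = [{"source": c["source"], "page": c["page"]} for c in chunks]
    return texts, np.asarray(vectors), meta
```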
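
And a small sketch of the quality-measurement loop: log every query that comes back as “I don’t know” and surface the most frequent ones as candidate corpus gaps. The Counter-based aggregation is an illustrative simplification, not the production implementation:

```python
from collections import Counter

class GapTracker:
    """Collects queries the assistant could not ground, to surface corpus gaps."""

    def __init__(self):
        self.unanswered = Counter()

    def record(self, question: str, answer: str) -> None:
        # The grounded prompt makes the model reply "I don't know" when retrieval
        # does not cover the question; those are the cases worth tracking.
        if answer.strip().lower().startswith("i don't know"):
            self.unanswered[question.strip().lower()] += 1

    def top_gaps(self, n: int = 10):
        """Most frequently unanswered questions, i.e. the documents worth adding next."""
        return self.unanswered.most_common(n)
```

Reviewed periodically, that list points at documents to add to the corpus rather than at prompt tweaks.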

Results

  • Manual document lookup time down 35% for the engineers using it
  • Citation requirement caught the model attempting to answer beyond its grounding multiple times during testing — those cases became corpus gaps to fill, not bugs to debug
  • The retrieval layer is the lever that matters most: prompt tuning on top of bad retrieval is wasted effort. We invested where it counted.

Lessons

RAG quality is dominated by retrieval quality. The prompt engineering matters but it’s a multiplier on what retrieval gives you, not a substitute. If the right chunks aren’t in the top-k, no prompt rescues that. Keep the chunking semantically aware, keep the embeddings up to date with the corpus, and the rest of the system gets a lot easier.