RAG is often equated with vector databases, embeddings, and semantic search. But at its core, RAG just means 'put relevant data in the prompt' — the retrieval mechanism is up to you. If your app already has any working search functionality, you can build a useful RAG pipeline with it, and add semantic search later if you ever need it.
The pipeline below is sketched for a chat-style "answer my question" flow, but the same shape works for other RAG systems as well - only the framing of input and output changes.
1. User asks a natural-language question, e.g. "Which department handles parental leave requests?".
2. A small LLM rewrites it into search keywords. For the example above: `parental leave department`.
3. Your existing search runs against those keywords and returns the top N matches.
4. You build a prompt that contains the top matches as context plus the original question and send it to a generative LLM.
5. The generative LLM produces an answer grounded in the search results.
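A minimal sketch of this flow in Ruby. The helper names (`rewrite_to_keywords`, `keyword_search`, `build_context`, `generative_llm_chat`) are assumptions and stand in for the steps detailed in the following sections:

```ruby
# Hypothetical glue code for the pipeline above; each helper is sketched
# further down in this article.
def answer(question)
  keywords = rewrite_to_keywords(question)                          # step 2: small LLM turns the question into search terms
  hits     = keyword_search(keywords, user: current_user, limit: 5) # step 3: your existing search, with user permissions
  context  = build_context(hits)                                    # step 4: snippets + metadata, capped to a budget

  prompt = <<~PROMPT
    Answer the question based only on the documents below.
    Cite the title and URL of the documents you used.

    #{context}

    Question: #{question}
  PROMPT

  generative_llm_chat(prompt) # step 5: e.g. another Ollama call with a larger model
end
```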
The implementation for each of those steps will vary greatly based on your existing app. Below are a few tips regarding steps 2 and 4.
Rewrite natural questions into search queries
Use a fast LLM to extract the search terms, as classic search engines are mostly unable to parse natural language queries. Example implementation:
```ruby
Ollama::Client.new.post('/chat',
  model: 'ministral-3:8b',
  stream: false,
  format: 'json',
  options: { temperature: 0, num_ctx: 1_024, num_predict: 1_024 },
  messages: [{ role: 'user', content: <<~PROMPT }],
    Extract the key search terms from the user's question. Use only words from the input, in their base form. Drop filler words, articles, and pronouns. Keep the input language.
    Respond with JSON: { "search_query": "term1 term2 term3" }
    User input: #{query.inspect}
  PROMPT
)
```
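How you get at the answer depends on your client wrapper; assuming the call above returns Ollama's chat response as a Ruby hash, extracting the keywords could look like this (the response shape is an assumption, and falling back to the raw question is a design choice):

```ruby
require 'json'

# Assumes the call above returned something like:
#   { 'message' => { 'content' => '{"search_query": "parental leave department"}' } }
content = response.dig('message', 'content').to_s
search_query =
  begin
    JSON.parse(content).fetch('search_query')
  rescue JSON::ParserError, KeyError
    query # fall back to the original question if the model returns malformed JSON
  end
```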
- A small/fast model handles this fine.
- Use `temperature: 0` and request structured JSON. The transformation is deterministic; you don't want creativity.
- The prompt's language should match that used by a majority of your users to reduce the chance that the LLM accidentally translates the query.
- It might be a good idea to cache the result (`Rails.cache`) for some time so a page reload doesn't re-call the LLM.
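For example, with `Rails.cache.fetch` (the cache key and expiry below are arbitrary choices):

```ruby
require 'digest'

# Cache the rewritten query so page reloads and repeated searches
# don't call the LLM again. One hour is an arbitrary expiry.
def rewrite_to_keywords(query)
  Rails.cache.fetch(['rag', 'search-keywords', Digest::SHA256.hexdigest(query)], expires_in: 1.hour) do
    extract_keywords_via_llm(query) # the Ollama call shown above
  end
end
```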
Authorize the search call
The user query is injected into the LLM prompt and then into the search engine. Run the search with the same authorization as any user-facing API - the LLM rewrite step does not sanitize anything.
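What this looks like depends on your search stack. As a sketch with an Elasticsearch/OpenSearch-style query built in Ruby (`SearchClient`, the index name, and the `allowed_group_ids` field are illustrative assumptions), the permission filter is applied in the query itself:

```ruby
# The permission filter comes from the authenticated user,
# never from the LLM output or the rewritten query string.
def keyword_search(keywords, user:, limit: 5)
  SearchClient.search(
    index: 'documents',
    body: {
      size: limit,
      query: {
        bool: {
          must:   { match: { content: keywords } },
          filter: { terms: { allowed_group_ids: user.group_ids } }
        }
      }
    }
  )
end
```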
Build the LLM context from search results
Concatenating full documents into the prompt blows the context window. Instead:
- If your search engine returns highlighted snippets (OpenSearch / Elasticsearch `highlight`), use them. They already point at the relevant text.
- Widen the snippet window for the RAG use case so the LLM has enough surrounding context (`fragment_size`).
- If your search engine only returns full documents, you'll have to build snippets yourself: find keyword positions in the result text and slice a window around each match (see the sketch after this list).
- Cap the total context size. Sum the fragment lengths and stop adding more results once you hit your token budget.
- Structure each result with metadata (document title, URL, author) so the LLM can cite sources back to the user. XML works well here:
```xml
<documents>
  <document>
    <title>Parental Leave Policy</title>
    <url>https://intranet.example.com/hr/parental-leave</url>
    <match>
      <page>2</page>
      <content><![CDATA[Requests for parental leave are handled by the HR department...]]></content>
    </match>
  </document>
</documents>
```
- Decide whether the LLM should answer strictly from the provided context or may draw on its general knowledge. For document search or compliance use cases you usually want the former - instruct the LLM explicitly to "Answer only based on the documents below". For a general-purpose chatbot this restriction is unnecessary and may frustrate users.
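Putting the snippet and capping logic together, here is a sketch of the context-building step. The hit structure (`:title`, `:url`, `:highlights`), the 3-snippet limit, and the character budget are assumptions; a real implementation would count tokens and escape user-controlled fields:

```ruby
MAX_CONTEXT_CHARS = 4_000 # rough budget; count tokens instead of characters if you can

def build_context(hits)
  used = 0
  docs = []

  hits.each do |hit|
    snippets = hit[:highlights].take(3)       # highlighted fragments or self-built windows
    size = snippets.sum(&:length)
    break if used + size > MAX_CONTEXT_CHARS  # cap the total context size
    used += size

    matches = snippets.map { |s| "<match><content><![CDATA[#{s}]]></content></match>" }
    docs << "<document><title>#{hit[:title]}</title><url>#{hit[:url]}</url>#{matches.join}</document>"
  end

  "<documents>#{docs.join("\n")}</documents>"
end

# If your search only returns full documents, slice a window around each keyword match:
def snippet_around(text, keyword, window: 300)
  idx = text.downcase.index(keyword.downcase)
  return nil unless idx
  text[[idx - window / 2, 0].max, window]
end
```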
Caveats
- Keyword search misses semantic matches. A query for "vacation policy" won't find "PTO guidelines" unless your synonyms / analyzers handle it (see the analyzer sketch after this list).
- Rewriting natural language into keywords is lossy. Long questions with multiple intents may collapse into a few terms that lose nuance.
- You're adding LLM latency before the search even runs. Budget for it (~200-500ms with a small model on Ollama).
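For the synonym gap, a synonym filter in your index settings often covers the common cases. A sketch for Elasticsearch/OpenSearch, expressed as the Ruby hash you would pass on index creation (`SearchClient`, the index name, and the synonym entries are illustrative assumptions):

```ruby
# Maps "PTO" / "paid time off" onto "vacation" at analysis time,
# so a search for "vacation policy" also matches "PTO guidelines".
SearchClient.indices.create(
  index: 'documents',
  body: {
    settings: {
      analysis: {
        filter: {
          domain_synonyms: {
            type: 'synonym',
            synonyms: ['pto, paid time off => vacation']
          }
        },
        analyzer: {
          content_analyzer: {
            type: 'custom',
            tokenizer: 'standard',
            filter: %w[lowercase domain_synonyms]
          }
        }
      }
    },
    mappings: {
      properties: {
        content: { type: 'text', analyzer: 'content_analyzer' }
      }
    }
  }
)
```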
When to reach for vectors anyway
You need semantic vector embeddings if:
- Your data is unstructured prose with rich semantic relationships (research papers, support tickets) and synonyms matter heavily.
- Users phrase the same intent in many different ways, and your synonym dictionary can't keep up.
Otherwise, try this classic-search path first. It's cheaper to build, faster to query, and reuses everything you already have.
If recall on synonyms is the only weakness, a hybrid search approach - combining results from a classical and a vector search engine - gets you the best of both at the cost of roughly twice the complexity. See "An introduction to Hybrid search" for the trade-offs, or "How to add RAG to your Rails application" to get started on the vector branch.