Hybrid search runs a vector query and a keyword query in parallel against the same documents and merges the two ranked result lists into one. Reach for it once either approach alone hits its limits.
Vector search
Vector search converts documents and queries into embeddings (numeric vectors that place semantically similar text close together) and retrieves the closest matches with a k-Nearest-Neighbors (kNN) lookup. It is strong on synonyms, paraphrasing, and multilingual matches, where the user's wording differs from the document's. It's weak on rare words and expressions, unique names, codes, model numbers, and exact phrases. Those tend to be averaged out during the embedding process.
+-- vector indexing -------------------------------------------------+
| document page -> embeddings -> vector index                        |
+--------------------------------------------------------------------+
+-- vector retrieval ------------------------------------------------+
| user query -> query embedding -> kNN search -> ranked pages        |
+--------------------------------------------------------------------+
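As a sketch with the opensearch-ruby gem, the retrieval side of that branch could look like the snippet below. The pages index name, the embedding field, and the embed_query helper are assumptions; adjust them to your own mapping and embedding provider.

# Given an OpenSearch client (e.g. from the opensearch-ruby gem) and a query
# embedding with the same dimensionality as the indexed page vectors.
def vector_search(client, query_embedding, size: 10)
  response = client.search(
    index: "pages",
    body: {
      size: size,
      query: {
        knn: {
          embedding: {                # dense vector field of the k-NN index
            vector: query_embedding,
            k: size
          }
        }
      }
    }
  )
  response["hits"]["hits"]            # ranked pages, best match first
end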
See: How to add RAG to your Rails application
Full-text keyword search (BM25)
Full-text search splits each document into tokens (words or word stems) and builds an inverted index that maps every token back to the documents that contain it. BM25 is the default scoring algorithm on top of that index in OpenSearch and most other full-text engines. It is strong on exact terms, unique names, abbreviations, and queries where the user already knows the right vocabulary. It's weak on synonyms, paraphrasing, and natural-language questions, unless your tokenizer and analyzer chain compensates for it.
+-- full-text indexing ----------------------------------------------+
| document page -> tokenize -> inverted index                        |
+--------------------------------------------------------------------+
+-- full-text retrieval ---------------------------------------------+
| user query -> LLM keyword rewrite -> BM25 search -> ranked pages   |
+--------------------------------------------------------------------+
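The keyword branch is a plain match query; BM25 is already the default similarity for text fields, so no extra scoring configuration is needed. Again a sketch, assuming the same pages index and a content text field:

def fulltext_search(client, keywords, size: 10)
  response = client.search(
    index: "pages",
    body: {
      size: size,
      query: {
        match: {
          content: { query: keywords }   # e.g. "parental leave department", not the raw question
        }
      }
    }
  )
  response["hits"]["hits"]               # ranked pages with BM25 scores
end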
See: You don't need a vector database to build a RAG system
Hybrid search
The two branches have complementary failure modes. A document containing the exact policy ID the user typed will rank high in BM25 and nowhere in vector search. A document that paraphrases the user's question in different vocabulary will rank high in vector search and nowhere in BM25. Combining both approaches has the potential to improve recall (catch more of the relevant documents) without sacrificing precision (keep irrelevant ones out of the result list).
The basic idea for hybrid search is simple: Indexing runs both pipelines in parallel against the same source documents. Retrieval runs both queries against the user's input, produces two ranked lists, and merges them into one. And if you can't have enough complexity in your codebase, you are free to add a third and fourth pipeline as well.
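At the top level, retrieval then boils down to firing both branches concurrently and handing the two hit lists to the fusion step. A sketch, reusing the vector_search and fulltext_search helpers from above; embed_query, extract_keywords, and fuse are hypothetical stand-ins for your embedding call, the keyword rewrite, and the fusion described below.

def hybrid_search(client, user_query)
  vector_hits = fulltext_hits = nil

  # Run both branches in parallel; each returns an independently ranked list of pages.
  [
    Thread.new { vector_hits = vector_search(client, embed_query(user_query)) },
    Thread.new { fulltext_hits = fulltext_search(client, extract_keywords(user_query)) }
  ].each(&:join)

  fuse(vector_hits, fulltext_hits)   # score fusion + rank fusion, see below
end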
Indexing flow
                       documents
                           |
                document pages (chunks)
                           |
           +---------------+---------------+
           |                               |
+----------------------+        +----------------------+
|   vector indexing    |        |  full-text indexing  |
+----------------------+        +----------------------+
           |                               |
      vector index                   keyword index
Retrieval flow
                       user query
                           |
           +---------------+---------------+
           |                               |
+----------------------+       +-----------------------+
|   vector retrieval   |       |  full-text retrieval  |
+----------------------+       +-----------------------+
           |                               |
           +---------------+---------------+
                           |
                      score fusion
                           |
                      rank fusion
                           |
                     ranked results
The interesting work happens at the fusion stage. Each branch hands you an independently ranked list of pages, and you somehow have to combine them into one final result.
Result ranking flow (fusion)
In our example there is a hierarchy in the index: documents are first split into pages, which are indexed individually. Both branches return ranked pages, but we want a ranked list of documents with their best pages attached. So we first aggregate results by their corresponding document, then rank the pages within each document.
Score fusion decides the top-level document ranking. We normalize each branch's hits to [0, 1] independently using min-max scaling, aggregate per document by taking the highest normalized page score in each branch, then combine the two per-document scores with a weighted mean (e.g. 0.7 * BM25_score + 0.3 * vector_score). The combined score sets the document's position in the result list. Normalization matters here because BM25 produces unbounded positive scores (commonly between 5 and 25, depending on the corpus) while cosine similarity (the standard similarity measure for vector search) sits in [0, 1] or [-1, 1] depending on whether negative values are clamped. Without normalizing first, BM25 would dominate the weighted mean purely by magnitude.
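A plain-Ruby sketch of that document-level score fusion; the hit shape with :doc, :page, and :score keys is an assumption about how you collect the branch results.

WEIGHTS = { bm25: 0.7, vector: 0.3 }

# hits_per_branch: { bm25:   [{ doc: "doc-B", page: 1, score: 15.2 }, ...],
#                    vector: [{ doc: "doc-A", page: 3, score: 0.92 }, ...] }
def document_scores(hits_per_branch)
  scores = Hash.new(0.0)

  hits_per_branch.each do |branch, hits|
    min, max = hits.map { |hit| hit[:score] }.minmax
    range = max - min

    hits.group_by { |hit| hit[:doc] }.each do |doc, doc_hits|
      # Min-max normalize to [0, 1] and keep the best page per document in this branch.
      best = doc_hits.map { |hit| range.zero? ? 1.0 : (hit[:score] - min) / range }.max
      scores[doc] += WEIGHTS[branch] * best
    end
  end

  scores.sort_by { |_doc, score| -score }
end

Fed with the raw scores from the worked example further down, this returns doc-B (0.93), doc-A (0.76), doc-D (0.17), doc-C (0.10).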
Rank fusion then orders the pages inside each document. Reciprocal Rank Fusion (RRF) merges the rank positions across branches...
score(page) = sum over branches of 1 / (60 + rank_in_branch)
...with 60 being the standard literature default Show archive.org snapshot . A page ranked 1 in one branch and 5 in the other beats a page ranked 3 in both. RRF is well suited here because the absolute scores already did their job at the document level. For the page ordering inside a document, only the relative rank in each branch matters.
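The same as a short sketch; page identifiers are assumed to be globally unique (e.g. "doc-A/3") and the per-branch arrays list them in branch rank order, best hit first.

RRF_K = 60   # the standard literature default from the formula above

# ranked_pages_per_branch: { vector: ["doc-A/3", "doc-B/1", ...],
#                            bm25:   ["doc-B/1", "doc-A/7", ...] }
def rrf_scores(ranked_pages_per_branch)
  scores = Hash.new(0.0)

  ranked_pages_per_branch.each_value do |pages|
    pages.each_with_index do |page, index|
      scores[page] += 1.0 / (RRF_K + index + 1)   # ranks are 1-based
    end
  end

  scores
end

# Within one document, order its pages by RRF score (highest first).
def order_pages(pages_of_document, scores)
  pages_of_document.sort_by { |page| -scores[page] }
end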
In practice, the whole process could look something like this:
vector branch                        full-text branch
raw kNN distance -> normalized       raw BM25 -> normalized
------------------------------       ------------------------------
1. doc-A page 3   0.92 -> 1.00       1. doc-B page 1   15.2 -> 1.00
2. doc-B page 1   0.81 -> 0.78       2. doc-A page 7   12.1 -> 0.66
3. doc-A page 7   0.65 -> 0.46       3. doc-A page 3   10.5 -> 0.48
4. doc-C page 2   0.58 -> 0.32       4. doc-D page 4    8.3 -> 0.24
5. doc-D page 4   0.42 -> 0.00       5. doc-C page 2    6.1 -> 0.00
              |                                    |
              +---------- score fusion -----------+
                  (weighted mean per document)
                         1. doc-B 0.93
                         2. doc-A 0.76
                         3. doc-D 0.17
                         4. doc-C 0.10
                               |
                       rank fusion (RRF)
                               v
                         1. doc-B page 1
                         2. doc-A page 3, page 7
                         3. doc-D page 4
                         4. doc-C page 2
More things to keep in mind
Fine-tune the score fusion weights
In the example above we weighted the results from the BM25 branch with 0.7 and the vector branch with 0.3. Treat these weights as knobs for tuning overall retrieval quality: the best balance depends heavily on the embedding model, the indexed documents, and the type of queries the majority of your users issue.
Apply a minimum-similarity cutoff to the vector branch
A kNN search always returns the requested number of neighbors, even when nothing in the corpus is actually relevant to the query. If the closest match has a cosine similarity of, say, 0.3, it should not influence the ranking. Drop vector hits below a threshold before feeding them into fusion. The right cutoff is corpus- and model-dependent (often somewhere between 0.5 and 0.7 for cosine similarity), so it is again something to tune against your real indexed data.
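A sketch of such a cutoff; the threshold is an assumption you should calibrate against real queries, and it presumes the hit score already is (or has been converted to) a cosine similarity.

MIN_COSINE_SIMILARITY = 0.6   # tune against your own corpus and embedding model

def relevant_vector_hits(vector_hits)
  # Drop near-misses that kNN only returned because it always returns k neighbors.
  vector_hits.select { |hit| hit["_score"] >= MIN_COSINE_SIMILARITY }
end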
Pagination makes everything harder
Min-max normalization depends on the candidate pool. If the pool changes between page 1 and page 2 (because you only fetched results for the current page), the same document gets different normalized scores on different pages. Fetch a fixed candidate pool large enough to cover the deepest page your UI allows (e.g. if pagination tops out at 10 pages of 20 results, fetch at least 200 candidates from each branch), normalize over that fixed pool, then slice out the current page.
If your UI allows unbounded pagination, there is no fixed pool to normalize over. In practice, cap the pool anyway. Even general-purpose search engines do this (Google's results cap out somewhere around the 1000th hit) and users almost never paginate that deep. If you genuinely cannot cap, drop min-max altogether and use a rank-based fusion like RRF at the document level too, since RRF does not depend on score scales or pool boundaries.
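A sketch of the fixed-pool approach on top of the helpers from above; pool and page sizes are assumptions derived from the deepest page you allow.

CANDIDATE_POOL_SIZE = 200   # deepest page (10) * page size (20)
PAGE_SIZE = 20

def hybrid_search_page(client, user_query, page:)
  # Always fetch the same fixed candidate pool, regardless of the requested page, ...
  vector_hits = vector_search(client, embed_query(user_query), size: CANDIDATE_POOL_SIZE)
  fulltext_hits = fulltext_search(client, extract_keywords(user_query), size: CANDIDATE_POOL_SIZE)

  # ... so that normalization and fusion see the identical pool for every page, ...
  ranked = fuse(vector_hits, fulltext_hits)

  # ... and only the slicing depends on the page number.
  ranked.slice((page - 1) * PAGE_SIZE, PAGE_SIZE) || []
end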
Different branches need different query shapes
Each branch wants a different shape of input. The vector branch performs better with the natural-language query ("Which department handles parental leave requests?"). BM25 performs better with extracted keywords ("parental leave department"), unless you have a very elaborate analyzer (stop-word filter, stemming, language-aware tokenization, etc.).
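We use an LLM rewrite for the keyword shape (see the full-text retrieval diagram above). Purely to illustrate the two shapes, here is a deliberately naive fallback that works without any LLM; the stop-word list is made up for the example and embed_query is the hypothetical embedding helper from earlier.

STOP_WORDS = %w[a an the which who what how do does is are for to of and or].freeze

# Naive keyword extraction as a stand-in for the LLM rewrite.
def extract_keywords(user_query)
  user_query.downcase
            .gsub(/[[:punct:]]/, "")
            .split
            .reject { |word| STOP_WORDS.include?(word) }
            .join(" ")
end

user_query = "Which department handles parental leave requests?"
embed_query(user_query)        # vector branch: the full natural-language question
extract_keywords(user_query)   # => "department handles parental leave requests"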
Document and query embeddings need different instructions
Some embedding models like Qwen3-Embedding Show archive.org snapshot yield better results if you use different instructions (specific prefixes) during indexing and retrieval. Always read the model card of every model you use; there are always small subtleties to be aware of.
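In code that usually means two thin wrappers around one embedding call. The prefix strings below are placeholders (copy the real ones from the model card of the model you deploy), and embed stands in for whatever client call returns the vector.

# Placeholder instructions; the exact strings are model-specific.
QUERY_INSTRUCTION    = "Instruct: Given a search query, retrieve relevant passages\nQuery: "
DOCUMENT_INSTRUCTION = ""   # many models want documents embedded without any prefix

def embed_query(text)
  embed(QUERY_INSTRUCTION + text)      # used at retrieval time
end

def embed_document(text)
  embed(DOCUMENT_INSTRUCTION + text)   # used at indexing time
end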
Computing embeddings adds extra latency to the retrieval process
Computing the query embedding adds a network round-trip and a few hundred milliseconds to every search. Pagination, page refreshes, and back-button navigation re-issue the same query, so caching query embeddings for an hour or so is a good default.
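In a Rails app this can be a small wrapper around Rails.cache; the cache key and TTL below are assumptions.

require "digest"

def cached_query_embedding(user_query)
  cache_key = ["query-embedding", Digest::SHA256.hexdigest(user_query)]

  Rails.cache.fetch(cache_key, expires_in: 1.hour) do
    embed_query(user_query)   # the expensive network call only runs on a cache miss
  end
end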
Keep your LLM connection in one place
OpenSearch comes with a whole AI ecosystem Show archive.org snapshot with custom adapters and ingest pipelines that can call your embedding model directly from inside the cluster. We tried that route and reverted it. If your application already has a direct connection to the LLM provider, computing embeddings application-side and shipping the vectors to OpenSearch is much simpler and more robust. This way, you can also mock the LLM calls in local and CI tests.
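Indexing then stays a plain document write: the application computes the embedding (see embed_document above) and ships the vector together with the text. Index and field names are the same assumptions as in the earlier sketches.

def index_page(client, document_id, page_number, text)
  client.index(
    index: "pages",
    id: "#{document_id}/#{page_number}",
    body: {
      document_id: document_id,
      page: page_number,
      content: text,                    # feeds the BM25 branch
      embedding: embed_document(text)   # precomputed application-side, feeds the kNN branch
    }
  )
end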
See also
- How to add RAG to your Rails application: vector search end-to-end.
- You don't need a vector database to build a RAG system: keyword RAG end-to-end.
- OpenSearch hybrid search Show archive.org snapshot : OpenSearch's guideline on this topic.