IRISBm25Retriever¶
IRISBm25Retriever performs keyword-based search using the Okapi BM25 probabilistic ranking algorithm, computed in-memory over the (optionally filtered) document set loaded from IRIS.
How it works¶
BM25 is the ranking function behind most traditional search engines. It scores documents by how often query terms appear in them, with two adjustments:
- Term frequency saturation — a term appearing 100 times is not 100× more relevant than one appearing once. The
k1parameter controls this saturation. - Length normalization — a 10-word document where your term appears twice is more relevant than a 1000-word document with the same term twice. The
bparameter controls this.
The BM25 index is built from scratch on every run() call to always reflect the current state of the document store.
Algorithm (Okapi BM25)¶
For each document \(d\) and query \(q\) with terms \(q_1 \ldots q_n\):
Where:
- \(f(q_i, d)\) = frequency of term \(q_i\) in document \(d\)
- \(|d|\) = document length in tokens
- \(\text{avgdl}\) = average document length across the corpus
- \(k_1\) = term frequency saturation parameter (default: 1.5)
- \(b\) = length normalization parameter (default: 0.75)
Basic usage¶
Standalone¶
from intersystems_iris_haystack.document_stores import IRISDocumentStore
from intersystems_iris_haystack.components.retrievers import IRISBm25Retriever
store = IRISDocumentStore(embedding_dim=384)
retriever = IRISBm25Retriever(document_store=store, top_k=5)
result = retriever.run(query="multimodel database SQL JSON")
for doc in result["documents"]:
print(f"[{doc.score:.4f}] {doc.content}")
In a pipeline¶
from haystack import Pipeline
from intersystems_iris_haystack.components.retrievers import IRISBm25Retriever
pipeline = Pipeline()
pipeline.add_component(
"retriever",
IRISBm25Retriever(document_store=store, top_k=5),
)
result = pipeline.run({"retriever": {"query": "vector search embeddings"}})
for doc in result["retriever"]["documents"]:
print(f"[{doc.score:.4f}] {doc.content[:60]}...")
Parameters¶
__init__ parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
document_store | IRISDocumentStore | required | The store instance to query |
top_k | int | 10 | Maximum number of documents to return |
filters | dict \| None | None | Pre-filters applied before BM25 ranking. Reduces the candidate set. |
filter_policy | FilterPolicy \| str | FilterPolicy.REPLACE | How init-time and runtime filters interact |
run() parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
query | str | required | Natural language search text |
filters | dict \| None | None | Runtime filter |
top_k | int \| None | None | Override the top_k set at initialization |
Output¶
Each returned Document has its score field set to the BM25 score. Documents with a score of 0 (no query term matched) are excluded.
BM25 parameters¶
The k1 and b parameters are set on the DocumentStore, not the retriever:
store = IRISDocumentStore(
embedding_dim=384,
bm25_k1=1.2, # lower = faster saturation (good for short docs)
bm25_b=0.75, # 0.0 = no length normalization, 1.0 = full
)
Tuning guidelines¶
| Scenario | Recommended settings |
|---|---|
| Short documents (tweets, titles) | k1=1.2, b=0.3 — less length normalization |
| Long documents (articles, books) | k1=1.5, b=0.75 — standard |
| Very long documents with many repetitions | k1=2.0, b=0.75 — higher saturation |
| Uniform document lengths | k1=1.5, b=0.0 — disable length normalization |
FilterPolicy¶
Same as IRISEmbeddingRetriever — see the FilterPolicy section.
When filters are provided, the retriever first calls filter_documents(filters) to narrow the candidate set, then builds the BM25 index over those candidates only:
# Only index and rank documents matching the filter
retriever = IRISBm25Retriever(
document_store=store,
top_k=5,
filters={"language": "en"},
filter_policy=FilterPolicy.REPLACE,
)
result = retriever.run(query="machine learning", filters={"category": "ai"})
# Effective filter: {"category": "ai"} (REPLACE)
# BM25 built over those documents only
Tokenization¶
The BM25 tokenizer uses a simple regex that extracts alphanumeric tokens and normalizes to lowercase. It supports accented characters (UTF-8):
# Input: "InterSystems IRIS is a high-performance, multimodel database."
# Tokens: ["intersystems", "iris", "is", "a", "high", "performance", "multimodel", "database"]
This tokenizer does not perform stemming or stop-word removal. The query "database" will match documents containing "database" but not "databases".
For multilingual content
The tokenizer works for any UTF-8 text including Portuguese, Spanish, French, German, and other Latin-script languages. Non-Latin scripts (Chinese, Japanese, Arabic) are not tokenized correctly — consider using the embedding retriever for multilingual content.
Comparing BM25 and embedding retrieval¶
| Aspect | BM25 (IRISBm25Retriever) | Embedding (IRISEmbeddingRetriever) |
|---|---|---|
| Match type | Exact keyword overlap | Semantic meaning |
| Model required | No | Yes (embedding model) |
| Cost | Free — computed in Python | Embedding inference cost |
| Strength | Exact product names, codes, IDs | Synonyms, paraphrases, cross-lingual |
| Weakness | Misses synonyms | Misses exact rare terms |
| Best for | Technical docs, code, logs | General text, Q&A, multilingual |
Hybrid retrieval¶
For the best of both worlds, run both retrievers and merge their results:
from haystack.components.rankers import TransformersSimilarityRanker
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="..."))
pipeline.add_component("embedding_retriever", IRISEmbeddingRetriever(store, top_k=20))
pipeline.add_component("bm25_retriever", IRISBm25Retriever(store, top_k=20))
pipeline.add_component("ranker", TransformersSimilarityRanker(top_k=5))
pipeline.connect("text_embedder.embedding", "embedding_retriever.query_embedding")
pipeline.connect("embedding_retriever.documents", "ranker.documents")
pipeline.connect("bm25_retriever.documents", "ranker.documents")
result = pipeline.run({
"text_embedder": {"text": "my query"},
"bm25_retriever": {"query": "my query"},
"ranker": {"query": "my query"},
})
Performance notes¶
The BM25 index is rebuilt from scratch on every run() call. For a store with:
- < 10k documents — rebuilding takes < 50 ms, imperceptible
- 10k–100k documents — rebuilding takes 0.5–5 s, noticeable
- > 100k documents — rebuilding takes > 5 s, consider caching
To reduce rebuild time for large collections, always use filters to narrow the candidate set before ranking: