Architecture¶
This page explains the internal design of iris-haystack: how components are organized, why certain decisions were made, and how data flows through the system.
Component map¶
iris-haystack/
└── src/intersystems_iris_haystack/
    ├── document_stores/
    │   ├── __init__.py          ← public re-exports
    │   └── document_store.py    ← IRISDocumentStore + _BM25Index
    └── components/
        └── retrievers/
            ├── __init__.py      ← public re-exports
            ├── embedding_retriever.py
            └── bm25_retriever.py
The haystack_integrations namespace is shared across all Haystack integrations. Any integration installed in the same virtualenv contributes its modules under this namespace — it is a PEP 420 implicit namespace package.
Data flow¶
Indexing¶
sequenceDiagram
participant App
participant Embedder
participant DocumentWriter
participant IRISDocumentStore
participant IRIS
App->>Embedder: List[Document] (no embedding)
Embedder-->>DocumentWriter: List[Document] (with embedding: list[float])
DocumentWriter->>IRISDocumentStore: write_documents(docs, policy)
loop For each document
IRISDocumentStore->>IRISDocumentStore: json.dumps(meta, sort_keys=True)
IRISDocumentStore->>IRISDocumentStore: _embedding_to_str([v1,...,vN]) → "[v1,...,vN]"
IRISDocumentStore->>IRIS: INSERT ... TO_VECTOR(?, DOUBLE)
end
IRIS-->>IRISDocumentStore: rowcount
IRISDocumentStore-->>App: int (documents written)
Semantic retrieval¶
sequenceDiagram
participant App
participant TextEmbedder
participant IRISEmbeddingRetriever
participant IRISDocumentStore
participant IRIS
App->>TextEmbedder: query string
TextEmbedder-->>IRISEmbeddingRetriever: query_embedding: list[float]
IRISEmbeddingRetriever->>IRISDocumentStore: _embedding_retrieval(query_embedding, top_k, filters)
IRISDocumentStore->>IRISDocumentStore: _embedding_to_str(query_embedding)
IRISDocumentStore->>IRIS: SELECT TOP ? ... VECTOR_COSINE(embedding, TO_VECTOR(?, DOUBLE)) AS similarity ... ORDER BY similarity DESC
IRIS-->>IRISDocumentStore: rows (id, content, meta, score, similarity)
loop For each row
IRISDocumentStore->>IRISDocumentStore: _row_to_document(row)
IRISDocumentStore->>IRISDocumentStore: dataclasses.replace(doc, score=similarity)
IRISDocumentStore->>IRISDocumentStore: document_matches_filter(filters, doc) [if filters]
end
IRISDocumentStore-->>IRISEmbeddingRetriever: List[Document]
IRISEmbeddingRetriever-->>App: {"documents": List[Document]}
Key design decisions¶
1. document_matches_filter for all filtering¶
The filtering logic uses Haystack's official document_matches_filter utility instead of a custom implementation:
from haystack.utils.filters import document_matches_filter
return [d for d in docs if document_matches_filter(filters, d)]
Why: This is the same function used by InMemoryDocumentStore. Using it guarantees:
- Identical filter behaviour across all Haystack DocumentStores
- All FilterDocumentsTest mix-in cases pass without custom logic
- Future Haystack filter improvements are inherited for free
Trade-off: All documents are loaded from IRIS into Python before filtering. For write-heavy, read-light workloads on small collections this is fine. For large collections with simple filters, a SQL WHERE clause would be more efficient.
2. sort_keys=True in meta serialization¶
Why: The meta column stores JSON as a string. sort_keys=True ensures that {"b": 1, "a": 2} and {"a": 2, "b": 1} produce the same string, which:
- Makes document IDs deterministic (Haystack uses content hash for IDs)
- Enables reliable LIKE-pattern matching as a fast pre-filter
- Avoids surprises when the same dict is serialized in different insertion orders
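The effect of sort_keys=True can be checked with the standard library alone. In the sketch below, `serialize_meta` and `like_pattern` are hypothetical names chosen for illustration, not functions taken from iris-haystack. The LIKE pattern is only a coarse pre-filter: it can return false positives (for example, `"a": 2` also matches `"a": 20`), so an exact check in Python is still needed afterwards.

```python
import json


def serialize_meta(meta: dict) -> str:
    """Serialize metadata with a stable key order."""
    return json.dumps(meta, sort_keys=True)


def like_pattern(key: str, value) -> str:
    """Hypothetical helper: LIKE pattern for one top-level key/value pair."""
    # json.dumps({"a": 2}) gives '{"a": 2}'; strip the braces to get the fragment.
    fragment = json.dumps({key: value})[1:-1]
    return f"%{fragment}%"


# Two dicts with the same items but different insertion order
first = serialize_meta({"b": 1, "a": 2})
second = serialize_meta({"a": 2, "b": 1})

print(first == second)       # both serialize to the same string
print(like_pattern("a", 2))  # fragment that appears verbatim in that string
```

Because the separators and key order are fixed, the fragment built for a key/value pair always appears in the stored string in the same form, whichever order the caller used to build the dict.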
3. dataclasses.replace() for score assignment¶
Why: Document is a dataclass in Haystack 2.x. In a pipeline, the same Document object may be referenced by multiple components. Mutating doc.score = value in place generates a Haystack warning and can cause subtle bugs where one component's change affects another's view of the document. dataclasses.replace creates a new instance, preserving the original.
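The behaviour described above can be reproduced without Haystack installed. The `Doc` class below is a stand-in with only three fields and is an assumption made for illustration; the real Document has more fields.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Doc:
    """Stand-in for a Haystack Document, reduced to the fields needed here."""
    id: str
    content: str
    score: Optional[float] = None


original = Doc(id="1", content="hello")

# replace() builds a new instance; the original keeps score=None
scored = replace(original, score=0.87)

print(original.score)
print(scored.score)
```

Note that dataclasses.replace makes a shallow copy: mutable field values (such as a metadata dict) are shared between the original and the new instance, so only the replaced fields are truly independent.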
4. Haystack Secret for credentials¶
def __init__(
    self,
    *,
    connection_string: Secret = Secret.from_env_var("IRIS_CONNECTION_STRING"),
    username: Secret = Secret.from_env_var("IRIS_USERNAME"),
    password: Secret = Secret.from_env_var("IRIS_PASSWORD"),
    ...
Why:
- Secret values are never included in to_dict() output, preventing credential leakage in serialized YAML pipelines
- Secret.from_env_var defers resolution to runtime, not import time
- Follows the same pattern as MongoDBAtlasDocumentStore, PgvectorDocumentStore, and other official integrations
5. Automatic reconnection with _ensure_connection¶
def _ensure_connection(self) -> None:
    try:
        cur = self._conn.cursor()
        cur.execute("SELECT 1")
        cur.close()
    except Exception:
        logger.warning("IRIS connection lost — reconnecting...")
        self._connect_with_retry()
Why: IRIS can close idle connections after a configurable timeout. Long-running applications (e.g., a web server with an infrequently used DocumentStore) would crash with a stale connection. The lightweight SELECT 1 ping adds negligible overhead while making the store resilient to network interruptions and IRIS restarts.
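The body of _connect_with_retry is not shown on this page. One common way to implement such a helper is retry with exponential backoff, sketched below. The function name `connect_with_retry`, its parameters, and the default values are all assumptions for illustration, not the actual iris-haystack implementation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Demo with a fake connection factory that fails twice, then succeeds
calls = {"n": 0}


def flaky_connect() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("IRIS not reachable yet")
    return "connection"


delays = []  # record the delays instead of actually sleeping
conn = connect_with_retry(flaky_connect, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default time.sleep would be used.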
6. BM25 index rebuilt on every call¶
def _bm25_retrieval(self, query, *, filters=None, top_k=10):
    candidates = self.filter_documents(filters=filters)
    self._bm25.build([(d.id, d.content or "") for d in candidates])
    ...
Why: The BM25 index must always reflect the current state of the document store. Since there is no write hook to invalidate a cached index, rebuilding on demand is the simplest correct approach.
Trade-off: For large collections, this adds latency. Contributors can improve this with an event-based invalidation strategy (hooking into write_documents and delete_documents).
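The invalidation idea mentioned in the trade-off can be sketched with a dirty flag. Everything below (`CachedIndex`, `invalidate`, `get`, `rebuild_count`) is hypothetical and not part of iris-haystack; the "index" is a plain token dictionary standing in for a real BM25 structure.

```python
from typing import Callable, List, Tuple

Corpus = List[Tuple[str, str]]  # (document id, content)


class CachedIndex:
    """Rebuild an index only when the underlying corpus has changed."""

    def __init__(
        self,
        load_corpus: Callable[[], Corpus],
        build: Callable[[Corpus], object],
    ) -> None:
        self._load_corpus = load_corpus
        self._build = build
        self._index = None
        self._dirty = True  # nothing built yet
        self.rebuild_count = 0

    def invalidate(self) -> None:
        """Mark the cache stale; call this from write and delete paths."""
        self._dirty = True

    def get(self):
        """Return the cached index, rebuilding it first if it is stale."""
        if self._dirty:
            self._index = self._build(self._load_corpus())
            self._dirty = False
            self.rebuild_count += 1
        return self._index


# Demo: a dict stands in for the document table
store = {"1": "red apple", "2": "green pear"}
cache = CachedIndex(
    load_corpus=lambda: list(store.items()),
    build=lambda corpus: {doc_id: text.split() for doc_id, text in corpus},
)

cache.get()
cache.get()            # reuses the cached index, no rebuild
store["3"] = "blue plum"
cache.invalidate()     # what a write hook would do
index = cache.get()    # rebuilt once, now includes document 3
```

Two caveats apply to this approach. A cached index covers the whole collection, so filters would have to be applied after scoring rather than before building. And in-process invalidation cannot see writes made to the IRIS table by other processes, so a time-based or version-based check may still be needed.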
Table schema¶
CREATE TABLE IF NOT EXISTS SQLUser.HaystackDocuments (
    id        VARCHAR(128) NOT NULL PRIMARY KEY,  -- Haystack-generated hash
    content   LONGVARCHAR,                        -- document text
    meta      LONGVARCHAR,                        -- json.dumps(sort_keys=True)
    score     DOUBLE,                             -- source score if available
    embedding VECTOR(DOUBLE, 384)                 -- embedding_dim configurable
)
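Since the vector dimension is configurable, the DDL has to be generated rather than hard-coded, and each embedding has to be turned into the "[v1,...,vN]" string that the indexing diagram passes to TO_VECTOR(?, DOUBLE). The sketch below shows one way to do both. The function names, the validation rules, and the length check are assumptions for illustration; the actual iris-haystack code may differ.

```python
from typing import Sequence


def build_create_table_sql(embedding_dim: int = 384) -> str:
    """Return the CREATE TABLE statement for a given embedding dimension."""
    # Only a validated integer is interpolated into the SQL text.
    if isinstance(embedding_dim, bool) or not isinstance(embedding_dim, int) or embedding_dim <= 0:
        raise ValueError("embedding_dim must be a positive integer")
    return (
        "CREATE TABLE IF NOT EXISTS SQLUser.HaystackDocuments (\n"
        "    id VARCHAR(128) NOT NULL PRIMARY KEY,\n"
        "    content LONGVARCHAR,\n"
        "    meta LONGVARCHAR,\n"
        "    score DOUBLE,\n"
        f"    embedding VECTOR(DOUBLE, {embedding_dim})\n"
        ")"
    )


def embedding_to_str(embedding: Sequence[float], embedding_dim: int) -> str:
    """Serialize an embedding as '[v1,...,vN]' after checking its length."""
    if len(embedding) != embedding_dim:
        raise ValueError(
            f"expected {embedding_dim} values, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(v)) for v in embedding) + "]"


ddl = build_create_table_sql(768)
vector_literal = embedding_to_str([0.1, 2, -3.5], embedding_dim=3)
```

Checking the length in Python gives a clearer error message than a failed INSERT when an embedder with a different output size is wired into the pipeline by mistake.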
Why SQLUser schema?¶
IRIS requires a schema prefix. SQLUser is the default schema for user-defined tables in the USER namespace, and creating tables in it requires no additional permissions.
Why LONGVARCHAR for content and meta?¶
LONGVARCHAR is IRIS's variable-length text type without an upper size limit. Documents can be arbitrarily long and metadata can contain many keys — using a fixed-size VARCHAR(N) would silently truncate large documents.