Weaviate in Production — Vector Search, GraphQL API, and Hybrid Retrieval

Why Weaviate Over a Standalone ANN Index

Approximate nearest-neighbor libraries like FAISS and Annoy solve the search problem but leave everything else — persistence, replication, multi-tenancy, filtering, and schema evolution — to you. Weaviate is a purpose-built vector database that wraps HNSW-based ANN search with a full data management layer: objects are stored with properties, typed schemas enforce structure, and filters apply at query time without post-processing. The result is a system you can deploy once and scale incrementally rather than stitching together a vector index with a separate metadata store.

Weaviate's module system is a key differentiator. Rather than requiring you to generate embeddings before import, you configure a vectorizer module per collection — text2vec-openai, text2vec-cohere, text2vec-transformers for a locally-hosted model, or none to supply your own vectors — and Weaviate calls the embedding API automatically on insert and on nearText queries. This removes the embedding orchestration layer from your application code. The RAG Done Right guide covers the retrieval architecture in depth, including chunking strategies, re-ranking, and the evaluation loop — Weaviate slots in as the retriever that powers those pipelines.

Hybrid Search

BM25 keyword search and vector similarity run side by side, and their result lists are fused into one ranking (rank-based or relative-score fusion). The alpha parameter tunes the blend: 0.0 is pure BM25, 1.0 is pure vector, and 0.75 is a reasonable starting point for document retrieval.

Multi-Tenancy

Each tenant gets an isolated HNSW index and property store. Tenants can be activated and deactivated to manage memory — cold tenants are offloaded to disk and reloaded on first query, making per-customer isolation economically viable at thousands of tenants.

Named Vectors

An object can carry multiple independent vector representations — a title embedding from one model, a body embedding from another, an image embedding alongside text. Queries target a specific named vector without duplicating the object.

Installation — Docker Compose and Kubernetes Helm

The fastest path for local development is the official Docker Compose configuration. Enable the modules you plan to use at startup — modules are not hot-swappable on a running cluster.

# docker-compose.yml — Weaviate with OpenAI and BM25 hybrid search
version: "3.8"
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.4
    restart: unless-stopped
    ports:
      - "8080:8080"    # REST + GraphQL
      - "50051:50051"  # gRPC (used by Python client v4 for batch import)
    environment:
      QUERY_DEFAULTS_LIMIT: "25"
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "false"
      AUTHENTICATION_APIKEY_ENABLED: "true"
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: "${WEAVIATE_API_KEY}"
      AUTHENTICATION_APIKEY_USERS: "admin"
      AUTHORIZATION_ADMINLIST_ENABLED: "true"
      AUTHORIZATION_ADMINLIST_USERS: "admin"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
      ENABLE_MODULES: "text2vec-openai,text2vec-cohere,generative-openai,reranker-cohere"
      CLUSTER_HOSTNAME: "node1"
      OPENAI_APIKEY: "${OPENAI_API_KEY}"
    volumes:
      - weaviate_data:/var/lib/weaviate

volumes:
  weaviate_data:
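
Once the container is up, it can help to wait for readiness before running schema or import code. The sketch below polls Weaviate's readiness endpoint (/v1/.well-known/ready) using only the standard library. The helper names http_ready and wait_until_ready, the base URL, and the timeout values are illustrative assumptions, not part of any Weaviate client.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(base_url: str = "http://localhost:8080") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/.well-known/ready"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response
        return False


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A deployment script could call wait_until_ready(http_ready) and abort if it returns False. Passing the check as a callable keeps the polling loop testable without a running server.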

For Kubernetes production deployments, the official Helm chart handles replication, persistent volumes, and resource limits. Install with Helm 3:

# Add the Weaviate Helm repository
helm repo add weaviate https://weaviate.github.io/weaviate-helm
helm repo update

# weaviate-values.yaml: production configuration
# (unquoted EOF so the shell expands ${WEAVIATE_API_KEY} while writing the file)
cat > weaviate-values.yaml << EOF
replicas: 3

resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "4"
    memory: "16Gi"

storage:
  size: 100Gi
  storageClassName: "gp3"

env:
  PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
  DEFAULT_VECTORIZER_MODULE: "text2vec-openai"
  ENABLE_MODULES: "text2vec-openai,generative-openai,reranker-cohere"
  CLUSTER_GOSSIP_BIND_PORT: "7946"
  CLUSTER_DATA_BIND_PORT: "7777"
  REPLICATION_MINIMUM_FACTOR: "2"

authentication:
  anonymous_access:
    enabled: false
  apikey:
    enabled: true
    allowed_keys:
      - "${WEAVIATE_API_KEY}"
    users:
      - "admin"

authorization:
  admin_list:
    enabled: true
    users:
      - "admin"

metrics:
  enabled: true
  serviceMonitor:
    enabled: true   # requires kube-prometheus-stack CRD
EOF

helm install weaviate weaviate/weaviate \
  --namespace weaviate \
  --create-namespace \
  --values weaviate-values.yaml \
  --set env.OPENAI_APIKEY="${OPENAI_API_KEY}"

Note

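Weaviate uses a gossip protocol for cluster membership, so all nodes must be able to reach each other on port 7946. In Kubernetes, the Helm chart creates a headless service that enables pod-to-pod DNS resolution across the StatefulSet. REPLICATION_MINIMUM_FACTOR sets the lowest replication factor a collection may be created with; it does not by itself block writes during a partial cluster failure. Whether a write succeeds while a node is down depends on the consistency level of the request (for example QUORUM), so choose a minimum factor high enough that your consistency level can still reach its required replicas with one node missing.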

Schema Design — Collections, Properties, and Vectorizer Configuration

A Weaviate collection (called a class in the v3 API) defines the property schema, the vectorizer module, and the HNSW index configuration. Schema design decisions are largely permanent — changing the vectorizer requires a full re-import because the stored vectors are incompatible across embedding models.

# Python client v4 — install
pip install "weaviate-client>=4.0.0"

import weaviate
import weaviate.classes as wvc
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    grpc_port=50051,
    auth_credentials=weaviate.auth.AuthApiKey(api_key="your-api-key"),
    headers={"X-OpenAI-Api-Key": "sk-..."},
)

# --- Define a Document collection with text2vec-openai vectorization ---
client.collections.create(
    name="Document",
    description="Enterprise knowledge base documents",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",   # 1536-dim by default; OpenAI quotes roughly 62,500 pages per dollar
        dimensions=512,                    # Matryoshka reduction — lower cost, slight quality trade-off
        vectorize_collection_name=False,   # don't include class name in embedding
    ),
    generative_config=Configure.Generative.openai(
        model="gpt-4o-mini",              # for RAG generation after retrieval
    ),
    reranker_config=Configure.Reranker.cohere(
        model="rerank-english-v3.0",
    ),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,   # BM25 tokenizer for hybrid search
            vectorize_property_name=False,
            skip_vectorization=False,              # include title in vector input
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
            skip_vectorization=False,
        ),
        Property(
            name="source",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,       # exact match only — URLs, IDs
            skip_vectorization=True,               # don't include source URL in vector
        ),
        Property(
            name="department",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
            skip_vectorization=True,
        ),
        Property(
            name="published_at",
            data_type=DataType.DATE,
            skip_vectorization=True,
        ),
        Property(
            name="chunk_index",
            data_type=DataType.INT,
            skip_vectorization=True,
        ),
    ],
    vector_index_config=Configure.VectorIndex.hnsw(
        ef_construction=128,   # build quality: higher = slower build, better recall
        max_connections=64,    # HNSW M parameter: neighbors per node
        ef=64,                 # query-time beam width; mutable later via a collection config update
        distance_metric=wvc.config.VectorDistances.COSINE,
    ),
    inverted_index_config=Configure.inverted_index(
        bm25_b=0.75,           # document length normalization (0=no norm, 1=full norm)
        bm25_k1=1.2,           # term frequency saturation
        index_timestamps=True, # enables filtering by creation/update time
        index_null_state=True, # enables filtering for null property values
    ),
)
print("Collection 'Document' created.")
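
The dimensions=512 setting above matters mostly for memory, since HNSW keeps vectors in RAM. Here is a back-of-envelope sketch, assuming float32 vectors and a rough 2x multiplier for graph and cache overhead. The multiplier is a rule-of-thumb assumption rather than a measured figure, and the function name is hypothetical.

```python
def estimate_vector_memory_gib(
    num_objects: int,
    dimensions: int,
    overhead_factor: float = 2.0,  # assumed rule of thumb, not a measurement
) -> float:
    """Rough RAM estimate in GiB for an in-memory float32 vector index."""
    bytes_per_float32 = 4
    raw_bytes = num_objects * dimensions * bytes_per_float32
    return raw_bytes * overhead_factor / 1024**3


# Compare the reduced 512-dim setting against the model's default 1536 dims
reduced = estimate_vector_memory_gib(1_000_000, 512)
full = estimate_vector_memory_gib(1_000_000, 1536)
print(f"512-dim: {reduced:.2f} GiB, 1536-dim: {full:.2f} GiB")
```

Under these assumptions the estimate scales linearly with dimensions, so the reduced setting needs about a third of the memory. Treat the result as a starting point for sizing and confirm it against real memory metrics.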

Batch Import — Fixed-Size and Dynamic Strategies

Weaviate's Python client v4 ships a batching context manager that handles retries, error collection, and throughput tuning. For most production imports, the dynamic strategy adjusts batch size automatically based on server response latency, which prevents overloading the vectorizer API during concurrent requests. The vector databases comparison article benchmarks Weaviate import throughput against Qdrant and pgvector for datasets in the 10M–100M range — the results show dynamic batching outperforms fixed sizes for heterogeneous document lengths.

import weaviate
from weaviate.classes.config import Configure
from datetime import datetime, timezone
import json

client = weaviate.connect_to_local(
    host="localhost",
    port=8080,
    grpc_port=50051,
    auth_credentials=weaviate.auth.AuthApiKey("your-api-key"),
    headers={"X-OpenAI-Api-Key": "sk-..."},
)

documents_collection = client.collections.get("Document")

# --- Dynamic batch: Weaviate adjusts batch size to maintain 100-400ms server latency ---
with documents_collection.batch.dynamic() as batch:
    for doc in load_documents():   # your document iterator
        batch.add_object(
            properties={
                "title":       doc["title"],
                "body":        doc["body"],
                "source":      doc["source"],
                "department":  doc["department"],
                "published_at": datetime.fromisoformat(doc["date"]).replace(
                    tzinfo=timezone.utc
                ).isoformat(),
                "chunk_index": doc["chunk_index"],
            },
            # Omit 'vector' to let Weaviate call text2vec-openai automatically.
            # Supply 'vector' to bypass vectorization (e.g. pre-computed embeddings):
            # vector=doc["embedding"],
        )
        if batch.number_errors > 10:
            print(f"Too many errors ({batch.number_errors}); stopping import")
            break

# Check failed objects after the context manager exits
failed = documents_collection.batch.failed_objects
if failed:
    print(f"{len(failed)} objects failed:")
    for err in failed[:5]:
        print(f"  {err.message}")

# --- Fixed-size batch: explicit control for rate-limited embedding APIs ---
with documents_collection.batch.fixed_size(batch_size=100, concurrent_requests=2) as batch:
    for doc in load_documents():
        batch.add_object(properties={...})

Note

The gRPC port (50051) is required for Python client v4. The client uses gRPC for batch imports and queries and REST for schema and other management operations, and by default it checks gRPC availability when it connects. If your firewall only exposes port 8080, expect connection errors rather than a working import. Verify that port 50051 is reachable from the import host before a production import, for example with grpcurl -plaintext localhost:50051 list (this only works if the server exposes gRPC reflection).

Hybrid Search — Blending BM25 and Vector Similarity

Hybrid search is Weaviate's most impactful feature for production RAG systems. Pure vector search struggles with proper nouns, version numbers, and exact identifiers: a query for "version 3.12.1 changelog" may retrieve semantically similar documents that don't contain the exact version string. Pure BM25 misses paraphrases and synonyms. Hybrid search runs both and fuses the two result lists, with alpha weighting the vector side and (1 - alpha) weighting the keyword side. Weaviate offers two fusion algorithms. Ranked fusion scores each object from its position in each list, in the style of reciprocal rank fusion. Relative score fusion normalizes each list's raw scores to a common range and then computes score = alpha * vector_score + (1 - alpha) * bm25_score on the normalized values. The examples below use relative score fusion.
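
To make the alpha weighting concrete, here is a minimal pure-Python sketch of two fusion styles: a rank-based one in the spirit of reciprocal rank fusion, and a score-based one that min-max normalizes each list before applying alpha. It is an illustration, not Weaviate's implementation. The function names and the rank constant k=60 are assumptions.

```python
from typing import Dict, List


def ranked_fusion(
    vector_ids: List[str], bm25_ids: List[str], alpha: float, k: int = 60
) -> Dict[str, float]:
    """Rank-based fusion: each list contributes weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for weight, ids in ((alpha, vector_ids), (1.0 - alpha, bm25_ids)):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return scores


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; a list with identical scores maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(
    vector_scores: Dict[str, float], bm25_scores: Dict[str, float], alpha: float
) -> Dict[str, float]:
    """Score-based fusion: alpha * vector + (1 - alpha) * bm25 on normalized scores."""
    vec = _min_max(vector_scores)   # higher = more similar
    kw = _min_max(bm25_scores)      # higher = better keyword match
    return {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1.0 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }


fused = relative_score_fusion(
    {"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "d": 4.0}, alpha=0.75
)
for doc_id, score in sorted(fused.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 3))
```

An object that appears in only one list contributes nothing from the other side. This is why a document with an exact keyword hit but a weak embedding match can still surface when alpha is below 1.0.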

from weaviate.classes.query import MetadataQuery, HybridFusion, Filter, Sort
import weaviate.classes as wvc

documents = client.collections.get("Document")

# --- Hybrid search with alpha=0.75 (75% vector, 25% BM25) ---
results = documents.query.hybrid(
    query="kubernetes cost optimization spot instances",
    alpha=0.75,
    fusion_type=HybridFusion.RELATIVE_SCORE,  # normalize scores before fusion (recommended)
    limit=10,
    return_metadata=MetadataQuery(score=True, explain_score=True),
    return_properties=["title", "body", "source", "department"],
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"  Explain: {obj.metadata.explain_score}")
    print(f"  Title:   {obj.properties['title']}")
    print()

# --- Hybrid with property weighting: boost title matches over body ---
results_weighted = documents.query.hybrid(
    query="data pipeline testing",
    alpha=0.7,
    query_properties=["title^3", "body"],   # BM25 weights: title 3x body
    limit=5,
)

# --- nearText: pure vector search with certainty threshold ---
results_vec = documents.query.near_text(
    query="how to reduce Elasticsearch memory usage",
    certainty=0.75,        # minimum certainty (0.0–1.0); for cosine, certainty = (1 + cosine similarity) / 2
    limit=10,
    filters=Filter.by_property("department").equal("engineering"),
    return_metadata=MetadataQuery(certainty=True, distance=True),
)

# --- nearVector: search with a pre-computed embedding ---
my_embedding = get_embedding("custom query text")   # your embedding function
results_near_vec = documents.query.near_vector(
    near_vector=my_embedding,
    certainty=0.80,
    limit=5,
)

# --- BM25-only keyword search (alpha=0) ---
results_bm25 = documents.query.bm25(
    query="terraform state locking",
    limit=10,
    return_metadata=MetadataQuery(score=True),
)
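
The certainty thresholds used above are easy to misread as raw cosine similarity. For the cosine metric, distance is 1 minus the cosine similarity and certainty rescales that distance into the 0 to 1 range. A small sketch of the conversions, with hypothetical helper names:

```python
def cosine_to_certainty(cosine_similarity: float) -> float:
    """Map cosine similarity in [-1, 1] to certainty in [0, 1]."""
    return (1.0 + cosine_similarity) / 2.0


def certainty_to_cosine(certainty: float) -> float:
    """Inverse mapping: certainty in [0, 1] back to cosine similarity."""
    return 2.0 * certainty - 1.0


def certainty_to_distance(certainty: float) -> float:
    """Cosine distance (1 - cosine similarity) for a given certainty."""
    return 2.0 * (1.0 - certainty)


threshold = 0.75
print(
    f"certainty {threshold} -> cosine similarity {certainty_to_cosine(threshold)}, "
    f"distance {certainty_to_distance(threshold)}"
)
```

So a certainty threshold of 0.75 admits anything with cosine similarity of 0.5 or higher, which is looser than the number suggests at first glance.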

Filtering, Sorting, and Aggregation

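Weaviate applies filters before vector search: the filter is resolved against the inverted index into an allow list, and the HNSW traversal only returns objects on that list. This avoids the recall collapse of post-filtering, where a plain ANN index such as FAISS returns its top-k first and the filter then discards most of them. Newer releases (1.27 and later) additionally offer an ACORN-based filter strategy for restrictive filters; the 1.25 image used above relies on the allow-list approach. Compound filters combine with Filter.all_of() (AND) and Filter.any_of() (OR), or with the & and | operators. The Aggregate API counts objects and computes statistics without returning individual objects, which is useful for faceted search UI panels.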

from weaviate.classes.query import Filter, Sort
from weaviate.classes.aggregate import GroupByAggregate
from datetime import datetime, timezone

documents = client.collections.get("Document")

# --- Compound filter: engineering docs published after 2025-01-01 ---
cutoff = datetime(2025, 1, 1, tzinfo=timezone.utc).isoformat()
results = documents.query.hybrid(
    query="microservices deployment patterns",
    alpha=0.7,
    limit=10,
    filters=(
        Filter.by_property("department").equal("engineering") &
        Filter.by_property("published_at").greater_than(cutoff)
    ),
)

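# --- Sorting: near_text results are ordered by vector distance and take no sort argument, ---
# --- so sort-by-property goes through fetch_objects (latest engineering docs first) ---
results_sorted = documents.query.fetch_objects(
    filters=Filter.by_property("department").equal("engineering"),
    sort=Sort.by_property("published_at", ascending=False),
    limit=20,
)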

# --- Aggregate: count documents per department ---
agg = documents.aggregate.over_all(
    group_by=GroupByAggregate(prop="department"),
    total_count=True,   # the per-group object count is all the loop below needs
)
for group in agg.groups:
    print(f"  {group.grouped_by.value}: {group.total_count} documents")

# --- Aggregate with filter: documents in the last 30 days ---
from datetime import timedelta
recent_cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
recent_count = documents.aggregate.over_all(
    filters=Filter.by_property("published_at").greater_than(recent_cutoff),
    total_count=True,
)
print(f"Documents published in last 30 days: {recent_count.total_count}")
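
The difference between filtering before and after the vector search is easiest to see on a toy dataset. The sketch below uses brute-force cosine ranking instead of HNSW, so it only illustrates the ordering of the two steps. All names and data are made up for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, Sequence[float], str]  # (id, vector, department)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def post_filter_search(
    items: List[Item], query: Sequence[float], k: int, keep: Callable[[Item], bool]
) -> List[str]:
    """Take the top-k by similarity first, then drop objects failing the filter."""
    top_k = sorted(items, key=lambda it: cosine(it[1], query), reverse=True)[:k]
    return [it[0] for it in top_k if keep(it)]


def pre_filter_search(
    items: List[Item], query: Sequence[float], k: int, keep: Callable[[Item], bool]
) -> List[str]:
    """Build the allow list first, then rank only the allowed objects."""
    allowed = [it for it in items if keep(it)]
    ranked = sorted(allowed, key=lambda it: cosine(it[1], query), reverse=True)
    return [it[0] for it in ranked[:k]]


items: List[Item] = [
    ("m1", (1.0, 0.0), "marketing"),
    ("m2", (0.9, 0.1), "marketing"),
    ("e1", (0.5, 0.5), "engineering"),
    ("e2", (0.0, 1.0), "engineering"),
]
is_engineering = lambda it: it[2] == "engineering"
query = (1.0, 0.0)

print("post-filter:", post_filter_search(items, query, 2, is_engineering))
print("pre-filter: ", pre_filter_search(items, query, 2, is_engineering))
```

With k=2 the post-filter variant finds no engineering documents because both top hits belong to marketing, while the pre-filter variant still fills the result set.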

Named Vectors — Multiple Representations per Object

Named vectors allow a single object to carry independent vector representations without duplication. A product catalog object might have a title embedding optimized for short-query matching (from a small model) and a description embedding from a larger model for semantic richness. Queries target a specific named vector by name, and hybrid search applies per-vector. If you're evaluating whether to use Weaviate or stay with pgvector for semantic search, the pgvector guide covers the trade-offs for datasets under 10M rows where PostgreSQL's ACID guarantees matter.

# Collection with named vectors (multi-representation)
client.collections.create(
    name="Product",
    description="E-commerce product catalog with title and description vectors",
    properties=[
        Property(name="title",       data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="sku",         data_type=DataType.TEXT,  skip_vectorization=True),
        Property(name="category",    data_type=DataType.TEXT,  skip_vectorization=True),
        Property(name="price",       data_type=DataType.NUMBER, skip_vectorization=True),
    ],
    # Named vectors replace a single vectorizer_config; each one carries its own index config
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="title_vector",
            source_properties=["title"],        # only vectorize the title
            model="text-embedding-3-small",
            vector_index_config=Configure.VectorIndex.hnsw(ef_construction=64, max_connections=32),
        ),
        Configure.NamedVectors.text2vec_openai(
            name="description_vector",
            source_properties=["description"],  # only vectorize the description
            model="text-embedding-3-large",     # larger model for richer body embedding
            vector_index_config=Configure.VectorIndex.hnsw(ef_construction=128, max_connections=64),
        ),
    ],
)

products = client.collections.get("Product")

# Search using the title vector (short queries, product name matching)
results = products.query.near_text(
    query="wireless noise-cancelling headphones",
    target_vector="title_vector",
    limit=10,
    return_metadata=MetadataQuery(certainty=True),
    filters=Filter.by_property("category").equal("Electronics"),
)

# Hybrid search using the description vector (long-form semantic queries)
results_hybrid = products.query.hybrid(
    query="comfortable headphones for long work-from-home sessions with good microphone",
    target_vector="description_vector",
    alpha=0.6,
    limit=10,
)

Multi-Tenancy — Isolated HNSW Indexes per Tenant

Weaviate's multi-tenancy implementation creates an independent HNSW index and property store per tenant within a single collection definition. Each tenant lives in its own shard with separate files on disk, so a query scoped to one tenant cannot read another tenant's data at the storage layer. Tenants can be activated (loaded into memory), deactivated (offloaded to disk), or deleted independently, making per-customer isolation viable for SaaS applications with thousands of tenants.

# Enable multi-tenancy on a collection at creation time
client.collections.create(
    name="CustomerDocument",
    multi_tenancy_config=Configure.multi_tenancy(
        enabled=True,
        auto_tenant_creation=False,   # require explicit tenant registration
        auto_tenant_activation=True,  # auto-activate cold tenants on first query
    ),
    properties=[
        Property(name="title",   data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="doc_id",  data_type=DataType.TEXT, skip_vectorization=True),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small",
    ),
)

coll = client.collections.get("CustomerDocument")

# --- Tenant lifecycle management ---
from weaviate.classes.tenants import Tenant, TenantActivityStatus

# Create tenants (done during customer onboarding)
coll.tenants.create([
    Tenant(name="acme-corp",    activity_status=TenantActivityStatus.ACTIVE),
    Tenant(name="globex-inc",   activity_status=TenantActivityStatus.ACTIVE),
    Tenant(name="initech-llc",  activity_status=TenantActivityStatus.INACTIVE),  # starts deactivated
])

# Insert into a specific tenant
tenant_coll = coll.with_tenant("acme-corp")
with tenant_coll.batch.dynamic() as batch:
    for doc in acme_documents:
        batch.add_object(properties={
            "title":   doc["title"],
            "content": doc["content"],
            "doc_id":  doc["id"],
        })

# Query is scoped to a single tenant — no cross-tenant results possible
results = tenant_coll.query.hybrid(
    query="quarterly revenue report",
    alpha=0.75,
    limit=5,
)

# Deactivate an idle customer tenant to free memory
coll.tenants.update([
    Tenant(name="initech-llc", activity_status=TenantActivityStatus.INACTIVE),
])

# Delete a churned customer's tenant and all its data
coll.tenants.remove(["churned-customer-id"])

# Get tenant status overview
all_tenants = coll.tenants.get()
active   = [t for t in all_tenants.values() if t.activity_status == TenantActivityStatus.ACTIVE]
inactive = [t for t in all_tenants.values() if t.activity_status == TenantActivityStatus.INACTIVE]
print(f"Active: {len(active)}, Inactive: {len(inactive)}")

Note

With auto_tenant_activation=True, a query against a cold tenant triggers an automatic load from disk before the query executes. This adds 100ms–2s of latency depending on tenant size and disk I/O. For SLA-sensitive endpoints, either keep active customers in ACTIVE state or implement a warming strategy that pre-activates tenants during off-peak hours using a scheduled job.
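
One way to approach the warming strategy is to decide tenant states from last-access timestamps in a scheduled job. The sketch below only does the planning step in plain Python. The function name, the seven-day idle threshold, and the source of the timestamps are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def plan_tenant_states(
    last_access: Dict[str, datetime],
    now: datetime,
    idle_after: timedelta = timedelta(days=7),  # assumed threshold
) -> Tuple[List[str], List[str]]:
    """Split tenants into (keep_active, deactivate) by time since last access."""
    keep_active: List[str] = []
    deactivate: List[str] = []
    for name in sorted(last_access):
        if now - last_access[name] > idle_after:
            deactivate.append(name)
        else:
            keep_active.append(name)
    return keep_active, deactivate


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
last_access = {
    "acme-corp": now - timedelta(days=1),
    "initech-llc": now - timedelta(days=30),
}
keep_active, deactivate = plan_tenant_states(last_access, now)
print("keep active:", keep_active)
print("deactivate: ", deactivate)
```

The two lists can then be turned into Tenant objects and passed to coll.tenants.update(), as in the lifecycle code above.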

Production Configuration — Replication, Auth, and Backups

Weaviate uses a leaderless replication model similar to Cassandra: writes go to all replica nodes in the consistency quorum, and reads can be served by any replica. The replication_config is set per collection and applies to all write and read operations against that collection.
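
The consistency levels map to a number of replica acknowledgements, and the usual rule from leaderless systems is that a read observes the latest write when the read and write acknowledgement counts together exceed the replication factor. A small sketch of that arithmetic, with hypothetical function names:

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge a request at a given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(
    write_level: str, read_level: str, replication_factor: int
) -> bool:
    """True when read and write replica sets are guaranteed to overlap."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


for write_level, read_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
    ok = read_sees_latest_write(write_level, read_level, replication_factor=3)
    print(f"write={write_level:6} read={read_level:6} overlap guaranteed: {ok}")
```

With a replication factor of 3, QUORUM needs two acknowledgements, so QUORUM writes paired with QUORUM reads overlap, while ONE paired with ONE does not.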

from weaviate.classes.config import Configure, Reconfigure

# --- Set replication factor at collection creation ---
client.collections.create(
    name="ReplicatedDocument",
    replication_config=Configure.replication(factor=3),
    # ... other config
)

# --- Read with consistency level ---
# In client v4 the level is bound to the collection handle, not passed per query
from weaviate.classes.config import ConsistencyLevel

docs = client.collections.get("ReplicatedDocument")

# ONE: fastest, served from any single replica
results_fast = docs.with_consistency_level(ConsistencyLevel.ONE).query.near_text(
    query="...",
    limit=5,
)

# QUORUM: a majority of replicas must agree; default for balanced workloads
results_balanced = docs.with_consistency_level(ConsistencyLevel.QUORUM).query.near_text(
    query="...",
    limit=5,
)

# ALL: all replicas must respond; use for read-your-writes after imports
results_consistent = docs.with_consistency_level(ConsistencyLevel.ALL).query.near_text(
    query="...",
    limit=5,
)

# --- Backup and restore to S3-compatible storage ---
# Requires the backup-s3 module in ENABLE_MODULES and the BACKUP_S3_BUCKET env var on the Weaviate pod/container
client.backup.create(
    backup_id="backup-2026-07-01",
    backend="s3",
    include_collections=["Document", "Product"],
    wait_for_completion=True,
)

backup_status = client.backup.get_create_status(
    backup_id="backup-2026-07-01",
    backend="s3",
)
print(f"Backup status: {backup_status.status}")

# Restore (on a fresh cluster or after data loss)
client.backup.restore(
    backup_id="backup-2026-07-01",
    backend="s3",
    wait_for_completion=True,
)

# Admin-list authorization with API keys (weaviate-values.yaml excerpt)
# Users on the admin list get full access; users on the read-only list can only read

env:
  AUTHENTICATION_APIKEY_ENABLED: "true"
  AUTHENTICATION_APIKEY_ALLOWED_KEYS: "admin-key,readonly-key,writer-key"
  AUTHENTICATION_APIKEY_USERS: "admin,readonly-user,writer-user"
  AUTHORIZATION_ADMINLIST_ENABLED: "true"
  AUTHORIZATION_ADMINLIST_USERS: "admin"
  AUTHORIZATION_ADMINLIST_READONLY_USERS: "readonly-user"

# OIDC authentication (for enterprise SSO)
  AUTHENTICATION_OIDC_ENABLED: "true"
  AUTHENTICATION_OIDC_ISSUER: "https://auth.your-company.com/realms/prod"
  AUTHENTICATION_OIDC_CLIENT_ID: "weaviate"
  AUTHENTICATION_OIDC_USERNAME_CLAIM: "email"
  AUTHENTICATION_OIDC_GROUPS_CLAIM: "groups"
  AUTHORIZATION_ADMINLIST_GROUPS: "platform-engineers"
  AUTHORIZATION_ADMINLIST_READONLY_GROUPS: "data-analysts"

Observability — Prometheus Metrics and Query Latency Monitoring

Weaviate exposes Prometheus metrics on port 2112 (/metrics) once PROMETHEUS_MONITORING_ENABLED is set to true. The most operationally important metrics are query latency percentiles per collection, batch import throughput, and vectorizer call latency. A ServiceMonitor custom resource (requires kube-prometheus-stack) scrapes these automatically in Kubernetes.

# Key Weaviate Prometheus metrics for production dashboards

# --- Query latency (99th percentile alert: > 500ms) ---
# weaviate_queries_durations_ms_bucket (label: query_type, class_name)
histogram_quantile(0.99,
  rate(weaviate_queries_durations_ms_bucket[5m])
)

# --- Vectorizer call latency (OpenAI API latency) ---
# weaviate_module_time_us_bucket (label: module_name, operation)
histogram_quantile(0.95,
  rate(weaviate_module_time_us_bucket{module_name="text2vec-openai"}[5m])
) / 1000  # convert microseconds to milliseconds

# --- Batch request rate (batches per minute; multiply by average batch size for objects) ---
# weaviate_batch_durations_ms_count counts batch requests, not individual objects
rate(weaviate_batch_durations_ms_count[1m]) * 60

# --- HNSW build latency (indicates index growth pressure) ---
# weaviate_lsm_bloom_filters_size_bytes (LSM segment growth indicator)
# weaviate_async_operations_running (label: operation=hnsw_vector_cache_prefill)

# --- Memory usage per collection ---
# weaviate_vector_index_size (label: class_name, shard_name)
sum by (class_name) (weaviate_vector_index_size)

# --- Error rate alert rule ---
# Alert when more than 1% of queries error in a 5-minute window
sum(rate(weaviate_queries_durations_ms_count{status="failed"}[5m]))
  /
sum(rate(weaviate_queries_durations_ms_count[5m]))
> 0.01
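
The percentile queries above rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reimplements that idea in plain Python to show why bucket boundaries limit precision. It is a simplified illustration with made-up bucket data, not Prometheus source code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the last finite bound
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical query-latency buckets in milliseconds
latency_buckets = [(100.0, 50.0), (250.0, 90.0), (500.0, 99.0), (math.inf, 100.0)]
print("p50:", histogram_quantile(0.50, latency_buckets))
print("p95:", round(histogram_quantile(0.95, latency_buckets), 1))
```

Because the estimate can never be finer than the bucket layout, a 500 ms alert threshold is only meaningful if a bucket boundary sits at or near 500 ms.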

Production Checklist

Choose the vectorizer model before schema creation and do not change it after import. Weaviate stores the vectors produced by the configured model — if you swap from text-embedding-3-small to text-embedding-3-large, the existing vectors are incompatible and searches will return incoherent results. Plan for a full re-import if you need to upgrade embedding models. Use a migration collection (e.g. 'DocumentV2') to import with the new model while keeping the old collection live, then cut over traffic and drop the old collection.

Set 'skip_vectorization=True' on properties that should not influence the embedding. Including source URLs, UUIDs, and internal IDs in the text sent to the embedding API degrades embedding quality — the model allocates token budget to meaningless tokens. Only vectorize properties whose semantic content should influence similarity search. Use the 'vectorize_property_name=False' setting to exclude the property name from the input text, which prevents the model from over-weighting property names relative to values.

Enable gRPC on port 50051 for all production Weaviate deployments using the Python client v4. The v4 client uses gRPC for batch imports and queries and REST for schema and management calls, so a blocked gRPC port breaks or severely degrades the data path. Verify connectivity from the import host before starting, for example with 'grpcurl -plaintext weaviate:50051 list' (this needs gRPC reflection on the server; a plain TCP check of the port is a fallback). In Docker, expose port 50051 explicitly; in Kubernetes, ensure the Service exposes both 8080 and 50051.

Use dynamic batching for mixed-length documents and fixed-size batching only when you control the rate precisely. Dynamic batching monitors server response time and adjusts batch size to maintain 100–400ms latency, so it self-tunes for embedding API rate limits and Weaviate memory pressure. Fixed-size batching at a constant rate may overload the vectorizer API during spikes or under-utilize it during slow document processing. The 'concurrent_requests' argument belongs to fixed-size batching: set 'concurrent_requests=2' there to limit parallel in-flight batches, and let the dynamic strategy manage batch size and concurrency itself.

Configure replication_factor=3 and REPLICATION_MINIMUM_FACTOR=2 for production clusters. With a replication factor of 3 and QUORUM consistency, two of three replicas are enough to acknowledge a request, so the cluster keeps accepting writes and serving reads during a single-node failure; the minimum factor of 2 prevents anyone from creating an unreplicated collection by accident. Without replication, a single node failure makes data unavailable. Use QUORUM for reads that must see preceding writes (read-after-write) and ONE for read-heavy search endpoints where eventual consistency is acceptable.

Use multi-tenancy from the start for any SaaS workload rather than separate collections per customer. Adding multi-tenancy to an existing non-multi-tenant collection requires a full re-creation and re-import. With multi-tenancy, offloading inactive tenants frees memory roughly in proportion to the share of tenants that are inactive: for a platform with 1000 customers where 900 are inactive, memory usage can drop by up to about 90% compared to keeping all tenants active, assuming tenants are of similar size.

Tune the HNSW ef parameter for your latency versus recall trade-off. The ef parameter controls beam width during HNSW graph traversal: lower ef (e.g. 32) gives faster queries at the cost of recall; higher ef (e.g. 256) improves recall at the cost of latency. ef lives in the collection's vector_index_config and is mutable, so it can be changed with a collection config update without re-importing; setting ef to -1 lets Weaviate choose a dynamic value based on the query limit. It applies to every query against the collection rather than per request. Monitor recall quality with a golden test set, and if recall@10 degrades below your threshold, increase ef.
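
A golden test set check can be as simple as comparing the IDs returned by the ANN index against exact (brute-force) nearest neighbours for the same queries. The sketch below computes that overlap in plain Python. The function names and sample IDs are hypothetical, and producing the exact results is left to whatever ground-truth process you use.

```python
from typing import Dict, List, Sequence


def ann_recall_at_k(ann_ids: Sequence[str], exact_ids: Sequence[str], k: int) -> float:
    """Fraction of the exact top-k that the ANN search also returned in its top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    hits = len(set(ann_ids[:k]) & exact_top)
    return hits / len(exact_top)


def mean_recall_at_k(
    ann_results: Dict[str, List[str]], exact_results: Dict[str, List[str]], k: int
) -> float:
    """Average recall@k over every query in the golden set."""
    recalls = [
        ann_recall_at_k(ann_results.get(query, []), exact_ids, k)
        for query, exact_ids in exact_results.items()
    ]
    return sum(recalls) / len(recalls)


exact = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
ann = {"q1": ["a", "b", "x"], "q2": ["d", "e", "f"]}
print(f"mean recall@3: {mean_recall_at_k(ann, exact, k=3):.3f}")
```

Running this after each ef change gives a concrete number to weigh against the latency percentiles from the metrics section.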

Schedule backup jobs to S3-compatible storage and test restore procedures quarterly. Configure BACKUP_S3_BUCKET, BACKUP_S3_REGION, and BACKUP_S3_ENDPOINT env vars. Run 'client.backup.create(wait_for_completion=True)' in a cron job and alert on failure. Test restore by creating a staging cluster, restoring the latest backup, and running a sample query against each critical collection. Backup time is proportional to data size — for large clusters, use include_collections to back up critical collections more frequently.

Monitor vectorizer module latency as a separate SLI from query latency. On nearText queries, Weaviate calls the embedding API (OpenAI, Cohere, etc.) to embed the query string before executing the HNSW search. If the embedding API is slow or rate-limited, query latency degrades independently of HNSW performance. Alert on weaviate_module_time_us_bucket p95 > 1000ms (1 second). Implement exponential backoff in the Weaviate configuration via the module's retries settings and use a dedicated API key with a higher rate limit tier for search queries versus import.

Validate import completeness after batch imports by comparing object counts. Use 'collection.aggregate.over_all(total_count=True)' to count objects in Weaviate and compare against your source system count. Objects that fail during a batch import (for example on a vectorizer error) are skipped rather than raising an exception; the client collects them in 'collection.batch.failed_objects' once the batch context manager exits. After large imports, check that 'len(collection.batch.failed_objects) == 0' and implement a retry pass for the failed objects using the UUIDs from the error list.
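
Retry passes are simpler when object IDs are deterministic, because re-importing a chunk then targets the same UUID instead of creating a duplicate. The sketch below derives a UUID from the source and chunk index with the standard library and diffs expected against present IDs. The namespace URL and helper names are assumptions for illustration.

```python
import uuid
from typing import Iterable, List

# Hypothetical namespace; any fixed UUID works as long as it never changes
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/document-chunks")


def chunk_uuid(source: str, chunk_index: int) -> str:
    """Deterministic UUID for a (source, chunk_index) pair."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{source}#{chunk_index}"))


def missing_ids(expected: Iterable[str], present: Iterable[str]) -> List[str]:
    """IDs that should exist after the import but were not found."""
    return sorted(set(expected) - set(present))


expected = [chunk_uuid("https://example.com/handbook", i) for i in range(3)]
present = expected[:2]  # pretend the last chunk failed to import
print("to retry:", missing_ids(expected, present))
```

The generated value can be passed as the uuid argument of batch.add_object, so a retry pass overwrites rather than duplicates, and the list of missing IDs tells you exactly which source chunks to resend.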


Running FAISS indexes stitched together with a separate Postgres metadata store, hitting OpenAI rate limits during batch import because your embedding pipeline fires individual API calls per document, or losing search quality because filtering happens post-retrieval and collapses your effective recall set?

We design and implement production Weaviate deployments — from collection schema design with vectorizer module selection and property tokenization configuration for hybrid BM25 and vector search, through Python client v4 batch import pipelines with dynamic batching and gRPC transport, hybrid search tuning with alpha parameter and RELATIVE_SCORE fusion, named vector schemas for multi-representation product and document catalogs, multi-tenancy architecture for SaaS workload isolation with tenant lifecycle automation, Kubernetes Helm deployment with replication factor and minimum quorum configuration, S3 backup scheduling and restore testing, API key and OIDC authentication setup, Prometheus metrics collection with Grafana dashboards for query latency and vectorizer call monitoring, and integration with your RAG pipeline including chunking strategy, re-ranking with Cohere Rerank, and generative search with OpenAI. Let’s talk.
