In-Memory Semantic Cache

The in-memory cache backend provides fast, local caching for development environments and single-instance deployments. It stores semantic embeddings and cached responses directly in the router's memory, so lookups avoid any network round trip to an external store.

Overview

The in-memory cache is ideal for:

Development and testing environments
Single-instance deployments
Quick prototyping and experimentation
Low-latency requirements where external dependencies should be minimized

Architecture

Configuration

Basic Configuration

# config/config.yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8       # Global default threshold
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"

Category-Level Configuration (New)

Configure cache settings per category for fine-grained control:

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8       # Global default
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"

categories:
  - name: health
    system_prompt: "You are a health expert..."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.95  # Very strict for medical accuracy
    model_scores:
      - model: your-model
        score: 0.5
        use_reasoning: false

  - name: general_chat
    system_prompt: "You are a helpful assistant..."
    semantic_cache_similarity_threshold: 0.75  # Relaxed for better hit rate
    model_scores:
      - model: your-model
        score: 0.7
        use_reasoning: false

  - name: troubleshooting
    # No cache settings - uses global default (0.8)
    model_scores:
      - model: your-model
        score: 0.7
        use_reasoning: false

Configuration Options

Parameter	Type	Default	Description
`enabled`	boolean	`false`	Enable/disable semantic caching globally
`backend_type`	string	`"memory"`	Cache backend type (must be "memory")
`similarity_threshold`	float	`0.8`	Global minimum similarity for cache hits (0.0-1.0)
`max_entries`	integer	`1000`	Maximum number of cached entries
`ttl_seconds`	integer	`3600`	Time-to-live for cache entries (seconds, 0 = no expiration)
`eviction_policy`	string	`"fifo"`	Eviction policy: `"fifo"`, `"lru"`, `"lfu"`
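
To make the role of `similarity_threshold` concrete, here is a minimal sketch of a threshold-gated lookup. It assumes cosine similarity as the metric and a hit whenever the best score is at or above the threshold; the page does not state which metric the router uses, and `cosine_similarity`, `lookup`, and the `entries` list are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(entries, query_embedding, similarity_threshold=0.8):
    """Return the response of the best-matching entry, or None on a miss.

    `entries` is a hypothetical list of (embedding, response) pairs.
    Assumption: a hit requires best score >= similarity_threshold.
    """
    best_score, best_response = -1.0, None
    for embedding, response in entries:
        score = cosine_similarity(embedding, query_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= similarity_threshold else None


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
entries = [([1.0, 0.0], "answer about ML"), ([0.0, 1.0], "answer about cooking")]
hit = lookup(entries, [0.9, 0.1])    # close to the first entry
miss = lookup(entries, [0.7, 0.7])   # equally far from both entries
```

Raising the threshold makes the second kind of query miss more often, which trades hit rate for answer accuracy.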

Category-Level Configuration Options

Parameter	Type	Default	Description
`semantic_cache_enabled`	boolean	(inherits global)	Enable/disable caching for this category
`semantic_cache_similarity_threshold`	float	(inherits global)	Category-specific similarity threshold (0.0-1.0)

Category-level settings override global settings. If not specified, the category uses the global cache configuration.
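
The override rule can be sketched as a per-key fallback from the category to the global block. This is an illustration under stated assumptions, not the router's actual resolution code: `effective_cache_settings` is a hypothetical helper, and the dictionaries simply mirror the YAML keys shown earlier.

```python
def effective_cache_settings(global_cfg, category):
    """Resolve cache settings for one category.

    Assumption: each category key, when present, replaces the matching
    global value; missing keys fall back to the global setting.
    """
    return {
        "enabled": category.get("semantic_cache_enabled", global_cfg["enabled"]),
        "similarity_threshold": category.get(
            "semantic_cache_similarity_threshold",
            global_cfg["similarity_threshold"],
        ),
    }


global_cfg = {"enabled": True, "similarity_threshold": 0.8}
categories = [
    {"name": "health", "semantic_cache_enabled": True,
     "semantic_cache_similarity_threshold": 0.95},
    {"name": "general_chat", "semantic_cache_similarity_threshold": 0.75},
    {"name": "troubleshooting"},  # no cache settings: inherits the global values
]

resolved = {c["name"]: effective_cache_settings(global_cfg, c) for c in categories}
```

With the values above, `health` resolves to 0.95, `general_chat` to 0.75, and `troubleshooting` to the global 0.8.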

Environment Examples

Development Environment

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.9     # Strict matching for testing
  max_entries: 500             # Small cache for development
  ttl_seconds: 1800            # 30 minutes
  eviction_policy: "fifo"

Setup and Testing

1. Enable In-Memory Cache

Update your configuration file:

# Append the cache settings to config/config.yaml
# (edit the file by hand instead if it already has a semantic_cache block)
cat >> config/config.yaml << EOF
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
EOF

2. Start the Router

# Start the semantic router
make run-router

# Or run directly
./bin/router --config config/config.yaml

3. Test Cache Functionality

Send identical requests to verify cache hits:

# First request (cache miss)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Second identical request (cache hit)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Similar request (semantic cache hit if its similarity meets the threshold)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Explain machine learning concepts"}]
  }'
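
The same three requests can also be sent from a script that times each one. The sketch below reuses the endpoint and model name from the curl commands; the helper names are hypothetical. It assumes that a cache hit usually returns faster than a miss, so treat latency as a rough heuristic rather than proof of a hit.

```python
import json
import time
import urllib.request

# Endpoint taken from the curl examples above.
ROUTER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(prompt, model="MoM"):
    """Build the same POST request that the curl commands send."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        ROUTER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_request(prompt):
    """Send one request and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        resp.read()
    return time.perf_counter() - start


def compare_latencies(prompts):
    """Return (prompt, seconds) pairs, sending the prompts in order."""
    return [(prompt, timed_request(prompt)) for prompt in prompts]
```

With the router running, calling `compare_latencies(["What is machine learning?", "What is machine learning?", "Explain machine learning concepts"])` sends the miss, the identical repeat, and the similar prompt in sequence.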

Advantages

Ultra-low latency: Direct memory access, no network overhead
Simple setup: No external dependencies required
High throughput: Can handle thousands of cache operations per second
Immediate availability: Cache is ready as soon as the router starts

Limitations

Volatile storage: Cache is lost when the router restarts
Single instance: Cannot be shared across multiple router instances
Memory constraints: Limited by available system memory
No persistence: No data recovery after crashes

Memory Management

Automatic Cleanup

The in-memory cache automatically manages memory through:

TTL Expiration: Entries are removed after ttl_seconds
Eviction: When max_entries is reached, entries are removed according to eviction_policy (FIFO by default; LRU and LFU are also available)
Periodic Cleanup: Expired entries are cleaned every cleanup_interval_seconds
Memory Pressure: Aggressive cleanup when approaching memory_limit_mb
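
The interaction between TTL expiration and entry-count eviction can be sketched with a toy cache. This is a simplified illustration under stated assumptions, not the router's implementation: `BoundedCache` and its methods are hypothetical, expired entries are purged on each access rather than by a periodic task, and memory-pressure cleanup is not modeled.

```python
import time


class BoundedCache:
    """Toy cache with TTL expiration and policy-based eviction."""

    def __init__(self, max_entries=1000, ttl_seconds=3600,
                 eviction_policy="fifo", clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.eviction_policy = eviction_policy
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = {}   # key -> {"value", "created", "last_used", "hits"}

    def cleanup_expired(self):
        """Drop entries older than ttl_seconds (0 means no expiration)."""
        if self.ttl_seconds == 0:
            return
        now = self.clock()
        expired = [k for k, e in self.entries.items()
                   if now - e["created"] >= self.ttl_seconds]
        for key in expired:
            del self.entries[key]

    def _victim(self):
        """Pick the key to evict: oldest (fifo), least recent (lru), least hit (lfu)."""
        field = {"fifo": "created", "lru": "last_used", "lfu": "hits"}[self.eviction_policy]
        return min(self.entries, key=lambda k: self.entries[k][field])

    def put(self, key, value):
        self.cleanup_expired()
        if key not in self.entries and len(self.entries) >= self.max_entries:
            del self.entries[self._victim()]
        now = self.clock()
        self.entries[key] = {"value": value, "created": now, "last_used": now, "hits": 0}

    def get(self, key):
        self.cleanup_expired()
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry["last_used"] = self.clock()
        entry["hits"] += 1
        return entry["value"]


# FIFO ignores reads: "a" is evicted first even though it was just used.
cache = BoundedCache(max_entries=2, eviction_policy="fifo")
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")
cache.put("c", 3)
```

Under `"lru"` the same sequence would evict `"b"` instead, because the read refreshes `"a"`.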

Next Steps

Milvus Cache - Set up persistent, distributed caching
Cache Overview - Learn about semantic caching concepts
Observability - Monitor cache performance
