# APIM Model Load Balancing Policy

## Overview
This document describes the OpenAI-compatible model load balancing policy developed for Azure API Management (APIM). The policy is integrated into the latest version of the Terraform deployment scripts and intelligently distributes requests across multiple OpenAI-compatible backend services, providing high availability, automatic failover, and rate-limit-aware routing.
## Architecture Design

### Overall Architecture Diagram

## Core Features
### 1. Multi-Protocol Support

- Azure OpenAI: Uses `api-key` header authentication
- OpenAI-Compatible Services: Uses `Authorization: Bearer` authentication
- Path Format Adaptation:
  - Azure OpenAI: `/openai/deployments/{model}/chat/completions`
  - General Format: `/openai/{provider}/{model}/chat/completions`
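The policy itself runs as APIM policy XML, so the snippet below is only a minimal Python sketch of the authentication choice described above. The `provider` and `api_key` fields follow the backend configuration structure documented later in this document; everything else is illustrative.

```python
def auth_headers(backend: dict) -> dict:
    """Pick the authentication header for a backend entry.

    `backend` uses the fields from the ModelEndpoints configuration
    ("provider", "api_key"); the mapping mirrors the two authentication
    styles listed above.
    """
    if backend["provider"] == "azure-openai":
        # Azure OpenAI expects the key in the api-key header.
        return {"api-key": backend["api_key"]}
    # OpenAI-compatible services expect a bearer token.
    return {"Authorization": f"Bearer {backend['api_key']}"}


if __name__ == "__main__":
    print(auth_headers({"provider": "azure-openai", "api_key": "sk-demo"}))
    print(auth_headers({"provider": "openai-compatible", "api_key": "sk-demo"}))
```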
### 2. Intelligent Load Balancing Algorithm

#### Provider Grouping Mechanism

The load balancing policy first groups backends by the Provider in the request:

- Only backend services with the same Provider participate in the same load balancing group
- Different Providers do not load balance with each other and operate independently
- Each Provider group independently executes the priority and weight algorithms
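As a rough illustration of this grouping (not the actual policy code, which runs as APIM policy expressions), the sketch below groups backend entries by `provider` and then narrows the group to entries that list the requested model. The field names come from the backend configuration structure documented below.

```python
from collections import defaultdict


def group_by_provider(backends: list[dict]) -> dict[str, list[dict]]:
    """Group backend entries so that only entries with the same provider
    share a load balancing group."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for backend in backends:
        groups[backend["provider"]].append(backend)
    return dict(groups)


def candidates(backends: list[dict], provider: str, model: str) -> list[dict]:
    """Return the backends in one provider group that support the model."""
    group = group_by_provider(backends).get(provider, [])
    return [b for b in group if model in b["models"]]
```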
#### Multi-Layer Load Balancing Strategy

Core algorithm features:

- Priority Scheduling: Prefer high-priority backends, automatically falling back to lower-priority backends on failure
- Weighted Load Balancing: Distribute traffic proportionally by weight within the same priority
- Intelligent Failure Detection: Automatically detect rate limiting and error states
- Dynamic Recovery: Automatically reintegrate backends into load balancing after they recover
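A minimal sketch of priority-then-weight selection, assuming the `priority` and `weight` fields from the configuration structure below and a weighted random choice within the best available priority tier. The real policy implements this inside APIM policy expressions, and its exact tie-breaking behaviour may differ.

```python
import random


def pick_backend(candidates: list[dict]) -> dict | None:
    """Choose a backend: the lowest priority number wins, then a weighted
    random selection is made among the backends sharing that priority."""
    available = [b for b in candidates if not b.get("unavailable", False)]
    if not available:
        return None  # caller should answer with 503 "no available backend"
    best_priority = min(b["priority"] for b in available)
    tier = [b for b in available if b["priority"] == best_priority]
    return random.choices(tier, weights=[b["weight"] for b in tier], k=1)[0]
```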
### 3. Intelligent Failure Handling

#### Failure Detection and Automatic Recovery

#### Core Fault Tolerance Mechanisms

- Multi-Level Failure Detection: Supports HTTP status codes, timeouts, connection errors, etc.
- Intelligent Retry Strategy: Up to 5 retries, automatically skipping unavailable backends
- Automatic State Recovery: Time window-based automatic recovery mechanism
- Real-Time State Sync: Backend state changes are updated in the cache in real time
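The sketch below illustrates the time window-based recovery idea. It assumes an in-memory map standing in for the cache the real policy uses, and a recovery window taken from a Retry-After value or a default; names and the default window are illustrative.

```python
import time

# Stand-in for the cache the real policy uses to share backend state.
_unavailable_until: dict[str, float] = {}


def mark_unavailable(backend_id: str, retry_after_seconds: float = 60.0) -> None:
    """Record a failure (e.g. HTTP 429) and when the backend may recover."""
    _unavailable_until[backend_id] = time.time() + retry_after_seconds


def is_available(backend_id: str) -> bool:
    """A backend re-enters load balancing once its recovery window elapses."""
    return time.time() >= _unavailable_until.get(backend_id, 0.0)
```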
## Detailed Workflow

### Request Processing Flow

### Core Processing Stages

- Request Parsing: Extract Provider and Model information from the URL, supporting multiple API path formats (see the sketch below)
- Backend Filtering: Filter matching backends by Provider+Model combination, using cache to improve performance
- Load Balancing: Sort by priority, distribute traffic by weight within the same priority
- Request Forwarding: Configure authentication method according to backend type, adjust request path format
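The request parsing stage can be sketched as follows. The two path shapes match the formats listed under Multi-Protocol Support; mapping the Azure-style path to the `azure-openai` provider is an assumption made for illustration, as is the use of regular expressions.

```python
import re

# /openai/deployments/{model}/...  -> Azure OpenAI style path
# /openai/{provider}/{model}/...   -> general style path
_AZURE = re.compile(r"^/openai/deployments/(?P<model>[^/]+)/")
_GENERAL = re.compile(r"^/openai/(?P<provider>[^/]+)/(?P<model>[^/]+)/")


def parse_path(path: str) -> tuple[str, str] | None:
    """Return (provider, model) extracted from a request path, or None."""
    if m := _AZURE.match(path):
        return "azure-openai", m.group("model")  # assumed provider mapping
    if m := _GENERAL.match(path):
        return m.group("provider"), m.group("model")
    return None


if __name__ == "__main__":
    print(parse_path("/openai/deployments/gpt-4.1/chat/completions"))
    print(parse_path("/openai/openai-compatible/custom-model/chat/completions"))
```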
## Configuration Management

### Configuration Methods

#### 1. Configure via APIM Named Value

- Configuration Name: `ModelEndpoints`
- Configuration Type: JSON string
- Update Method: Takes effect in real time, no service restart required

Steps:

- Log in to Azure Portal → API Management instance
- Select the "Named values" menu
- Find the `ModelEndpoints` configuration item
- Edit the JSON configuration and save

#### 2. Configure via Terraform

Suitable for automated deployment and version control scenarios.
### Backend Configuration Structure

```json
{
  "provider": "provider-type",
  "priority": 1,
  "weight": 50,
  "endpoint": "https://api-endpoint.example.com/path/{model}",
  "api_key": "********",
  "models": ["model-a", "model-b", "model-c"]
}
```
### Configuration Parameter Description

| Parameter | Type | Description |
|---|---|---|
| provider | string | Provider type identifier |
| priority | number | Priority (a lower number means higher priority) |
| weight | number | Weight value (traffic distribution ratio within the same priority) |
| endpoint | string | Backend endpoint URL (the `{model}` placeholder is replaced with the actual model name) |
| api_key | string | API authentication key |
| models | array | List of models supported by this backend |
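Before pasting a configuration into the `ModelEndpoints` Named Value, it can help to sanity-check it against the fields in the table above. The small Python check below is not part of the policy; it only validates that each entry carries the documented fields.

```python
import json

REQUIRED_FIELDS = {"provider", "priority", "weight", "endpoint", "api_key", "models"}


def validate_model_endpoints(raw: str) -> list[dict]:
    """Parse a ModelEndpoints JSON string and verify that every backend
    entry has the documented fields. Raises ValueError on the first problem."""
    backends = json.loads(raw)
    if not isinstance(backends, list):
        raise ValueError("ModelEndpoints must be a JSON array of backend objects")
    for i, backend in enumerate(backends):
        missing = REQUIRED_FIELDS - backend.keys()
        if missing:
            raise ValueError(f"backend {i} is missing fields: {sorted(missing)}")
        if not isinstance(backend["models"], list):
            raise ValueError(f"backend {i}: 'models' must be an array")
    return backends
```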
### Configuration Example

#### APIM Named Value Configuration Example

```json
[
  {
    "provider": "azure-openai",
    "priority": 1,
    "weight": 50,
    "endpoint": "https://xxxxxxx.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "azure-openai",
    "priority": 2,
    "weight": 30,
    "endpoint": "https://yyyyyyy.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "openai-compatible",
    "priority": 1,
    "weight": 100,
    "endpoint": "https://api.example.com/v1/",
    "api_key": "********",
    "models": ["custom-model"]
  }
]
```
Configuration Notes:

- The first two configurations are for the `azure-openai` provider and will be in the same load balancing group
- The third configuration is for the `openai-compatible` provider and will be in a separate group
- There is no load balancing between different Provider groups
## Caching Strategy

### Caching Mechanism

- Cache Duration: 30 seconds (balances performance and timely configuration updates)
- Cache Scope: Cached separately per Provider+Model combination
- Invalidation Condition: Automatically invalidated when the configuration changes
- Update Strategy: Lazy loading, cache entries generated on demand
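A minimal sketch of the lazy, per Provider+Model cache with a 30-second TTL. The real policy relies on APIM's caching rather than a Python dictionary; `load_candidates` is a placeholder for whatever filters the `ModelEndpoints` configuration.

```python
import time

CACHE_TTL_SECONDS = 30
_cache: dict[tuple[str, str], tuple[float, list[dict]]] = {}


def cached_candidates(provider: str, model: str, load_candidates) -> list[dict]:
    """Return the backend list for (provider, model), rebuilding it lazily
    when the cache entry is missing or older than the 30-second TTL."""
    key = (provider, model)
    now = time.time()
    entry = _cache.get(key)
    if entry is None or now - entry[0] > CACHE_TTL_SECONDS:
        backends = load_candidates(provider, model)  # e.g. filter ModelEndpoints
        _cache[key] = (now, backends)
        return backends
    return entry[1]
```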
### Configuration Update Flow

Configuration Change Detection → Cache Invalidation → Dynamic Rebuild → Zero-Downtime Update
## Error Handling and Retry

### Retry Mechanism

- Automatic retry conditions: HTTP 429, HTTP 5xx
- Maximum retry attempts: 5
- Retry interval: 1 second
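The retry behaviour described above (and the error handling table that follows) can be summarised with the sketch below. `send_request`, `is_available`, and `mark_unavailable` are placeholders for the forwarding and health-tracking steps; the real policy performs the equivalent logic in APIM policy XML.

```python
import time

MAX_RETRIES = 5
RETRY_INTERVAL_SECONDS = 1


def call_with_retries(candidates, send_request, is_available, mark_unavailable):
    """Try candidate backends, retrying on 429/5xx up to MAX_RETRIES times.

    send_request(backend) -> (status_code, body); is_available and
    mark_unavailable track backend health (compare the earlier failure sketch).
    """
    for _ in range(MAX_RETRIES):
        usable = [b for b in candidates if is_available(b["endpoint"])]
        if not usable:
            break
        backend = usable[0]  # selection order is up to the caller (priority/weight)
        status, body = send_request(backend)
        if status == 429 or 500 <= status <= 599:
            mark_unavailable(backend["endpoint"])
            time.sleep(RETRY_INTERVAL_SECONDS)
            continue
        return status, body
    return 503, "no available backend"
```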
### Error Type Handling

| Error Code | Handling Strategy | Description |
|---|---|---|
| 429 | Mark the backend as rate limited and retry other backends | Recovery time is set according to the `Retry-After` header |
| 5xx | Retry other backends | Server internal error; try other backends |
| 503 | Return a "no available backend" error to the client | Returned when all backends are unavailable |
## Debugging Features

### Debug Headers

When debug mode is enabled, the following headers are added to the response:

```text
X-Debug-Provider: provider-a
X-Debug-Model: model-x
X-Debug-Path: /api/model-x/chat/completions -> /chat/completions
X-Debug-Backend: 0
X-Debug-Endpoint: https://api-a.example.com/path/model-x
```
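One way to inspect these headers from a client using only the Python standard library. The gateway URL, request path, and subscription key below are placeholders for your own APIM instance and API; adjust them to match your deployment.

```python
import json
import urllib.error
import urllib.request

# Placeholders: substitute your own APIM gateway URL, API path, and key.
URL = "https://my-apim.azure-api.net/openai/openai-compatible/custom-model/chat/completions"

req = urllib.request.Request(
    URL,
    data=json.dumps({"messages": [{"role": "user", "content": "ping"}]}).encode(),
    headers={
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": "<your-apim-key>",
    },
)
try:
    resp = urllib.request.urlopen(req)
except urllib.error.HTTPError as err:
    resp = err  # inspect headers even if the gateway returns an error status

for name, value in resp.headers.items():
    if name.lower().startswith("x-debug-"):
        print(f"{name}: {value}")
```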
### Enable Debug Mode

```hcl
variable "enable_debug_headers" {
  description = "Enable debug headers for load balancer policy"
  type        = bool
  default     = true
}
```
## Best Practices

### 1. Priority Design Recommendations

- Priority 1: Primary service provider (low latency, high reliability)
- Priority 2: Backup service provider (different region or provider)
- Priority 3: Cost-optimized service (price advantage or special models)
### 2. Weight Distribution Strategy
- Backends with similar performance: Use equal weights
- Backends with different performance: Adjust weight ratio according to performance differences
- Cost considerations: Assign higher weights to lower-cost backends
### 3. Provider Grouping Strategy
- Group by service type: Group the same type of API services under the same Provider
- Group by region: Group Providers by region
- Group by cost: Group Providers by cost characteristics
- Group by function: Group Providers by model functionality
### 4. Monitoring and Operations
- Enable debug headers: Enable in development and testing environments
- Log monitoring: Monitor 503 errors and retry counts
- Performance monitoring: Monitor cache hit rate and response time
- Configuration monitoring: Monitor Named Value configuration changes and effectiveness
## Troubleshooting

### Common Issues

#### 1. 503 Service Unavailable

Cause: No available backend supports the requested provider+model combination.

Solution:

- Check if there is a corresponding backend in the configuration
- Verify that the model name is correct
- Confirm whether all backends are marked as rate limited

#### 2. Authentication Failure

Cause: API key misconfigured or expired.

Solution:

- Verify that the backend service API key is correct
- Check whether the authentication method matches the backend requirements

#### 3. Path Parsing Error

Cause: Request path format does not meet expectations.

Solution:

- Ensure the correct path format is used
- Check whether the provider and model parameters are passed correctly
### Debugging Steps

- Enable debug headers: Set the `EnableDebugHeaders` Named Value to `true`
- Check debug information: View the `X-Debug-*` headers in the response
- Verify configuration loading: Confirm that the `ModelEndpoints` Named Value is configured correctly
- Check cache status: Monitor cache generation and invalidation
- Monitor backend status: Observe backend response status and retry behavior