
APIM Model Load Balancing Policy

Overview

This document describes the OpenAI-compatible model load balancing policy developed for Azure API Management (APIM). The policy is integrated into the latest version of the Terraform deployment scripts; it intelligently distributes requests across multiple OpenAI-compatible backend services, providing high availability, automatic failover, and intelligent rate limiting.

Architecture Design

Overall Architecture Diagram

Core Features

1. Multi-Protocol Support

  • Azure OpenAI: Uses api-key header authentication
  • OpenAI-Compatible Services: Uses Authorization: Bearer authentication
  • Path Format Adaptation (see the parsing sketch after this list):
    • Azure OpenAI: /openai/deployments/{model}/chat/completions
    • General Format: /openai/{provider}/{model}/chat/completions
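
To make the adaptation concrete, here is a minimal Python sketch of the parsing and authentication choice described above. It is illustrative only, not the policy's actual implementation (which is expressed as an APIM policy); the function names, and the assumption that the deployments-style path maps to the azure-openai provider, are ours.

import re

def parse_request_path(path: str) -> tuple[str, str]:
    """Extract (provider, model) from the two supported path formats."""
    # Azure OpenAI format: /openai/deployments/{model}/chat/completions
    # (assumed here to map to the "azure-openai" provider)
    m = re.match(r"^/openai/deployments/([^/]+)/chat/completions$", path)
    if m:
        return "azure-openai", m.group(1)
    # General format: /openai/{provider}/{model}/chat/completions
    m = re.match(r"^/openai/([^/]+)/([^/]+)/chat/completions$", path)
    if m:
        return m.group(1), m.group(2)
    raise ValueError(f"Unsupported path format: {path}")

def auth_headers(provider: str, api_key: str) -> dict[str, str]:
    """Azure OpenAI uses api-key; OpenAI-compatible services use a Bearer token."""
    if provider == "azure-openai":
        return {"api-key": api_key}
    return {"Authorization": f"Bearer {api_key}"}

print(parse_request_path("/openai/deployments/gpt-4.1/chat/completions"))
# ('azure-openai', 'gpt-4.1')
print(parse_request_path("/openai/openai-compatible/custom-model/chat/completions"))
# ('openai-compatible', 'custom-model')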

2. Intelligent Load Balancing Algorithm

Provider Grouping Mechanism

The load balancing policy first groups backends by the Provider in the request (a sketch of this step follows the list):

  • Only backend services with the same Provider participate in the same load balancing group
  • Different Providers do not load balance with each other and operate independently
  • Each Provider group independently executes priority and weight algorithms
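
A minimal sketch of this grouping step, in Python for illustration. It assumes the ModelEndpoints configuration (see Configuration Management below) has already been parsed into a list of dictionaries; the function name is ours, not the policy's.

def candidate_backends(backends: list[dict], provider: str, model: str) -> list[dict]:
    """Only backends with the same provider that list the requested model
    form the load balancing group; other providers are never considered."""
    return [
        b for b in backends
        if b["provider"] == provider and model in b["models"]
    ]

# With the sample configuration shown later in this document:
config = [
    {"provider": "azure-openai", "priority": 1, "weight": 50, "models": ["gpt-4.1"]},
    {"provider": "azure-openai", "priority": 2, "weight": 30, "models": ["gpt-4.1"]},
    {"provider": "openai-compatible", "priority": 1, "weight": 100, "models": ["custom-model"]},
]
print(len(candidate_backends(config, "azure-openai", "gpt-4.1")))  # 2 -- one group
print(len(candidate_backends(config, "openai-compatible", "custom-model")))  # 1 -- separate group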

Multi-Layer Load Balancing Strategy

Core algorithm features (see the selection sketch after this list):

  • Priority Scheduling: Prefer high-priority backends, automatically downgrade on failure
  • Weighted Load Balancing: Distribute traffic proportionally by weight within the same priority
  • Intelligent Failure Detection: Automatically detect rate limiting and error states
  • Dynamic Recovery: Automatically reintegrate backends into load balancing after recovery
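
The policy's exact selection code is not reproduced here; the sketch below shows one common way to implement "lowest priority number first, weighted random within the same priority", matching the behaviour described above. Field names follow the backend configuration structure documented later; the is_available hook stands in for the failure tracking covered in the next section.

import random

def select_backend(candidates: list[dict], is_available=lambda b: True) -> dict | None:
    """Pick a backend: best (lowest-numbered) priority group first,
    then a weighted random choice within that group."""
    available = [b for b in candidates if is_available(b)]
    if not available:
        return None  # caller maps this to a 503 "no available backend" response
    best_priority = min(b["priority"] for b in available)
    group = [b for b in available if b["priority"] == best_priority]
    return random.choices(group, weights=[b["weight"] for b in group], k=1)[0]

For example, two backends in the same priority with weights 70 and 30 would receive roughly 70% and 30% of requests; if every backend in priority 1 becomes unavailable, selection falls through to priority 2, and failed backends rejoin the pool once they recover.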

3. Intelligent Failure Handling

Failure Detection and Automatic Recovery

Core Fault Tolerance Mechanisms

  • Multi-Level Failure Detection: Supports HTTP status codes, timeouts, connection errors, etc.
  • Intelligent Retry Strategy: Up to 5 retries, automatically skip unavailable backends
  • Automatic State Recovery: Time window-based automatic recovery mechanism
  • Real-Time State Sync: Backend state changes are written to the cache in real time (see the sketch after this list)
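
One way to picture this state tracking is a per-backend "unavailable until" timestamp, as in the illustrative Python sketch below. In the real deployment the state lives in the APIM cache rather than in process memory, and the identifiers and helper names here are assumptions.

import time

# backend id -> unix timestamp until which the backend is considered unavailable
unavailable_until: dict[str, float] = {}

def mark_unavailable(backend_id: str, retry_after_seconds: float) -> None:
    """Record a failure (e.g. HTTP 429) and when the backend may be retried."""
    unavailable_until[backend_id] = time.time() + retry_after_seconds

def is_available(backend_id: str) -> bool:
    """A backend recovers automatically once its time window has elapsed."""
    return time.time() >= unavailable_until.get(backend_id, 0.0)

# A backend rate limited with Retry-After: 30 rejoins the pool automatically
# about 30 seconds later, with no manual intervention.
mark_unavailable("azure-openai-0", 30)
print(is_available("azure-openai-0"))  # False until ~30 seconds have passed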

Detailed Workflow

Request Processing Flow

Core Processing Stages

  1. Request Parsing: Extract Provider and Model information from the URL, supporting multiple API path formats
  2. Backend Filtering: Filter matching backends by Provider+Model combination, using cache to improve performance
  3. Load Balancing: Sort by priority, distribute traffic by weight within the same priority
  4. Request Forwarding: Configure authentication method according to backend type, adjust request path format

Configuration Management

Configuration Methods

1. Configure via APIM Named Value

  • Configuration Name: ModelEndpoints
  • Configuration Type: JSON string
  • Update Method: Takes effect in real time, no service restart required

Steps:

  1. Log in to Azure Portal → API Management instance
  2. Select the "Named values" menu
  3. Find the ModelEndpoints configuration item
  4. Edit the JSON configuration and save

2. Configure via Terraform

Suitable for automated deployment and version control scenarios.

Backend Configuration Structure

{
  "provider": "provider-type",
  "priority": 1,
  "weight": 50,
  "endpoint": "https://api-endpoint.example.com/path/{model}",
  "api_key": "********",
  "models": ["model-a", "model-b", "model-c"]
}

Configuration Parameter Description

Parameter | Type | Description
provider | string | Provider type identifier
priority | number | Priority (a lower number means a higher priority)
weight | number | Weight value (traffic share within the same priority)
endpoint | string | Backend endpoint URL; the {model} placeholder is replaced with the actual model name
api_key | string | API authentication key
models | array | List of models supported by this backend
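
Because the configuration is a free-form JSON string, a mistyped field name only surfaces at request time. The following Python sketch, which uses the field names from the table above, can be used to sanity-check the JSON before saving it as the Named Value; it is a convenience script we suggest, not part of the policy itself.

import json

REQUIRED_FIELDS = {
    "provider": str,
    "priority": (int, float),
    "weight": (int, float),
    "endpoint": str,
    "api_key": str,
    "models": list,
}

def validate_model_endpoints(raw_json: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    try:
        backends = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return [f"Invalid JSON: {e}"]
    if not isinstance(backends, list):
        return ["Top-level value must be a JSON array of backend objects"]
    problems = []
    for i, b in enumerate(backends):
        for field, expected in REQUIRED_FIELDS.items():
            if field not in b:
                problems.append(f"backend[{i}]: missing '{field}'")
            elif not isinstance(b[field], expected):
                problems.append(f"backend[{i}]: '{field}' has an unexpected type")
        if b.get("provider") == "azure-openai" and "{model}" not in b.get("endpoint", ""):
            problems.append(f"backend[{i}]: azure-openai endpoints normally need a {{model}} placeholder")
    return problems

problems = validate_model_endpoints(open("model_endpoints.json").read())
print("\n".join(problems) or "Configuration looks OK")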

Configuration Example

APIM Named Value Configuration Example

[
  {
    "provider": "azure-openai",
    "priority": 1,
    "weight": 50,
    "endpoint": "https://xxxxxxx.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "azure-openai",
    "priority": 2,
    "weight": 30,
    "endpoint": "https://yyyyyyy.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "openai-compatible",
    "priority": 1,
    "weight": 100,
    "endpoint": "https://api.example.com/v1/",
    "api_key": "********",
    "models": ["custom-model"]
  }
]

Configuration Notes:

  • The first two configurations are for the azure-openai provider and will be in the same load balancing group
  • The third configuration is for the openai-compatible provider and will be in a separate group
  • There is no load balancing between different Provider groups

Caching Strategy

Caching Mechanism

  • Cache Duration: 30 seconds (balances performance and timely configuration updates)
  • Cache Scope: Cached separately by Provider+Model combination
  • Invalidation Condition: Automatically invalidated when configuration changes
  • Update Strategy: Lazy loading; cache entries are generated on demand (see the sketch after this list)
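
Conceptually the cache behaves like a small TTL dictionary keyed by the Provider+Model combination. The Python sketch below mimics that behaviour for illustration only; in APIM the state is held in the gateway's cache, not in process memory, and the function names are ours.

import time

CACHE_TTL_SECONDS = 30  # balances performance and timely configuration updates

_cache: dict[str, tuple[float, object]] = {}  # key -> (expiry timestamp, value)

def cache_get(provider: str, model: str):
    """Return the cached backend list for provider+model, or None if missing or expired."""
    key = f"{provider}:{model}"
    entry = _cache.get(key)
    if entry is None or time.time() >= entry[0]:
        return None  # lazy loading: the caller rebuilds the list on demand
    return entry[1]

def cache_put(provider: str, model: str, value) -> None:
    _cache[f"{provider}:{model}"] = (time.time() + CACHE_TTL_SECONDS, value)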

Configuration Update Flow

  Configuration Change Detection → Cache Invalidation → Dynamic Rebuild → Zero-Downtime Update

Error Handling and Retry

Retry Mechanism

  • Automatic retry conditions: HTTP 429, HTTP 5xx
  • Maximum retry attempts: 5 times
  • Retry interval: 1 second (see the control-flow sketch after this list)
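
Putting these rules together, the control flow is roughly the following. This is a Python sketch using the requests library, written from the caller's perspective to stay self-contained: passing in an already ordered list of (url, headers) attempts is an assumption of the example, while the real policy implements the equivalent logic inside APIM.

import time
import requests

MAX_RETRIES = 5
RETRY_INTERVAL_SECONDS = 1

def call_with_retries(attempts: list[tuple[str, dict]], payload: dict) -> requests.Response | None:
    """Try prepared (url, headers) pairs in order; retry on 429/5xx, up to 5 attempts.

    `attempts` is assumed to already be ordered by the load balancing rules
    (priority first, weighted within a priority), with unavailable backends removed.
    """
    last = None
    for attempt, (url, headers) in enumerate(attempts[:MAX_RETRIES], start=1):
        resp = requests.post(url, json=payload, headers=headers, timeout=60)
        if resp.status_code == 429:
            # Honour Retry-After when present (assumed to carry seconds); in the
            # policy this feeds the backend's recovery window described earlier.
            retry_after = int(resp.headers.get("Retry-After", "60"))
            print(f"attempt {attempt}: rate limited, backend paused for {retry_after}s")
        elif 500 <= resp.status_code < 600:
            print(f"attempt {attempt}: server error {resp.status_code}, trying another backend")
        else:
            return resp
        last = resp
        time.sleep(RETRY_INTERVAL_SECONDS)
    return last  # the policy maps "all attempts failed" to a 503 response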

Error Type Handling

Error Code | Handling Strategy | Description
429 | Mark the backend as rate limited and retry other backends | Recovery time is set from the Retry-After header
5xx | Retry other backends | Internal server error; other backends are tried
503 | Return a "no available backend" error | Returned to the client when all backends are unavailable

Debugging Features

Debug Headers

When debug mode is enabled, the following headers will be added to the response:

X-Debug-Provider: provider-a
X-Debug-Model: model-x
X-Debug-Path: /api/model-x/chat/completions -> /chat/completions
X-Debug-Backend: 0
X-Debug-Endpoint: https://api-a.example.com/path/model-x

Enable Debug Mode

variable "enable_debug_headers" {
description = "Enable debug headers for load balancer policy"
type = bool
default = true
}

Best Practices

1. Priority Design Recommendations

Priority 1: Primary service provider (low latency, high reliability)
Priority 2: Backup service provider (different region or provider)
Priority 3: Cost-optimized service (price advantage or special models)

2. Weight Distribution Strategy

  • Backends with similar performance: Use equal weights
  • Backends with different performance: Adjust weight ratio according to performance differences
  • Cost considerations: Assign higher weights to lower-cost backends

3. Provider Grouping Strategy

  • Group by service type: Group the same type of API services under the same Provider
  • Group by region: Group Providers by region
  • Group by cost: Group Providers by cost characteristics
  • Group by function: Group Providers by model functionality

4. Monitoring and Operations

  • Enable debug headers: Enable in development and testing environments
  • Log monitoring: Monitor 503 errors and retry counts
  • Performance monitoring: Monitor cache hit rate and response time
  • Configuration monitoring: Monitor Named Value configuration changes and effectiveness

Troubleshooting

Common Issues

1. 503 Service Unavailable

Cause: No available backend supports the requested provider+model combination
Solution:

  • Check if there is a corresponding backend in the configuration
  • Verify that the model name is correct
  • Confirm whether all backends are marked as rate limited

2. Authentication Failure

Cause: API key misconfigured or expired
Solution:

  • Verify that the backend service API key is correct
  • Check if the authentication method matches the backend requirements

3. Path Parsing Error

Cause: Request path format does not meet expectations
Solution:

  • Ensure the correct path format is used
  • Check whether the provider and model parameters are passed correctly

Debugging Steps

  1. Enable debug headers: Set the EnableDebugHeaders Named Value to true
  2. Check debug information: View the X-Debug-* headers in the response (see the example request at the end of this section)
  3. Verify configuration loading: Confirm that the ModelEndpoints Named Value is configured correctly
  4. Check cache status: Monitor cache generation and invalidation
  5. Monitor backend status: Observe backend response status and retry behavior
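
For step 2, the debug headers can be inspected with any HTTP client. The snippet below is one example using Python's requests library; the gateway URL, request path, and subscription key are placeholders, and whether a subscription key is required depends on how the API is configured in your APIM instance.

import requests

APIM_GATEWAY = "https://<your-apim>.azure-api.net"       # placeholder
PATH = "/openai/azure-openai/gpt-4.1/chat/completions"   # adjust to how the API is exposed
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"        # placeholder

resp = requests.post(
    f"{APIM_GATEWAY}{PATH}",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    json={"messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
)

# With debug mode enabled, the routing decision is visible in the response headers.
for name, value in resp.headers.items():
    if name.lower().startswith("x-debug-"):
        print(f"{name}: {value}")
print(resp.status_code)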