
APIM Model Load Balancing Policy

Overview

This document describes the OpenAI-compatible model load balancing policy developed for Azure API Management (APIM). The policy is integrated into the latest version of the Terraform deployment scripts; it intelligently distributes requests across multiple OpenAI-compatible backend services, providing high availability, automatic failover, and intelligent rate limiting.

Architecture Design

Overall Architecture Diagram

Core Features

1. Multi-Protocol Support

  • Azure OpenAI: Uses api-key header authentication
  • OpenAI-Compatible Services: Uses Authorization: Bearer authentication
  • Path Format Adaptation (see the parsing sketch after this list):
    • Azure OpenAI: /openai/deployments/{model}/chat/completions
    • General Format: /openai/{provider}/{model}/chat/completions
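
To make the adaptation concrete, here is a minimal Python sketch of the parsing and authentication choice described above. It is illustrative only, not the policy's actual implementation (which is expressed as an APIM policy); the function names, and the assumption that the deployments-style path maps to the azure-openai provider, are ours.

import re

def parse_request_path(path: str) -> tuple[str, str]:
    """Extract (provider, model) from the two supported path formats."""
    # Azure OpenAI format: /openai/deployments/{model}/chat/completions
    # (assumed here to map to the "azure-openai" provider)
    m = re.match(r"^/openai/deployments/([^/]+)/chat/completions$", path)
    if m:
        return "azure-openai", m.group(1)
    # General format: /openai/{provider}/{model}/chat/completions
    m = re.match(r"^/openai/([^/]+)/([^/]+)/chat/completions$", path)
    if m:
        return m.group(1), m.group(2)
    raise ValueError(f"Unsupported path format: {path}")

def auth_headers(provider: str, api_key: str) -> dict[str, str]:
    """Azure OpenAI uses api-key; OpenAI-compatible services use a Bearer token."""
    if provider == "azure-openai":
        return {"api-key": api_key}
    return {"Authorization": f"Bearer {api_key}"}

print(parse_request_path("/openai/deployments/gpt-4.1/chat/completions"))
# ('azure-openai', 'gpt-4.1')
print(parse_request_path("/openai/openai-compatible/custom-model/chat/completions"))
# ('openai-compatible', 'custom-model')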

2. Intelligent Load Balancing Algorithm

Provider Grouping Mechanism

The load balancing policy first groups backends by the Provider in the request (a sketch of this step follows the list):

  • Only backend services with the same Provider participate in the same load balancing group
  • Different Providers do not load balance with each other and operate independently
  • Each Provider group independently executes priority and weight algorithms
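
A minimal sketch of this grouping step, in Python for illustration. It assumes the ModelEndpoints configuration (see Configuration Management below) has already been parsed into a list of dictionaries; the function name is ours, not the policy's.

def candidate_backends(backends: list[dict], provider: str, model: str) -> list[dict]:
    """Only backends with the same provider that list the requested model
    form the load balancing group; other providers are never considered."""
    return [
        b for b in backends
        if b["provider"] == provider and model in b["models"]
    ]

# With the sample configuration shown later in this document:
config = [
    {"provider": "azure-openai", "priority": 1, "weight": 50, "models": ["gpt-4.1"]},
    {"provider": "azure-openai", "priority": 2, "weight": 30, "models": ["gpt-4.1"]},
    {"provider": "openai-compatible", "priority": 1, "weight": 100, "models": ["custom-model"]},
]
print(len(candidate_backends(config, "azure-openai", "gpt-4.1")))  # 2 -- one group
print(len(candidate_backends(config, "openai-compatible", "custom-model")))  # 1 -- separate group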

Multi-Layer Load Balancing Strategy

Core algorithm features (see the selection sketch after this list):

  • Priority Scheduling: Prefer high-priority backends, automatically downgrade on failure
  • Weighted Load Balancing: Distribute traffic proportionally by weight within the same priority
  • Intelligent Failure Detection: Automatically detect rate limiting and error states
  • Dynamic Recovery: Automatically reintegrate backends into load balancing after recovery
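
The policy's exact selection code is not reproduced here; the sketch below shows one common way to implement "lowest priority number first, weighted random within the same priority", matching the behaviour described above. Field names follow the backend configuration structure documented later; the is_available hook stands in for the failure tracking covered in the next section.

import random

def select_backend(candidates: list[dict], is_available=lambda b: True) -> dict | None:
    """Pick a backend: best (lowest-numbered) priority group first,
    then a weighted random choice within that group."""
    available = [b for b in candidates if is_available(b)]
    if not available:
        return None  # caller maps this to a 503 "no available backend" response
    best_priority = min(b["priority"] for b in available)
    group = [b for b in available if b["priority"] == best_priority]
    return random.choices(group, weights=[b["weight"] for b in group], k=1)[0]

For example, two backends in the same priority with weights 70 and 30 would receive roughly 70% and 30% of requests; if every backend in priority 1 becomes unavailable, selection falls through to priority 2, and failed backends rejoin the pool once they recover.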

3. Intelligent Failure Handling

Failure Detection and Automatic Recovery

Core Fault Tolerance Mechanisms

  • Multi-Level Failure Detection: Supports HTTP status codes, timeouts, connection errors, etc.
  • Intelligent Retry Strategy: Up to 5 retries, automatically skip unavailable backends
  • Automatic State Recovery: Time window-based automatic recovery mechanism
  • Real-Time State Sync: Backend state changes are written to the cache in real time (see the sketch after this list)
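
One way to picture this state tracking is a per-backend "unavailable until" timestamp, as in the illustrative Python sketch below. In the real deployment the state lives in the APIM cache rather than in process memory, and the identifiers and helper names here are assumptions.

import time

# backend id -> unix timestamp until which the backend is considered unavailable
unavailable_until: dict[str, float] = {}

def mark_unavailable(backend_id: str, retry_after_seconds: float) -> None:
    """Record a failure (e.g. HTTP 429) and when the backend may be retried."""
    unavailable_until[backend_id] = time.time() + retry_after_seconds

def is_available(backend_id: str) -> bool:
    """A backend recovers automatically once its time window has elapsed."""
    return time.time() >= unavailable_until.get(backend_id, 0.0)

# A backend rate limited with Retry-After: 30 rejoins the pool automatically
# about 30 seconds later, with no manual intervention.
mark_unavailable("azure-openai-0", 30)
print(is_available("azure-openai-0"))  # False until ~30 seconds have passed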

Detailed Workflow

Request Processing Flow

Core Processing Stages

  1. Request Parsing: Extract Provider and Model information from the URL, supporting multiple API path formats
  2. Backend Filtering: Filter matching backends by Provider+Model combination, using cache to improve performance
  3. Load Balancing: Sort by priority, distribute traffic by weight within the same priority
  4. Request Forwarding: Configure authentication method according to backend type, adjust request path format

Configuration Management

Configuration Methods

1. Configure via APIM Named Value

  • Configuration Name: ModelEndpoints
  • Configuration Type: JSON string
  • Update Method: Takes effect in real time, no service restart required

Steps:

  1. Log in to Azure Portal → API Management instance
  2. Select the "Named values" menu
  3. Find the ModelEndpoints configuration item
  4. Edit the JSON configuration and save

2. Configure via Terraform

Suitable for automated deployment and version control scenarios.

Backend Configuration Structure

{
  "provider": "provider-type",
  "priority": 1,
  "weight": 50,
  "endpoint": "https://api-endpoint.example.com/path/{model}",
  "api_key": "********",
  "models": ["model-a", "model-b", "model-c"]
}

Configuration Parameter Description

Parameter | Type | Description
provider | string | Provider type identifier
priority | number | Priority (a lower number means a higher priority)
weight | number | Weight value (traffic share within the same priority)
endpoint | string | Backend endpoint URL; the {model} placeholder is replaced with the actual model name
api_key | string | API authentication key
models | array | List of models supported by this backend
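
Because the configuration is a free-form JSON string, a mistyped field name only surfaces at request time. The following Python sketch, which uses the field names from the table above, can be used to sanity-check the JSON before saving it as the Named Value; it is a convenience script we suggest, not part of the policy itself.

import json

REQUIRED_FIELDS = {
    "provider": str,
    "priority": (int, float),
    "weight": (int, float),
    "endpoint": str,
    "api_key": str,
    "models": list,
}

def validate_model_endpoints(raw_json: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    try:
        backends = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return [f"Invalid JSON: {e}"]
    if not isinstance(backends, list):
        return ["Top-level value must be a JSON array of backend objects"]
    problems = []
    for i, b in enumerate(backends):
        for field, expected in REQUIRED_FIELDS.items():
            if field not in b:
                problems.append(f"backend[{i}]: missing '{field}'")
            elif not isinstance(b[field], expected):
                problems.append(f"backend[{i}]: '{field}' has an unexpected type")
        if b.get("provider") == "azure-openai" and "{model}" not in b.get("endpoint", ""):
            problems.append(f"backend[{i}]: azure-openai endpoints normally need a {{model}} placeholder")
    return problems

problems = validate_model_endpoints(open("model_endpoints.json").read())
print("\n".join(problems) or "Configuration looks OK")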

Configuration Example

APIM Named Value Configuration Example

[
  {
    "provider": "azure-openai",
    "priority": 1,
    "weight": 50,
    "endpoint": "https://xxxxxxx.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "azure-openai",
    "priority": 2,
    "weight": 30,
    "endpoint": "https://yyyyyyy.openai.azure.com/openai/deployments/{model}",
    "api_key": "********",
    "models": ["gpt-4.1", "gpt-4.1-mini"]
  },
  {
    "provider": "openai-compatible",
    "priority": 1,
    "weight": 100,
    "endpoint": "https://api.example.com/v1/",
    "api_key": "********",
    "models": ["custom-model"]
  }
]

Configuration Notes:

  • The first two configurations are for the azure-openai provider and will be in the same load balancing group
  • The third configuration is for the openai-compatible provider and will be in a separate group
  • There is no load balancing between different Provider groups

Caching Strategy

Caching Mechanism

  • Cache Duration: 30 seconds (balances performance and timely configuration updates)
  • Cache Scope: Cached separately by Provider+Model combination
  • Invalidation Condition: Automatically invalidated when configuration changes
  • Update Strategy: Lazy loading; cache entries are generated on demand (see the sketch after this list)
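
Conceptually the cache behaves like a small TTL dictionary keyed by the Provider+Model combination. The Python sketch below mimics that behaviour for illustration only; in APIM the state is held in the gateway's cache, not in process memory, and the function names are ours.

import time

CACHE_TTL_SECONDS = 30  # balances performance and timely configuration updates

_cache: dict[str, tuple[float, object]] = {}  # key -> (expiry timestamp, value)

def cache_get(provider: str, model: str):
    """Return the cached backend list for provider+model, or None if missing or expired."""
    key = f"{provider}:{model}"
    entry = _cache.get(key)
    if entry is None or time.time() >= entry[0]:
        return None  # lazy loading: the caller rebuilds the list on demand
    return entry[1]

def cache_put(provider: str, model: str, value) -> None:
    _cache[f"{provider}:{model}"] = (time.time() + CACHE_TTL_SECONDS, value)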

Configuration Update Flow

  Configuration Change Detection → Cache Invalidation → Dynamic Rebuild → Zero-Downtime Update

Error Handling and Retry

Retry Mechanism

  • Automatic retry conditions: HTTP 429, HTTP 5xx
  • Maximum retry attempts: 5 times
  • Retry interval: 1 second (see the control-flow sketch after this list)
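
Putting these rules together, the control flow is roughly the following. This is a Python sketch using the requests library, written from the caller's perspective to stay self-contained: passing in an already ordered list of (url, headers) attempts is an assumption of the example, while the real policy implements the equivalent logic inside APIM.

import time
import requests

MAX_RETRIES = 5
RETRY_INTERVAL_SECONDS = 1

def call_with_retries(attempts: list[tuple[str, dict]], payload: dict) -> requests.Response | None:
    """Try prepared (url, headers) pairs in order; retry on 429/5xx, up to 5 attempts.

    `attempts` is assumed to already be ordered by the load balancing rules
    (priority first, weighted within a priority), with unavailable backends removed.
    """
    last = None
    for attempt, (url, headers) in enumerate(attempts[:MAX_RETRIES], start=1):
        resp = requests.post(url, json=payload, headers=headers, timeout=60)
        if resp.status_code == 429:
            # Honour Retry-After when present (assumed to carry seconds); in the
            # policy this feeds the backend's recovery window described earlier.
            retry_after = int(resp.headers.get("Retry-After", "60"))
            print(f"attempt {attempt}: rate limited, backend paused for {retry_after}s")
        elif 500 <= resp.status_code < 600:
            print(f"attempt {attempt}: server error {resp.status_code}, trying another backend")
        else:
            return resp
        last = resp
        time.sleep(RETRY_INTERVAL_SECONDS)
    return last  # the policy maps "all attempts failed" to a 503 response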

Error Type Handling

Error Code | Handling Strategy | Description
429 | Mark the backend as rate limited and retry other backends | Recovery time is set from the Retry-After header
5xx | Retry other backends | Internal server error; other backends are tried
503 | Return a "no available backend" error | Returned to the client when all backends are unavailable

Debugging Features

Debug Headers

When debug mode is enabled, the following headers will be added to the response:

X-Debug-Provider: provider-a
X-Debug-Model: model-x
X-Debug-Path: /api/model-x/chat/completions -> /chat/completions
X-Debug-Backend: 0
X-Debug-Endpoint: https://api-a.example.com/path/model-x

Enable Debug Mode

variable "enable_debug_headers" {
description = "Enable debug headers for load balancer policy"
type = bool
default = true
}

Best Practices

1. Priority Design Recommendations

Priority 1: Primary service provider (low latency, high reliability)
Priority 2: Backup service provider (different region or provider)
Priority 3: Cost-optimized service (price advantage or special models)

2. Weight Distribution Strategy

  • Backends with similar performance: Use equal weights
  • Backends with different performance: Adjust weight ratio according to performance differences
  • Cost considerations: Assign higher weights to lower-cost backends

3. Provider Grouping Strategy

  • Group by service type: Group the same type of API services under the same Provider
  • Group by region: Group Providers by region
  • Group by cost: Group Providers by cost characteristics
  • Group by function: Group Providers by model functionality

4. Monitoring and Operations

  • Enable debug headers: Enable in development and testing environments
  • Log monitoring: Monitor 503 errors and retry counts
  • Performance monitoring: Monitor cache hit rate and response time
  • Configuration monitoring: Monitor Named Value configuration changes and effectiveness

Troubleshooting

Common Issues

1. 503 Service Unavailable

Cause: No available backend supports the requested provider+model combination
Solution:

  • Check if there is a corresponding backend in the configuration
  • Verify that the model name is correct
  • Confirm whether all backends are marked as rate limited

2. Authentication Failure

Cause: API key misconfigured or expired
Solution:

  • Verify that the backend service API key is correct
  • Check if the authentication method matches the backend requirements

3. Path Parsing Error

Cause: Request path format does not meet expectations
Solution:

  • Ensure the correct path format is used
  • Check whether the provider and model parameters are passed correctly

Debugging Steps

  1. Enable debug headers: Set the EnableDebugHeaders Named Value to true
  2. Check debug information: View the X-Debug-* headers in the response (see the example request at the end of this section)
  3. Verify configuration loading: Confirm that the ModelEndpoints Named Value is configured correctly
  4. Check cache status: Monitor cache generation and invalidation
  5. Monitor backend status: Observe backend response status and retry behavior
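
For step 2, the debug headers can be inspected with any HTTP client. The snippet below is one example using Python's requests library; the gateway URL, request path, and subscription key are placeholders, and whether a subscription key is required depends on how the API is configured in your APIM instance.

import requests

APIM_GATEWAY = "https://<your-apim>.azure-api.net"       # placeholder
PATH = "/openai/azure-openai/gpt-4.1/chat/completions"   # adjust to how the API is exposed
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"        # placeholder

resp = requests.post(
    f"{APIM_GATEWAY}{PATH}",
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    json={"messages": [{"role": "user", "content": "ping"}]},
    timeout=60,
)

# With debug mode enabled, the routing decision is visible in the response headers.
for name, value in resp.headers.items():
    if name.lower().startswith("x-debug-"):
        print(f"{name}: {value}")
print(resp.status_code)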