Understanding API Rate Limits

Rate limiting prevents any single client from overwhelming a shared API infrastructure. Understanding how providers implement it — and how to handle limit errors gracefully — is essential for building reliable integrations.

1. Rate limiting strategies

Providers use different algorithms to count and enforce limits. The most common are:

Token Bucket

A bucket holds up to N tokens and refills at a steady rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. This allows short bursts above the sustained rate as long as tokens have accumulated.

Used by: OpenAI (RPM per tier), Anthropic, most AI inference APIs.
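
The mechanics are easy to sketch. Below is a minimal, illustrative token bucket in Python; the class and parameter names are our own, not any provider's implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: holds up to `capacity` tokens,
    refilling at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full (allows an initial burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=10)  # 10 req/s sustained, burst of 5
results = [bucket.allow() for _ in range(7)]
# The first 5 calls drain the accumulated burst; the rest are rejected
# until the bucket refills.
```

Because the bucket refills continuously, a client that has been idle can burst up to `capacity` requests at once before settling back to the sustained `refill_rate`.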

Sliding Window

Tracks request timestamps over a rolling time window (e.g., the last 60 seconds). More accurate than fixed windows because it avoids the boundary spike problem — a client can't double their burst by sending requests at the end of one window and the start of the next.

Used by: Stripe, Algolia, Cloudflare APIs.
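
A sliding-window log can be sketched in a few lines (illustrative only; production implementations typically use Redis sorted sets or a counter approximation rather than an in-memory deque):

```python
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window log: keep timestamps of accepted requests and
    drop any older than the window before counting."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have aged out of the rolling window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
allowed = [limiter.allow(t) for t in (0, 0, 0, 30, 61)]
# Three requests at t=0 fill the window, so t=30 is rejected;
# by t=61 the t=0 entries have aged out and the request is accepted.
```

Note how a fixed window would have reset the count at t=60 regardless of when the earlier requests arrived; the rolling eviction is what prevents the boundary spike.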

Fixed Window

Counts requests within discrete time buckets (e.g., 0–60s, 60–120s). Simple to implement but vulnerable to burst traffic at window boundaries. Common in legacy APIs and simpler rate-limiting implementations.

Used by: Some payment and communication APIs on their free tiers.

Concurrent Request Limits

Some APIs limit the number of in-flight requests at any moment, regardless of rate. This is common for compute-intensive APIs (LLM inference, video processing) to prevent resource exhaustion. A 429 with a concurrent-limit message means you need to queue or reduce parallelism, not just slow your request rate.

Used by: Replicate, Hugging Face Inference API, video processing APIs.
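
On the client side, the standard way to stay under a concurrent-request cap is a semaphore bounding in-flight work. A sketch with asyncio, using a short sleep as a stand-in for the real API call and an assumed cap of 4:

```python
import asyncio

MAX_CONCURRENT = 4  # the provider's concurrent-request cap (assumed)

async def fetch_one(sem: asyncio.Semaphore, item: int) -> int:
    async with sem:                # at most MAX_CONCURRENT bodies run at once
        await asyncio.sleep(0.01)  # stand-in for the actual API call
        return item * 2

async def fetch_all(items):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # gather() preserves input order even though completion order varies
    return await asyncio.gather(*(fetch_one(sem, i) for i in items))

results = asyncio.run(fetch_all(range(8)))
```

Unlike backoff, which slows the request *rate*, the semaphore caps *parallelism*; the two address different limits and are often combined.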

2. Handling 429 responses

A 429 Too Many Requests response means the client has exceeded a rate limit. The response typically includes headers indicating when the client may retry.

Typical 429 response headers:

http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1744480000
Content-Type: application/json

{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Retry after 30 seconds."
  }
}

Parse and respect Retry-After before retrying:

python
import time, httpx

def make_request_with_retry(client, url, **kwargs):
    resp = client.get(url, **kwargs)
    if resp.status_code == 429:
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"Rate limited. Waiting {retry_after}s before retry.")
        time.sleep(retry_after)
        resp = client.get(url, **kwargs)
    resp.raise_for_status()
    return resp

3. Retry and backoff patterns

Never retry immediately after a 429. Implement exponential backoff with jitter to avoid thundering herd problems when multiple clients hit the limit simultaneously.

Exponential backoff with jitter

python
import time, random, httpx

def backoff_retry(client, url, max_retries=5, method="GET", **kwargs):
    """Exponential backoff with full jitter."""
    for attempt in range(max_retries):
        resp = client.request(method, url, **kwargs)

        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp

        if attempt == max_retries - 1:
            resp.raise_for_status()
            return resp

        # Respect Retry-After if present, otherwise use exponential backoff
        retry_after = resp.headers.get("Retry-After")
        if retry_after:
            delay = float(retry_after)
        else:
            # Full jitter: sleep random(0, min(cap, base * 2^attempt))
            cap = 60
            base = 1
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))

        print(f"Attempt {attempt + 1} failed ({resp.status_code}). "
              f"Retrying in {delay:.1f}s.")
        time.sleep(delay)

typescript
// TypeScript: backoff retry with Retry-After support
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, options);

    if (res.ok) return res;
    if (![429, 500, 502, 503, 504].includes(res.status)) {
      throw new Error(`Request failed: ${res.status}`);
    }
    if (attempt === maxRetries - 1) throw new Error("Max retries exceeded");

    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseFloat(retryAfter) * 1000
      : Math.min(60_000, 1000 * 2 ** attempt) * (0.5 + Math.random() * 0.5);

    console.log(`Retry ${attempt + 1} after ${(delay / 1000).toFixed(1)}s`);
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("Max retries exceeded");
}

Proactive rate limit management

  • Track X-RateLimit-Remaining and slow down before hitting 0
  • Use request queues with configurable concurrency limits in bulk operations
  • Spread batch workloads across time rather than firing all at once
  • Cache API responses where the data changes infrequently
  • Request a rate limit increase from the provider before deploying high-traffic workloads
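
The first bullet can be sketched as a small pacing helper. The header names and the low-water threshold below are assumptions; adjust both for your provider:

```python
import time

LOW_WATER = 10  # slow down once fewer than this many requests remain (tuning assumption)

def throttle_from_headers(headers: dict, now=None) -> float:
    """Return seconds to sleep before the next request, based on
    X-RateLimit-Remaining / X-RateLimit-Reset (header names vary by provider)."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", LOW_WATER + 1))
    if remaining > LOW_WATER:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", 0))
    window_left = max(0.0, reset - now)
    # Spread the remaining budget evenly over what's left of the window
    return window_left / max(remaining, 1)

# 5 requests left, window resets in 10s -> pace at roughly one request per 2s
delay = throttle_from_headers(
    {"X-RateLimit-Remaining": "5", "X-RateLimit-Reset": "110"}, now=100
)
```

Sleeping for the returned delay keeps the client under the limit without ever triggering a 429, which is cheaper than reacting to one.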

4. Provider rate limit comparison

Rate limits from our tracked API directory. Figures represent default tier limits from public documentation — actual limits vary by plan.

Provider | API | Rate Limit
OpenAI | Chat Completions API | 500 RPM (Tier 1)
Anthropic | Messages API | 1,000 RPM (Tier 1)
Google | Gemini API | 360 RPM (paid)
Cohere | Generate API | 100 RPM (trial)
Replicate | Predictions API | 600 RPM
Hugging Face | Inference API | 30 RPM (free)
Stripe | Payment Intents API | 100 read RPM / 100 write RPM
Twilio | Messaging API | Varies by account
Resend | Email API | 10 RPM (free), 100+ RPM (paid)
Amazon Web Services | S3 API | 5,500 GET/s, 3,500 PUT/s per prefix
Cloudflare | R2 Storage API | Unlimited (fair use)
PlanetScale | Database API | 1,000 connections

Source: public provider documentation and community reports. Not live data. See individual API pages for full details.

5. Rate limit response headers

Rate limit headers vary by provider. The most common patterns:

Header | Meaning | Example
Retry-After | Seconds until retry is safe (or HTTP date) | 30
X-RateLimit-Limit | Max requests allowed in the window | 500
X-RateLimit-Remaining | Requests remaining in current window | 492
X-RateLimit-Reset | Unix timestamp when window resets | 1744480000
X-RateLimit-Reset-Requests | Seconds until request limit resets (OpenAI) | 6
X-RateLimit-Reset-Tokens | Seconds until token limit resets (OpenAI) | 0
RateLimit-Policy | IETF draft: describes the policy | 100;w=1;burst=200;policy="leaky bucket"
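
As the table notes, Retry-After may carry either delta-seconds or an HTTP-date (RFC 9110 allows both), so robust clients should handle the two forms. A small parser sketch (the function name is ours):

```python
import email.utils
import time

def parse_retry_after(value: str, now=None) -> float:
    """Return seconds to wait. Retry-After is either delta-seconds
    (e.g. "30") or an HTTP-date (e.g. "Wed, 21 Oct 2015 07:28:00 GMT")."""
    now = time.time() if now is None else now
    try:
        return max(0.0, float(value))
    except ValueError:
        # Fall back to the HTTP-date form
        dt = email.utils.parsedate_to_datetime(value)
        return max(0.0, dt.timestamp() - now)
```

Dropping this in for the bare `float(retry_after)` / `int(...)` calls in the earlier examples prevents a crash when a provider sends the date form.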