Understanding API Rate Limits

Rate limiting prevents any single client from overwhelming a shared API infrastructure. Understanding how providers implement it — and how to handle limit errors gracefully — is essential for building reliable integrations.

1. Rate limiting strategies

Providers use different algorithms to count and enforce limits. The most common are:

Token Bucket

A bucket holds up to N tokens and refills at a steady rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. This allows short bursts above the sustained rate as long as tokens have accumulated.

Used by: OpenAI (RPM per tier), Anthropic, most AI inference APIs.
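
The mechanics are easy to sketch. Below is a minimal, illustrative token bucket in Python; the class and parameter names are our own, not any provider's implementation:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: holds up to `capacity` tokens,
    refilling at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full (allows an initial burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=10)  # 10 req/s sustained, burst of 5
results = [bucket.allow() for _ in range(7)]
# The first 5 calls drain the accumulated burst; the rest are rejected
# until the bucket refills.
```

Because the bucket refills continuously, a client that has been idle can burst up to `capacity` requests at once before settling back to the sustained `refill_rate`.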

Sliding Window

Tracks request timestamps over a rolling time window (e.g., the last 60 seconds). More accurate than fixed windows because it avoids the boundary spike problem — a client can't double their burst by sending requests at the end of one window and the start of the next.

Used by: Stripe, Algolia, Cloudflare APIs.
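
A sliding-window log can be sketched in a few lines (illustrative only; production implementations typically use Redis sorted sets or a counter approximation rather than an in-memory deque):

```python
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window log: keep timestamps of accepted requests and
    drop any older than the window before counting."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have aged out of the rolling window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
allowed = [limiter.allow(t) for t in (0, 0, 0, 30, 61)]
# Three requests at t=0 fill the window, so t=30 is rejected;
# by t=61 the t=0 entries have aged out and the request is accepted.
```

Note how a fixed window would have reset the count at t=60 regardless of when the earlier requests arrived; the rolling eviction is what prevents the boundary spike.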

Fixed Window

Counts requests within discrete time buckets (e.g., 0–60s, 60–120s). Simple to implement but vulnerable to burst traffic at window boundaries. Common in legacy APIs and simpler rate-limiting implementations.

Used by: Some payment and communication APIs on their free tiers.

Concurrent Request Limits

Some APIs limit the number of in-flight requests at any moment, regardless of rate. This is common for compute-intensive APIs (LLM inference, video processing) to prevent resource exhaustion. A 429 with a concurrent-limit message means you need to queue or reduce parallelism, not just slow your request rate.

Used by: Replicate, Hugging Face Inference API, video processing APIs.
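
On the client side, the standard way to stay under a concurrent-request cap is a semaphore bounding in-flight work. A sketch with asyncio, using a short sleep as a stand-in for the real API call and an assumed cap of 4:

```python
import asyncio

MAX_CONCURRENT = 4  # the provider's concurrent-request cap (assumed)

async def fetch_one(sem: asyncio.Semaphore, item: int) -> int:
    async with sem:                # at most MAX_CONCURRENT bodies run at once
        await asyncio.sleep(0.01)  # stand-in for the actual API call
        return item * 2

async def fetch_all(items):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    # gather() preserves input order even though completion order varies
    return await asyncio.gather(*(fetch_one(sem, i) for i in items))

results = asyncio.run(fetch_all(range(8)))
```

Unlike backoff, which slows the request *rate*, the semaphore caps *parallelism*; the two address different limits and are often combined.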

2. Handling 429 responses

A 429 Too Many Requests response means the client has exceeded a rate limit. The response typically includes headers indicating when the client may retry.

Typical 429 response headers:

http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1744480000
Content-Type: application/json

{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Retry after 30 seconds."
  }
}

Parse and respect Retry-After before retrying:

python
import time, httpx

def make_request_with_retry(client, url, **kwargs):
    resp = client.get(url, **kwargs)
    if resp.status_code == 429:
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"Rate limited. Waiting {retry_after}s before retry.")
        time.sleep(retry_after)
        resp = client.get(url, **kwargs)
    resp.raise_for_status()
    return resp

3. Retry and backoff patterns

Never retry immediately after a 429. Implement exponential backoff with jitter to avoid thundering herd problems when multiple clients hit the limit simultaneously.

Exponential backoff with jitter

python
import time, random, httpx

def backoff_retry(client, url, max_retries=5, method="GET", **kwargs):
    """Exponential backoff with full jitter."""
    for attempt in range(max_retries):
        resp = client.request(method, url, **kwargs)

        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp

        if attempt == max_retries - 1:
            resp.raise_for_status()
            return resp

        # Respect Retry-After if present, otherwise use exponential backoff
        retry_after = resp.headers.get("Retry-After")
        if retry_after:
            delay = float(retry_after)
        else:
            # Full jitter: sleep random(0, min(cap, base * 2^attempt))
            cap = 60
            base = 1
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))

        print(f"Attempt {attempt + 1} failed ({resp.status_code}). "
              f"Retrying in {delay:.1f}s.")
        time.sleep(delay)

typescript
// TypeScript: backoff retry with Retry-After support
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, options);

    if (res.ok) return res;
    if (![429, 500, 502, 503, 504].includes(res.status)) {
      throw new Error(`Request failed: ${res.status}`);
    }
    if (attempt === maxRetries - 1) throw new Error("Max retries exceeded");

    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseFloat(retryAfter) * 1000
      : Math.min(60_000, 1000 * 2 ** attempt) * (0.5 + Math.random() * 0.5);

    console.log(`Retry ${attempt + 1} after ${(delay / 1000).toFixed(1)}s`);
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("Max retries exceeded");
}

Proactive rate limit management

  • Track X-RateLimit-Remaining and slow down before hitting 0
  • Use request queues with configurable concurrency limits in bulk operations
  • Spread batch workloads across time rather than firing all at once
  • Cache API responses where the data changes infrequently
  • Request a rate limit increase from the provider before deploying high-traffic workloads
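
The first bullet can be sketched as a small pacing helper. The header names and the low-water threshold below are assumptions; adjust both for your provider:

```python
import time

LOW_WATER = 10  # slow down once fewer than this many requests remain (tuning assumption)

def throttle_from_headers(headers: dict, now=None) -> float:
    """Return seconds to sleep before the next request, based on
    X-RateLimit-Remaining / X-RateLimit-Reset (header names vary by provider)."""
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", LOW_WATER + 1))
    if remaining > LOW_WATER:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", 0))
    window_left = max(0.0, reset - now)
    # Spread the remaining budget evenly over what's left of the window
    return window_left / max(remaining, 1)

# 5 requests left, window resets in 10s -> pace at roughly one request per 2s
delay = throttle_from_headers(
    {"X-RateLimit-Remaining": "5", "X-RateLimit-Reset": "110"}, now=100
)
```

Sleeping for the returned delay keeps the client under the limit without ever triggering a 429, which is cheaper than reacting to one.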

4. Provider rate limit comparison

Rate limits from our tracked API directory. Figures represent default tier limits from public documentation — actual limits vary by plan.

Provider | API | Rate Limit
OpenAI | Chat Completions API | 500 RPM (Tier 1)
Anthropic | Messages API | 1,000 RPM (Tier 1)
Google | Gemini API | 360 RPM (paid)
Cohere | Generate API | 100 RPM (trial)
Replicate | Predictions API | 600 RPM
Hugging Face | Inference API | 30 RPM (free)
Stripe | Payment Intents API | 100 read RPM / 100 write RPM
Twilio | Messaging API | Varies by account
Resend | Email API | 10 RPM (free), 100+ RPM (paid)
Amazon Web Services | S3 API | 5,500 GET/s, 3,500 PUT/s per prefix
Cloudflare | R2 Storage API | Unlimited (fair use)
PlanetScale | Database API | 1,000 connections

Source: public provider documentation and community reports. Not live data. See individual API pages for full details.

5. Rate limit response headers

Rate limit headers vary by provider. The most common patterns:

Header | Meaning | Example
Retry-After | Seconds until retry is safe (or HTTP date) | 30
X-RateLimit-Limit | Max requests allowed in the window | 500
X-RateLimit-Remaining | Requests remaining in current window | 492
X-RateLimit-Reset | Unix timestamp when window resets | 1744480000
X-RateLimit-Reset-Requests | Seconds until request limit resets (OpenAI) | 6
X-RateLimit-Reset-Tokens | Seconds until token limit resets (OpenAI) | 0
RateLimit-Policy | IETF draft: describes the policy | 100;w=1;burst=200;policy="leaky bucket"
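
As the table notes, Retry-After may carry either delta-seconds or an HTTP-date (RFC 9110 allows both), so robust clients should handle the two forms. A small parser sketch (the function name is ours):

```python
import email.utils
import time

def parse_retry_after(value: str, now=None) -> float:
    """Return seconds to wait. Retry-After is either delta-seconds
    (e.g. "30") or an HTTP-date (e.g. "Wed, 21 Oct 2015 07:28:00 GMT")."""
    now = time.time() if now is None else now
    try:
        return max(0.0, float(value))
    except ValueError:
        # Fall back to the HTTP-date form
        dt = email.utils.parsedate_to_datetime(value)
        return max(0.0, dt.timestamp() - now)
```

Dropping this in for the bare `float(retry_after)` / `int(...)` calls in the earlier examples prevents a crash when a provider sends the date form.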