# Understanding API Rate Limits
Rate limiting prevents any single client from overwhelming a shared API infrastructure. Understanding how providers implement it — and how to handle limit errors gracefully — is essential for building reliable integrations.
## 1. Rate limiting strategies
Providers use different algorithms to count and enforce limits. The most common are:
### Token Bucket
A bucket holds up to N tokens and refills at a steady rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket is empty, the request is rejected or queued. This allows short bursts above the sustained rate as long as tokens have accumulated.
Used by: OpenAI (RPM per tier), Anthropic, most AI inference APIs.
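The refill-and-consume logic can be sketched as a minimal in-process limiter (illustrative only — the class name and parameters are this sketch's own; real deployments typically keep bucket state in shared storage such as Redis):

```python
import time

class TokenBucket:
    """Minimal token bucket: holds up to `capacity` tokens, refills steadily."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full: allows an initial burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means 'rate limited'."""
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A bucket with capacity 10 and a refill rate of 10 tokens/second permits a burst of 10 back-to-back requests, then sustains roughly one request every 100 ms.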
### Sliding Window
Tracks request timestamps over a rolling time window (e.g., the last 60 seconds). More accurate than fixed windows because it avoids the boundary spike problem — a client can't double their burst by sending requests at the end of one window and the start of the next.
Used by: Stripe, Algolia, Cloudflare APIs.
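A rolling-window check can be sketched with a timestamp queue (names and structure are this sketch's own; production implementations usually approximate this with counters to avoid storing every timestamp):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests within any rolling `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.timestamps: deque = deque()  # monotonic times of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have aged out of the rolling window
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Because the window rolls continuously, a burst at 59 s followed by another at 61 s still counts against the same 60-second span — the boundary spike the fixed-window approach allows.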
### Fixed Window
Counts requests within discrete time buckets (e.g., 0–60s, 60–120s). Simple to implement but vulnerable to burst traffic at window boundaries. Common in legacy APIs and simpler rate-limiting implementations.
Used by: Some payment and communication APIs on their free tiers.
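For contrast, a fixed-window counter is only a few lines (again a hypothetical sketch, not any provider's implementation):

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per discrete `window`-second bucket."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.current_bucket = None  # index of the active time bucket
        self.count = 0

    def allow(self) -> bool:
        bucket = time.monotonic() // self.window  # discrete bucket index
        if bucket != self.current_bucket:
            self.current_bucket = bucket  # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The boundary problem follows directly: a client can send `limit` requests just before a bucket rolls over and `limit` more just after, achieving twice the intended burst across a span shorter than one window.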
### Concurrent Request Limits
Some APIs limit the number of in-flight requests at any moment, regardless of rate. This is common for compute-intensive APIs (LLM inference, video processing) to prevent resource exhaustion. A 429 with a concurrent-limit message means you need to queue or reduce parallelism, not just slow your request rate.
Used by: Replicate, Hugging Face Inference API, video processing APIs.
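On the client side, capping parallelism (rather than request rate) is typically done with a semaphore. A minimal sketch, assuming async workloads — the helper name `run_limited` is this sketch's own:

```python
import asyncio

async def run_limited(tasks, max_in_flight: int):
    """Run coroutine factories with at most `max_in_flight` executing at once.

    `tasks` is a list of zero-argument callables returning awaitables;
    set `max_in_flight` to the provider's concurrent-request limit.
    """
    sem = asyncio.Semaphore(max_in_flight)

    async def run_one(task):
        async with sem:  # blocks while max_in_flight tasks are already running
            return await task()

    return await asyncio.gather(*(run_one(t) for t in tasks))
```

Reducing `max_in_flight` is the correct response to a concurrency 429; slowing the request rate alone may not help if requests are long-lived.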
## 2. Handling 429 responses
A 429 Too Many Requests response means the client has exceeded a rate limit. The response typically includes headers indicating when the client may retry.
Typical 429 response headers:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 500
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1744480000
Content-Type: application/json

{
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded. Retry after 30 seconds."
  }
}
```

Parse and respect `Retry-After` before retrying:
```python
import time

import httpx

def make_request_with_retry(client: httpx.Client, url: str, **kwargs) -> httpx.Response:
    resp = client.get(url, **kwargs)
    if resp.status_code == 429:
        # Assumes delta-seconds; Retry-After may also be an HTTP date
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"Rate limited. Waiting {retry_after}s before retry.")
        time.sleep(retry_after)
        resp = client.get(url, **kwargs)
    resp.raise_for_status()
    return resp
```

## 3. Retry and backoff patterns
Never retry immediately after a 429. Implement exponential backoff with jitter to avoid thundering herd problems when multiple clients hit the limit simultaneously.
### Exponential backoff with jitter
```python
import random
import time

import httpx

def backoff_retry(client: httpx.Client, url: str, max_retries: int = 5,
                  method: str = "GET", **kwargs) -> httpx.Response:
    """Exponential backoff with full jitter."""
    for attempt in range(max_retries):
        resp = client.request(method, url, **kwargs)
        if resp.status_code not in (429, 500, 502, 503, 504):
            return resp
        if attempt == max_retries - 1:
            resp.raise_for_status()
            return resp
        # Respect Retry-After if present, otherwise use exponential backoff
        retry_after = resp.headers.get("Retry-After")
        if retry_after:
            delay = float(retry_after)
        else:
            # Full jitter: sleep random(0, min(cap, base * 2^attempt))
            cap = 60
            base = 1
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))
        print(f"Attempt {attempt + 1} failed ({resp.status_code}). "
              f"Retrying in {delay:.1f}s.")
        time.sleep(delay)
```

The same pattern in TypeScript:

```typescript
// TypeScript: backoff retry with Retry-After support
async function fetchWithRetry(
  url: string,
  options: RequestInit,
  maxRetries = 5
): Promise<Response> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url, options);
    if (res.ok) return res;
    if (![429, 500, 502, 503, 504].includes(res.status)) {
      throw new Error(`Request failed: ${res.status}`);
    }
    if (attempt === maxRetries - 1) throw new Error("Max retries exceeded");
    const retryAfter = res.headers.get("Retry-After");
    const delay = retryAfter
      ? parseFloat(retryAfter) * 1000
      : Math.min(60_000, 1000 * 2 ** attempt) * (0.5 + Math.random() * 0.5);
    console.log(`Retry ${attempt + 1} after ${(delay / 1000).toFixed(1)}s`);
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("Max retries exceeded");
}
```

### Proactive rate limit management
- Track `X-RateLimit-Remaining` and slow down before hitting 0
- Use request queues with configurable concurrency limits in bulk operations
- Spread batch workloads across time rather than firing all at once
- Cache API responses where the data changes infrequently
- Request a rate limit increase from the provider before deploying high-traffic workloads
## 4. Provider rate limit comparison
Rate limits from our tracked API directory. Figures represent default tier limits from public documentation — actual limits vary by plan.
| Provider | API | Rate Limit |
|---|---|---|
| OpenAI | Chat Completions API | 500 RPM (Tier 1) |
| Anthropic | Messages API | 1,000 RPM (Tier 1) |
| Google | Gemini API | 360 RPM (paid) |
| Cohere | Generate API | 100 RPM (trial) |
| Replicate | Predictions API | 600 RPM |
| Hugging Face | Inference API | 30 RPM (free) |
| Stripe | Payment Intents API | 100 read RPM / 100 write RPM |
| Twilio | Messaging API | Varies by account |
| Resend | Email API | 10 RPM (free), 100+ RPM (paid) |
| Amazon Web Services | S3 API | 5,500 GET/s, 3,500 PUT/s per prefix |
| Cloudflare | R2 Storage API | Unlimited (fair use) |
| PlanetScale | Database API | 1,000 connections |
Source: public provider documentation and community reports. Not live data. See individual API pages for full details.
## 5. Rate limit response headers
Rate limit headers vary by provider. The most common patterns:
| Header | Meaning | Example |
|---|---|---|
| Retry-After | Seconds until retry is safe (or HTTP date) | 30 |
| X-RateLimit-Limit | Max requests allowed in the window | 500 |
| X-RateLimit-Remaining | Requests remaining in current window | 492 |
| X-RateLimit-Reset | Unix timestamp when window resets | 1744480000 |
| X-RateLimit-Reset-Requests | Seconds until request limit resets (OpenAI) | 6 |
| X-RateLimit-Reset-Tokens | Seconds until token limit resets (OpenAI) | 0 |
| RateLimit-Policy | IETF draft: describes the policy | 100;w=1;burst=200;policy="leaky bucket" |
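Since `Retry-After` may carry either delta-seconds or an HTTP-date, a robust client should handle both forms. A minimal sketch (the function name is this sketch's own):

```python
import time
from email.utils import parsedate_to_datetime

def retry_after_seconds(value: str) -> float:
    """Parse a Retry-After header: delta-seconds or an HTTP-date."""
    try:
        return float(value)  # e.g. "30"
    except ValueError:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        dt = parsedate_to_datetime(value)
        return max(0.0, dt.timestamp() - time.time())
```

A date in the past clamps to zero, so the caller can always pass the result straight to `time.sleep`.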