Free LLM Router
5 free inference APIs. 220+ RPM aggregate. $0 spend.
The only router that steers before the 429 — not after.
Every LLM routing tool (OpenRouter, RouteLLM, OmniRoute) treats a 429 as a provider outage and fails over to the next option. That's the wrong model for free-tier APIs.
Groq isn't down at 30 RPM. It has capacity — just not for you right now.
The problem isn't finding a provider that's up. It's spreading load across the aggregate free budget before any single provider hits its ceiling. Read the x-ratelimit-remaining headers on every response. Steer away before the 429 lands. Pool five providers and you have 220+ RPM at $0 — not because you found one fast provider, but because you're managing a shared budget across all of them simultaneously.
No existing router does this. They're built for paid APIs where a 429 means "try someone else." Free-tier APIs need a different approach entirely.
A local proxy that speaks the OpenAI API format. Drop it in front of any tool that uses an LLM — Claude Code, Cursor, LangChain, your own scripts.
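Anything built on the OpenAI SDK only needs its base URL swapped. A minimal sketch, assuming the proxy listens on localhost:8000 and exposes the standard /v1 paths; the port and the "auto" model alias are illustrative, not the router's actual defaults:

```python
from openai import OpenAI

# Point any OpenAI-compatible client at the local proxy instead of the provider.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed port for the local router
    api_key="unused",                     # provider keys live in the router's .env
)

resp = client.chat.completions.create(
    model="auto",  # hypothetical alias: let the router pick the provider
    messages=[{"role": "user", "content": "One-line summary of HTTP status 429."}],
)
print(resp.choices[0].message.content)
```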
One .env file. The setup wizard validates each key, makes a test call, and reports the rate-limit ceiling per provider. Under 15 minutes from cold start.
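Per provider, the wizard's job looks roughly like this; a sketch using python-dotenv and httpx, where the Groq endpoint and model name are real but the env variable name, the header fallback, and the wizard's internals are assumptions:

```python
import os

import httpx
from dotenv import load_dotenv

load_dotenv()  # pulls GROQ_API_KEY=... (and the other providers' keys) from .env


def probe(name: str, base_url: str, env_var: str, model: str) -> None:
    """Validate one key: make a tiny test call, report the rate-limit ceiling."""
    key = os.environ.get(env_var)
    if not key:
        print(f"{name}: {env_var} not set, skipping")
        return
    r = httpx.post(
        f"{base_url}/chat/completions",
        headers={"Authorization": f"Bearer {key}"},
        json={"model": model, "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1},
        timeout=30,
    )
    r.raise_for_status()
    # Most OpenAI-compatible providers expose their ceiling in a header like this
    # (the exact header name and window vary by provider).
    ceiling = r.headers.get("x-ratelimit-limit-requests", "unknown")
    print(f"{name}: key OK, ceiling {ceiling} requests")


probe("Groq", "https://api.groq.com/openai/v1", "GROQ_API_KEY", "llama-3.3-70b-versatile")
```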
x-ratelimit-remaining and x-ratelimit-reset are ingested in real time. Before each new request, the router steers away from near-limit providers — not after a 429 forces it to.
Each provider is rate-limited individually. The router treats them as a single shared budget — pooling their capacity so you're never blocked waiting for one to recover.
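A minimal sketch of that loop under assumed names and thresholds (the real parsing and steering policy will differ): update each provider's budget from the x-ratelimit-remaining / x-ratelimit-reset headers on every response, then route the next request to whoever has the most headroom, skipping anyone near their ceiling.

```python
import time
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    rpm_ceiling: int          # free-tier requests per minute
    remaining: int = 0        # last value seen in x-ratelimit-remaining
    reset_at: float = 0.0     # epoch seconds when the window resets


def headroom(p: Provider) -> float:
    if time.time() >= p.reset_at:
        return 1.0            # window has reset; assume a full budget again
    return p.remaining / p.rpm_ceiling


def ingest(p: Provider, headers: dict[str, str]) -> None:
    """Update a provider's budget from the rate-limit headers on every response."""
    if "x-ratelimit-remaining" in headers:
        p.remaining = int(headers["x-ratelimit-remaining"])
    if "x-ratelimit-reset" in headers:
        # Assumed here to be seconds-until-reset; providers vary in format.
        p.reset_at = time.time() + float(headers["x-ratelimit-reset"])


def pick(pool: list[Provider], floor: float = 0.15) -> Provider:
    """Steer before the 429: most headroom wins, near-limit providers are skipped."""
    open_providers = [p for p in pool if headroom(p) > floor]
    candidates = open_providers or pool   # if everyone is near-limit, pick the least bad
    return max(candidates, key=headroom)
```

On a pool like the one below, a burst of requests spreads across Groq, Cerebras, DeepSeek, Mistral, and Gemini in proportion to how much of each free-tier window is still open.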
| Provider | Free RPM | Model | Production use? |
|---|---|---|---|
| Groq | 30 | Llama 3.3 70B | ✓ |
| Cerebras | 30 | Qwen 3 235B | ✓ |
| DeepSeek | 60 | DeepSeek Chat | ✓ |
| Mistral | 60 | Mistral Small | dev / eval tier |
| Gemini | ~60 | Gemini 2.5 Pro | ✓ (1M ctx) |
| Total | 220+ | load-balanced | ✓ |
Keys are your own — never shared. Mistral's free tier is marked dev/eval by their ToS; upgrade to Scale for production workloads.
At 220 RPM with a 2,000-token average request, the pool delivers 26 million tokens per hour. Running two active hours per working day across the year, that's 13.7 billion free tokens — valued at Claude Haiku API rates.
How the $28K is calculated: 13.7B tokens at Claude Haiku 3.5 pricing ($0.80/M input · $4/M output, assuming a 60/40 input/output split) = $6,576 input + $21,920 output = $28,496/year in equivalent API spend — eliminated entirely.
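The same arithmetic as a quick check (260 working days assumed for "two active hours per working day"; the pricing step uses the rounded 13.7B figure, as above):

```python
rpm, avg_tokens = 220, 2_000
tokens_per_hour = rpm * 60 * avg_tokens      # 26,400,000 ≈ 26M tokens/hour
tokens_per_year = tokens_per_hour * 2 * 260  # 2 h/day × 260 working days ≈ 13.7B

haiku_in, haiku_out = 0.80, 4.00             # Claude Haiku 3.5, $ per million tokens
value = 13.7e9 * 0.60 / 1e6 * haiku_in + 13.7e9 * 0.40 / 1e6 * haiku_out
print(f"{tokens_per_year / 1e9:.1f}B tokens ≈ ${value:,.0f}/year")  # 13.7B tokens ≈ $28,496/year
```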
Other routers target paid APIs. Their signals are cost-per-token and latency. A 429 is an outage to them — they have no concept of budget steering.
| Tool | Free-tier aware? | Steers before 429? | Reads ratelimit headers? |
|---|---|---|---|
| OpenRouter | ✗ (paid focus) | ✗ | ✗ |
| RouteLLM | ✗ | ✗ | ✗ |
| OmniRoute | partial | ✗ (reacts after) | ✗ |
| LiteLLM (default) | ✗ | ✗ | ✗ |
| Free LLM Router | ✓ built for it | ✓ | ✓ real-time |
Ships as a Docker Compose bundle + Claude Code MCP server. One command to start.
Early access
Early access list. No spam. One email when it's ready.
A spawn_free_agent tool that Claude Code can call directly. Reads and research happen in a free-tier loop outside your Claude Max context window, keeping it lean for the work only Claude can do.
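Roughly how that tool gets exposed, sketched with the MCP Python SDK's FastMCP; only the tool name comes from above, while the prompt parameter, proxy URL, and model alias are assumptions:

```python
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("free-llm-router")
free_pool = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # the local proxy


@mcp.tool()
def spawn_free_agent(task: str) -> str:
    """Run a read/research task on the free-tier pool, outside Claude's context window."""
    resp = free_pool.chat.completions.create(
        model="auto",  # hypothetical alias: the router picks the provider
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    mcp.run()  # Claude Code connects to this process as an MCP server
```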
The router reads x-ratelimit-remaining on every response and steers traffic before the limit lands. Under bursty parallel workloads (5–10 simultaneous agents) the difference is significant: OpenRouter queues and retries; this router distributes load upfront and rarely retries at all.
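From the caller's side, a burst looks like this: ten parallel agents hit one proxy and the spreading happens upstream. A sketch with asyncio and the async OpenAI client, under the same port and model-alias assumptions as above:

```python
import asyncio

from openai import AsyncOpenAI

proxy = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")


async def agent(task: str) -> str:
    resp = await proxy.chat.completions.create(
        model="auto",  # hypothetical alias: the router picks a provider per request
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content


async def main() -> None:
    # Ten simultaneous agents: the proxy spreads them across the pool up front
    # instead of queueing them behind any single provider's 30 RPM ceiling.
    results = await asyncio.gather(*(agent(f"Research item {i}") for i in range(10)))
    for r in results:
        print(r[:80])


asyncio.run(main())
```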