Choosing a Model
The right model depends on your task, budget, and latency requirements. Always start from the Model Library or GET /v1/models — only live models are callable, and IDs change as the catalog grows.
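
As a minimal sketch, the list of callable IDs could be fetched as shown below. The base URL, the bearer-token header, and the response shape (`{"data": [{"id": ...}, ...]}`) are assumptions made for illustration and are not confirmed by this page; check the API reference for the actual format.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull ids out of a payload shaped like {"data": [{"id": ...}, ...]} (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_model_ids(base_url: str, api_key: str) -> list[str]:
    """Call GET /v1/models and return the ids it reports."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )
    with urllib.request.urlopen(request) as response:
        return extract_model_ids(json.load(response))
```

Resolving IDs at startup rather than hard-coding them keeps an integration working as the catalog changes.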
Quick decision tree
Do you need Arabic as the primary language? → Filter the Model Library for live multilingual chat models. Qwen-class models are a common starting point.
Do you need a context window larger than 8K tokens? → Sort the Model Library by context window and pick a live model that fits your document.
Are speed and cost your top priorities? → Sort by input price ascending and test the cheapest live chat model first.
Do you need the best quality for complex tasks? → Try a larger live model (70B class or similar). Benchmark on your own prompts before committing.
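
The questions above can be sketched as a filter over catalog records. The `ModelInfo` fields and the `shortlist` helper are hypothetical names introduced for illustration; the real catalog may expose different attributes.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    # Hypothetical catalog record; field names are illustrative only.
    id: str
    context_window: int          # maximum context size in tokens
    input_price_per_1m: float    # price per 1M input tokens
    multilingual: bool = False


def shortlist(models, *, need_arabic=False, min_context=0):
    """Keep models that satisfy the requirements, cheapest input price first."""
    candidates = [
        m for m in models
        if m.context_window >= min_context and (m.multilingual or not need_arabic)
    ]
    return sorted(candidates, key=lambda m: m.input_price_per_1m)


# Made-up entries, purely to show the call shape.
catalog = [
    ModelInfo("model-a", 8_000, 0.5),
    ModelInfo("model-b", 128_000, 1.0, multilingual=True),
    ModelInfo("model-c", 32_000, 0.2, multilingual=True),
]
arabic_candidates = [m.id for m in shortlist(catalog, need_arabic=True)]
```

A shortlist only narrows the field. Quality still has to be checked on your own prompts.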
By use case
| Use case | Where to look | Why |
|---|---|---|
| Customer support chatbot | Cheapest live chat model | Fast loops; upgrade if quality falls short |
| Arabic customer support | Live multilingual chat models | Strong Arabic without a separate integration |
| Code generation | Larger live chat models | Better instruction-following on code |
| Document summarization | Small or mid-size chat models (documents under 8K tokens) | Fast; fits most short docs |
| Long document analysis | Highest context window in catalog | Fit the full doc in one request |
| Creative writing | Larger chat models | More nuanced tone |
| Data extraction / JSON | Models with Tools badge (if using function calling) | Structured output |
| Translation | Multilingual chat models | Cross-language quality |
| High-throughput batch | Lowest input price per 1M tokens | Minimise cost at scale |
| Embeddings / RAG | Live embed models | Vectors for retrieval — separate endpoint from chat |
| Research / reasoning | Largest live chat model you can afford | Test on your hardest prompts |
Cost optimization tips
- Start cheap. Test with the lowest-priced live chat model; upgrade only when quality isn't enough.
- Develop on a small model. Iterate fast, then switch for production if needed.
- Set max_tokens. Don't let the model generate more than you need.
- Reuse system prompts. Keep a stable system message so repeated context isn't re-sent unnecessarily.
- Batch when possible. Parallel requests use your RPM budget efficiently — see Rate Limits.
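
To see how price and `max_tokens` affect spend, a rough upper bound can be computed before running a batch. This sketch assumes that pricing is quoted per 1M tokens for both input and output; the separate output price and the numbers in the example are illustrative assumptions, not figures from the catalog.

```python
def worst_case_cost(
    requests: int,
    input_tokens: int,
    max_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Upper bound on spend: every request uses its full max_tokens allowance."""
    total_input = requests * input_tokens
    total_output = requests * max_tokens
    return (
        total_input * input_price_per_1m + total_output * output_price_per_1m
    ) / 1_000_000


# Illustrative numbers only: 10,000 requests of 500 input tokens each.
capped = worst_case_cost(10_000, 500, 200, 0.10, 0.40)
uncapped = worst_case_cost(10_000, 500, 2_000, 0.10, 0.40)
```

Comparing `capped` with `uncapped` shows how much of the worst-case bill comes from output tokens alone.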
Switching models
The API is identical across models — change one string:
response = client.chat.completions.create(
    model="...",  # any live id from GET /v1/models
    messages=same_messages,
    temperature=same_temperature,
)

See Models Overview for the live comparison table.
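
Because only the model string changes, comparing candidates on the same prompt can be a short loop. In this sketch, `compare_models` and the `complete` callable are hypothetical helpers; `complete` stands in for whatever wrapper you write around `client.chat.completions.create`.

```python
from typing import Callable


def compare_models(
    model_ids: list[str],
    messages: list[dict],
    complete: Callable[[str, list[dict]], str],
) -> dict[str, str]:
    """Send the same messages to each model id and collect the replies."""
    results = {}
    for model_id in model_ids:
        try:
            results[model_id] = complete(model_id, messages)
        except Exception as exc:
            # Record the failure and keep going so one bad id does not stop the run.
            results[model_id] = f"<error: {exc}>"
    return results
```

Passing the call in as a function keeps the comparison independent of any particular client and makes it easy to try with a stub before spending tokens.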