Arabic & MENA Guide

SovereignEG is built for MENA. This guide covers best practices for Arabic NLP, dialect handling, and getting the most out of Arabic-capable models in the live catalog.

Choosing a model for Arabic

Always start from the Model Library — only live models are callable. In practice:

| Task | Where to look | Why |
|---|---|---|
| Arabic conversation | Live chat models with multilingual training (e.g. Qwen-class) | Strong Arabic without a separate integration |
| Arabic + English mixed | Same as above; sort by context window if documents are long | Handles code-switching and bilingual prompts |
| Arabic with heavy reasoning | Larger live chat models (70B class) | Better instruction-following on complex Arabic tasks |
| Arabic retrieval (RAG) | Live embedding models (e.g. BGE-M3 class) | Multilingual vectors for Arabic + English corpora |

Model IDs change as the catalog grows. Use GET /v1/models or the Model Library — never hardcode an id from a blog post.
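A minimal sketch of resolving a model id at runtime instead of hardcoding it. It assumes the GET /v1/models response follows the OpenAI-style shape {"data": [{"id": ...}]}, which this guide does not confirm. The helper names and the sample ids are hypothetical.

```python
from typing import Iterable, List, Optional


def ids_from_models_response(payload: dict) -> List[str]:
    """Extract ids from an ASSUMED payload shape: {"data": [{"id": "..."}, ...]}."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def pick_live_model(live_ids: Iterable[str], preferences: List[str]) -> Optional[str]:
    """Return the first live id containing the earliest matching preference, else None."""
    ids = list(live_ids)
    for wanted in preferences:
        for model_id in ids:
            if wanted.lower() in model_id.lower():
                return model_id
    return None


# Hypothetical payload standing in for a real GET /v1/models response.
sample_payload = {"data": [{"id": "example-qwen-chat"}, {"id": "example-llama-70b"}]}
live_ids = ids_from_models_response(sample_payload)
chosen = pick_live_model(live_ids, preferences=["qwen", "llama"])
```

Keeping preferences as substrings rather than full ids lets the choice survive catalog renames, and a None result is a clear signal to fall back or alert rather than call a model that is no longer live.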

System prompt in Arabic

Use an Arabic system prompt for Arabic tasks. This sets tone and language from the first token:

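# Assumes `client` is an already-configured OpenAI-compatible SDK client.
response = client.chat.completions.create(
    model="...",  # live model id from the catalog
    messages=[
        {
            "role": "system",
            # "You are a smart assistant who speaks Modern Standard Arabic.
            # Answer concisely and accurately."
            "content": "أنت مساعد ذكي يتحدث العربية الفصحى. أجب بشكل مختصر ودقيق."
        },
        # "What are the main challenges facing startups in Egypt?"
        {"role": "user", "content": "ما هي أهم التحديات التي تواجه الشركات الناشئة في مصر؟"}
    ]
)
print(response.choices[0].message.content)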
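One way to check that a reply actually came back in Arabic is to measure the share of letters that fall in the Unicode Arabic blocks. The sketch below is an illustrative check, not part of any SDK. The function names are hypothetical, and what threshold counts as "Arabic enough" is left to you.

```python
# Unicode blocks that hold Arabic-script letters.
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_letter(ch: str) -> bool:
    """True if ch is a letter inside one of the Arabic blocks above."""
    code = ord(ch)
    return ch.isalpha() and any(lo <= code <= hi for lo, hi in ARABIC_RANGES)


def arabic_ratio(text: str) -> float:
    """Fraction of letters in text that are Arabic; 0.0 if there are no letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_letter(ch) for ch in letters) / len(letters)
```

Digits, punctuation, whitespace, and diacritics are ignored because they are not letters, so a reply with a few English brand names still scores close to 1.0.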

Dialect handling

Strong multilingual chat models generally follow dialect instructions when you set them in the system prompt:

# Egyptian dialect
response = client.chat.completions.create(
    model="...",
    messages=[
        # "You are an assistant who speaks in the Egyptian dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة المصرية."},
        # "What are the best restaurants in Cairo?"
        {"role": "user", "content": "إيه أحسن مطاعم في القاهرة؟"}
    ]
)

# Gulf dialect
response = client.chat.completions.create(
    model="...",
    messages=[
        # "You are an assistant who speaks in the Gulf dialect."
        {"role": "system", "content": "أنت مساعد يتحدث باللهجة الخليجية."},
        # "Where should I go in Dubai?"
        {"role": "user", "content": "وين أروح في دبي؟"}
    ]
)

Test a few live candidates on your own prompts — dialect quality varies by model family.
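A small harness for that comparison might look like the sketch below. It assumes an OpenAI-style response object (response.choices[0].message.content), which matches the calls in this guide but is not confirmed here. The names make_ask and collect_dialect_samples are hypothetical.

```python
from typing import Callable, Dict, List

# (model, system_prompt, user_prompt) -> reply text
AskFn = Callable[[str, str, str], str]

# System prompts taken from the dialect examples in this guide.
DIALECT_SYSTEM_PROMPTS = {
    "egyptian": "أنت مساعد يتحدث باللهجة المصرية.",
    "gulf": "أنت مساعد يتحدث باللهجة الخليجية.",
}


def make_ask(client) -> AskFn:
    """Wrap a configured client; ASSUMES an OpenAI-style response shape."""
    def ask(model: str, system_prompt: str, user_prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        return response.choices[0].message.content
    return ask


def collect_dialect_samples(
    ask: AskFn, models: List[str], dialect: str, prompts: List[str]
) -> Dict[str, List[str]]:
    """Run every prompt against every candidate model and group replies by model."""
    system_prompt = DIALECT_SYSTEM_PROMPTS[dialect]
    return {
        model: [ask(model, system_prompt, prompt) for prompt in prompts]
        for model in models
    }
```

Passing ask in as a plain function keeps the harness testable offline with a stub, and the grouped output can go straight to a native speaker for side-by-side review.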

Arabic tokenization

Token counts for the same Arabic text vary by model family. Tokenizers whose vocabularies contain few Arabic merges fall back to byte-level pieces and often spend more tokens per Arabic word than tokenizers trained with a substantial share of Arabic data:

| Model family | Tokenizer | Typical tokens for "مرحبا بك في مصر" (4 words) |
|---|---|---|
| Multilingual (Qwen-class) | Multilingual BPE | ~8 |
| Llama-class | Byte-level BPE | ~12 |

Cost tip: Compare the EGP-per-1M-token rates in the Model Library, and run a short Arabic fixture through your top two live models before committing to one.
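The arithmetic behind that comparison is simple enough to sketch. The token counts below reuse the approximate figures from the table above. The model names and EGP rates are made up for illustration and are not real catalog prices.

```python
from typing import Dict, Tuple


def cost_egp(tokens: int, egp_per_million: float) -> float:
    """Cost in EGP for a given token count at a per-1M-token rate."""
    return tokens * egp_per_million / 1_000_000


def cheapest(candidates: Dict[str, Tuple[int, float]]) -> str:
    """Return the name whose (tokens, rate) pair yields the lowest cost."""
    return min(candidates, key=lambda name: cost_egp(*candidates[name]))


# name -> (tokens for the same Arabic fixture, HYPOTHETICAL EGP per 1M tokens)
candidates = {
    "model-a": (8, 60.0),   # fewer tokens, higher rate
    "model-b": (12, 45.0),  # more tokens, lower rate
}
best = cheapest(candidates)
```

With these illustrative numbers the model with the higher listed rate still comes out cheaper, because it needs fewer tokens for the same text. That is why the per-1M rate alone is not enough to compare Arabic workloads.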

Right-to-left (RTL) display

When displaying Arabic output in your UI:

.arabic-output {
  direction: rtl;
  text-align: right;
  font-family: 'IBM Plex Arabic', 'Noto Sans Arabic', sans-serif;
  line-height: 1.8;
}
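If the page is rendered server-side, the markup that carries this class could be produced along the following lines. This is a sketch under assumptions: the function name is hypothetical, and the paragraph wrapper is just one reasonable choice. The dir and lang attributes are standard HTML and complement the CSS above by telling browsers and screen readers the text is Arabic.

```python
from html import escape


def render_arabic_html(text: str) -> str:
    """Wrap model output in an RTL paragraph, escaping it so it cannot inject markup."""
    return f'<p class="arabic-output" dir="rtl" lang="ar">{escape(text)}</p>'


snippet = render_arabic_html("مرحبا بك في مصر")
```

Escaping matters because model output is untrusted text: a reply containing angle brackets should be displayed literally, not interpreted as HTML.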

Data residency

Standard requests route through vetted model providers today. Egypt-hosted sovereign deployments are available for regulated workloads — contact us to discuss data residency. This matters for:

  • Government contracts — many GCC governments require in-region data processing
  • Banking & finance — regulatory requirements for data residency
  • Healthcare — patient data must stay in-region in many MENA jurisdictions
  • Corporate compliance — internal policies on data sovereignty

Local currency

All usage is billed in EGP by default. The dashboard shows costs in Egyptian pounds, so there are no USD conversion surprises.