Streaming Responses Guide

This guide covers best practices for handling streaming chat-completion responses in Python, Node.js, and React.

Why stream?

  • Better UX: Users see tokens appear in real time instead of waiting for the full response, which can take several seconds
  • Lower perceived latency: The first token can be shown as soon as the model starts generating (time to first token varies by model and region)
  • Memory efficient: Process tokens as they arrive instead of buffering the full response

Python (OpenAI SDK)

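from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="...",
    messages=[{"role": "user", "content": "Explain transformers"}],
    stream=True,
)

full_response = ""
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (for example a final usage chunk) carry no choices
    content = chunk.choices[0].delta.content or ""
    full_response += content  # keep the complete text for later use
    print(content, end="", flush=True)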

Python (async)

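import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def stream_response():
    stream = await async_client.chat.completions.create(
        model="...",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True,
    )

    async for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content or ""
        print(content, end="", flush=True)

asyncio.run(stream_response())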

Node.js

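// Run as an ES module so that top-level await is available.
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const stream = await client.chat.completions.create({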
  model: '...',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true,
});
 
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}

React (Next.js)

'use client'
import { useState } from 'react'
 
export function Chat() {
  const [output, setOutput] = useState('')
  const [loading, setLoading] = useState(false)
 
  async function handleSubmit(prompt: string) {
    setLoading(true)
    setOutput('')
 
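    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt }),
      })

      const reader = response.body!.getReader()
      const decoder = new TextDecoder()

      while (true) {
        const { done, value } = await reader.read()
        if (done) break

        // stream: true keeps multi-byte characters that are split across chunks intact
        const text = decoder.decode(value, { stream: true })
        setOutput(prev => prev + text)
      }
    } finally {
      // Clear the cursor even if the request or the stream fails
      setLoading(false)
    }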
  }
 
  return <div>{output}{loading && <span className="animate-pulse">▊</span>}</div>
}

Error handling

Always handle stream interruptions:

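# `stream` is the object returned by create(..., stream=True) above;
# process() stands in for your own handler.
try:
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        content = chunk.choices[0].delta.content or ""
        process(content)
except Exception as e:
    # Text handled before the failure has already been passed to process()
    print(f"Stream interrupted: {e}")
    # Optionally retry or fall back to non-streaming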
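
The retry-or-fall-back option mentioned in the snippet above could look roughly like the sketch below. It is an illustration under stated assumptions, not SDK behavior: stream_with_fallback, start_stream, and fetch_full are hypothetical names, and the caller supplies the functions that open a stream of text pieces and fetch a complete response. The sketch retries only while nothing has been shown yet, so the user never sees duplicated text.

```python
from typing import Callable, Iterable, Iterator


def stream_with_fallback(
    start_stream: Callable[[], Iterable[str]],  # hypothetical: opens a stream of text pieces
    fetch_full: Callable[[], str],              # hypothetical: non-streaming request
    max_retries: int = 2,
) -> Iterator[str]:
    """Yield streamed text; retry, then fall back to one full response."""
    for _attempt in range(max_retries + 1):
        yielded_any = False
        try:
            for piece in start_stream():
                yielded_any = True
                yield piece
            return  # the stream finished normally
        except Exception:
            if yielded_any:
                # Partial output was already shown; replaying would duplicate it,
                # so let the caller decide what to do.
                raise
            # Nothing was shown yet, so it is safe to try again.
    # Every streaming attempt failed before producing output.
    yield fetch_full()
```

Once output has started, an interruption is re-raised rather than retried, because a second attempt may generate different text than the first. Callers that prefer to discard the partial output and start over can catch the exception and clear their display first.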

Best practices

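  1. Always flush output: pass flush=True to print in Python, or write with process.stdout.write in Node, so each token appears immediately
  2. Show a cursor: display a blinking cursor while streaming so users can tell the response is still in progress
  3. Handle [DONE]: if you parse the raw event stream yourself, it ends with data: [DONE], which is not JSON; stop reading when you see it (the official SDKs handle this for you)
  4. Set timeouts: if no token arrives within 30 seconds, the connection may be stale, so close it and retry or report an error
  5. Buffer by word: for display, hold text until a space character arrives so words render whole rather than in fragments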
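
Items 3 and 5 can be sketched as two small generators. This is a minimal sketch, assuming the input is already split into text lines (for item 3) or text pieces (for item 5); iter_sse_data and buffer_by_word are hypothetical helper names, not part of any SDK.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, stopping at the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream marker, not JSON; do not try to parse it
        yield payload


def buffer_by_word(pieces: Iterable[str]) -> Iterator[str]:
    """Regroup text pieces so each yielded string ends at a space."""
    pending = ""
    for piece in pieces:
        pending += piece
        cut = pending.rfind(" ")
        if cut != -1:
            yield pending[: cut + 1]  # everything up to and including the last space
            pending = pending[cut + 1:]
    if pending:
        yield pending  # flush the final partial word when the stream ends
```

For example, list(buffer_by_word(["Hel", "lo wor", "ld"])) gives ["Hello ", "world"]. The trailing flush matters: without it, the last word of a response would never be displayed.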