How do you combine a Python web app with an AI backend and call an LLM API to build real-time chat with streamed responses?

2026-04-27 20:40


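When a requests call to the model API does not stream, the usual cause is that the response is buffered in full: either `stream=True` was not set, or the lines are not being decoded as they arrive.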

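Pass stream=True to requests.post() explicitly. Without it, requests downloads the entire body before returning, so response.iter_lines() and response.iter_content() produce data only after the model has finished.
Most LLM APIs (OpenAI, Ollama, DashScope, and others) stream as text/event-stream, but requests does not parse SSE for you; extract the content from the data: lines yourself.
Do not call response.json(): it waits for the whole response to finish, which defeats streaming entirely.
Key part of an example:
    import json
    import requests

    def stream_chat(url, payload):
        # stream=True lets iter_lines() yield lines as they arrive
        response = requests.post(url, json=payload, stream=True)
        for line in response.iter_lines():
            if line and line.startswith(b"data:"):
                try:
                    # drop the 5-byte "data:" prefix; strip() removes the optional space
                    chunk = json.loads(line[5:].strip())
                    yield f"data: {json.dumps(chunk)}\n\n"
                except json.JSONDecodeError:
                    continue  # skip non-JSON lines such as "data: [DONE]"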
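The parsing step can be tried without a network call. The following is a minimal sketch, assuming the upstream sends one JSON object per data: line and ends with the `[DONE]` sentinel used by OpenAI-style APIs; `iter_sse_json` and the `sample` lines are hypothetical names and data made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `data:` line in an SSE body."""
    for raw in lines:
        if not raw.startswith(b"data:"):
            continue  # skip blank separators, comments and other fields
        payload = raw[len(b"data:"):].strip()
        if payload == b"[DONE]":  # assumed end-of-stream sentinel (OpenAI-style)
            break
        try:
            yield json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore payloads that are not valid JSON


# Hypothetical lines, shaped like what response.iter_lines() would return.
sample = [
    b'data: {"delta": "Hel"}',
    b"",
    b'data:{"delta": "lo"}',
    b": keep-alive comment",
    b"data: [DONE]",
]
print([chunk["delta"] for chunk in iter_sse_json(sample)])
```

Slicing by the length of the prefix and then stripping handles both `data: {...}` and `data:{...}`, which a fixed six-character slice does not.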

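Browsers and reverse proxies such as Nginx may buffer the whole response before passing it on when they see a plain text/html content type or a response without Transfer-Encoding: chunked, which makes the page look stuck. Tell the client explicitly that this is a stream and it should not wait.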

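Flask: build the reply with Response, set mimetype="text/event-stream", and add headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"} (the latter turns off Nginx proxy buffering for this response).
Django: return a StreamingHttpResponse with content_type="text/event-stream", plus the same Cache-Control and X-Accel-Buffering headers.
Do not drop the newlines after each yielded message: SSE requires every event to end with \n\n, and with only one \n the client never dispatches the event.
For local debugging, use curl -N http://localhost:8000/chat (-N turns off curl's output buffering); it is more reliable than a browser.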
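The Flask settings above can be sketched as follows. This is an illustration under assumptions, not a complete app: `format_sse`, `sse_stream`, `create_app` and the `fake_tokens` list are hypothetical, and the model call is replaced by a fixed list. Flask is imported inside `create_app` so the framing helpers run on their own.

```python
import json
from typing import Iterable, Iterator

# Headers from the list above: disable caching and Nginx proxy buffering.
SSE_HEADERS = {"Cache-Control": "no-cache", "X-Accel-Buffering": "no"}


def format_sse(data: dict) -> str:
    """Serialize one event as a single `data:` line plus the blank-line terminator."""
    return f"data: {json.dumps(data, ensure_ascii=False)}\n\n"


def sse_stream(chunks: Iterable[dict]) -> Iterator[str]:
    """Turn an iterable of payloads into SSE frames, one frame per payload."""
    for chunk in chunks:
        yield format_sse(chunk)


def create_app():
    # Imported lazily so the helpers above stay usable without Flask installed.
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/chat")
    def chat():
        fake_tokens = [{"delta": "Hel"}, {"delta": "lo"}]  # placeholder for model output
        return Response(
            sse_stream(fake_tokens),
            mimetype="text/event-stream",
            headers=SSE_HEADERS,
        )

    return app


print(repr(format_sse({"delta": "hi"})))
```

In Django the same generator can be handed to StreamingHttpResponse(sse_stream(...), content_type="text/event-stream"), with the two headers set on the response object afterwards.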

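When a FastAPI stream hangs, the cause is usually not a logic error but sync and async code mixed incorrectly. FastAPI's StreamingResponse accepts either a regular iterator or an async generator; the event loop stalls when blocking calls run inside an async function, or when the two styles are combined the wrong way round.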

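Inside async code, call the model API with httpx.AsyncClient rather than requests; requests blocks the event loop.
Write the streaming function as an async generator (async def with yield) and consume the upstream chunks with async for, for example:
    import httpx

    async def chat_stream(url, payload):
        async with httpx.AsyncClient() as client:
            async with client.stream("POST", url, json=payload) as r:
                # aiter_lines() yields decoded lines without their line endings
                async for line in r.aiter_lines():
                    if line.startswith("data:"):
                        yield line + "\n\n"  # restore the blank-line terminator
Do not use time.sleep() or other blocking calls inside the generator passed to StreamingResponse; use await asyncio.sleep() instead.
Uvicorn's --timeout-keep-alive option (5 seconds by default) controls how long an idle keep-alive connection stays open between requests; raise it if connections are being closed too early.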
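The async-generator shape can be exercised with the standard library alone. This is a minimal sketch in which `fake_upstream` stands in for `r.aiter_lines()`; `fake_upstream`, `sse_frames` and `collect` are hypothetical names, and no FastAPI or httpx is involved.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_upstream() -> AsyncIterator[str]:
    """Stand-in for an upstream line iterator; waits without blocking the loop."""
    for line in ['data: {"delta": "Hel"}', "", 'data: {"delta": "lo"}', ""]:
        await asyncio.sleep(0.01)  # yields control; time.sleep() here would block
        yield line


async def sse_frames(lines: AsyncIterator[str]) -> AsyncIterator[str]:
    """Re-emit upstream `data:` lines as complete SSE frames."""
    async for line in lines:
        if line.startswith("data:"):
            yield line + "\n\n"


async def collect() -> List[str]:
    # Drain the generator the way a streaming response would, frame by frame.
    return [frame async for frame in sse_frames(fake_upstream())]


print(asyncio.run(collect()))
```

An async generator like `sse_frames` is the kind of object that can be passed to StreamingResponse with media_type="text/event-stream".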

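When the frontend stops receiving events, either the backend's messages are malformed or the frontend is not reconnecting. SSE framing is strict: a single wrong character can keep an event from being delivered.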


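Make sure every yield from the backend is one complete event of the form data: {...}\n\n, with the JSON on a single line and no trailing commas or unescaped quotes.
EventSource reconnects automatically after a dropped connection, after roughly 3 seconds by default. Because the first token from a large model often takes longer than that, having the server send a retry: 10000 line (milliseconds) in the stream is more robust. The client side stays simple:
    const es = new EventSource("/chat", { withCredentials: true });
    es.addEventListener("message", e => console.log(e.data));
If you use fetch with a ReadableStream instead, call read() on response.body.getReader() repeatedly in a while loop until done is true; a single read() returns only the first chunk.
In Chrome DevTools, open Network → Response Headers and confirm that Content-Type: text/event-stream is present. If it is missing, the backend header did not take effect.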
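To see why a reader has to loop, and why the blank line matters, here is a simplified decoder sketched in Python. It assumes \n line endings only and handles just the data: and retry: fields; `SSEDecoder` and the `chunks` list are hypothetical and not part of any library.

```python
from typing import List, Optional


class SSEDecoder:
    """Buffer raw text chunks and emit an event only once its blank line arrives."""

    def __init__(self) -> None:
        self._buffer = ""
        self.retry_ms: Optional[int] = None  # last `retry:` value seen, if any

    def feed(self, chunk: str) -> List[str]:
        self._buffer += chunk
        events = []
        # A frame is complete only when terminated by a blank line ("\n\n").
        while "\n\n" in self._buffer:
            frame, self._buffer = self._buffer.split("\n\n", 1)
            data_lines = []
            for line in frame.split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # Drop a single leading space after the colon, if present.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
                elif line.startswith("retry:") and line[6:].strip().isdigit():
                    self.retry_ms = int(line[6:].strip())
            if data_lines:
                events.append("\n".join(data_lines))
        return events


decoder = SSEDecoder()
# Chunk boundaries rarely line up with event boundaries, hence the repeated reads.
chunks = ['retry: 10000\n\ndata: {"del', 'ta": "Hel"}\n\ndata: {"delta"', ': "lo"}\n\n']
for chunk in chunks:
    print(decoder.feed(chunk))
```

The first chunk yields no event because its data: line is still incomplete; the event appears only after the second read delivers the terminating blank line.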

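The hardest part of streaming is never the API call itself. It is that any link in the chain (the reverse proxy, the WSGI/ASGI server, the browser's parsing logic) can silently swallow chunks or force buffering. Once it works, inspect the raw byte stream layer by layer with curl -N and the Chrome Network panel.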
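One way to make hidden buffering visible from Python is to timestamp each chunk as it arrives. This is a rough sketch: `timestamped` is a hypothetical helper, and `slow_source` merely simulates a streaming body such as the one returned by response.iter_content().

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def timestamped(
    chunks: Iterable[bytes], clock: Callable[[], float] = time.monotonic
) -> Iterator[Tuple[float, bytes]]:
    """Yield (seconds since start, chunk) so buffering shows up as one late burst."""
    start = clock()
    for chunk in chunks:
        yield clock() - start, chunk


def slow_source() -> Iterator[bytes]:
    # Hypothetical stand-in for a streaming response body.
    for part in (b"data: 1\n\n", b"data: 2\n\n"):
        time.sleep(0.05)
        yield part


for elapsed, chunk in timestamped(slow_source()):
    print(f"{elapsed:6.3f}s {chunk!r}")
```

If every chunk reports nearly the same elapsed time, close to the total generation time, some layer between the model and the client is holding the stream back.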
