Building AI-Powered Web Applications: Streaming, RAG & UX

Introduction

The integration of Large Language Models (LLMs) has opened up new possibilities for web software, allowing us to build intelligent roadmaps, context-aware chatbots, and automated documentation editors. However, integrating AI features involves distinct architectural differences compared to building standard database-backed CRUD endpoints.

When building AI applications, developers must manage significant model API latencies, structure prompts to prevent formatting drift, and connect local data contexts securely. In this article, I will cover the design patterns required to build AI-powered web applications: streaming tokens over Server-Sent Events (SSE), configuring RAG with PostgreSQL pgvector, and optimizing prompt architectures.

Managing LLM Latency

Generating a response from a model like GPT-4 can take anywhere from 3 to 15 seconds. If a web application blocks page rendering until the entire response is generated, users will perceive the application as sluggish and unresponsive.

To address this latency challenge, we stream the response. As the model generates tokens, the backend forwards them immediately to the frontend. This reduces Time-to-First-Token (TTFT) to under 200 milliseconds, keeping the interface highly interactive.

Streaming with Server-Sent Events (SSE) in FastAPI

FastAPI handles streaming responses natively using Starlette's StreamingResponse class. Rather than using WebSockets (which introduces substantial protocol handshake overhead), we use Server-Sent Events (SSE). SSE establishes a unidirectional text stream over a standard HTTP connection.

Below is the streaming endpoint pattern:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_openai import ChatOpenAI
import asyncio

app = FastAPI()

async def token_generator(user_input: str):
    llm = ChatOpenAI(model="gpt-4o-mini", streaming=True)
    # Stream tokens asynchronously from LangChain
    async for chunk in llm.astream(user_input):
        # Format chunk according to the Server-Sent Events spec
        yield f"data: {chunk.content}\n\n"
    yield "data: [DONE]\n\n"

@app.get("/api/chat/stream")
async def chat_stream(message: str):
    return StreamingResponse(token_generator(message), media_type="text/event-stream")

Retrieval-Augmented Generation (RAG)

LLMs are pre-trained on public data and lack context regarding private client documents, custom code repositories, or localized service guidelines. To address this, we use Retrieval-Augmented Generation (RAG).

Instead of fine-tuning the model (which is expensive and difficult to update), RAG fetches relevant document segments from a database and appends them to the prompt context before querying the model.

Vector Storage using pgvector in Supabase

To retrieve relevant document segments, we convert text into mathematical coordinate vectors (embeddings) and store them in a vector database. We use the pgvector extension for PostgreSQL.

Below is the database schema used to index and query document embeddings:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table to store document segments and embeddings
CREATE TABLE doc_embeddings (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  content text NOT NULL,
  embedding vector(1536) NOT NULL -- 1536 dimensions for OpenAI text-embedding-3-small
);

-- Index the embeddings using HNSW for fast cosine similarity search
CREATE INDEX ON doc_embeddings USING hnsw (embedding vector_cosine_ops);

-- Database function to perform similarity queries
CREATE OR REPLACE FUNCTION match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
RETURNS TABLE (id uuid, content text, similarity float)
LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY
  SELECT
    doc_embeddings.id,
    doc_embeddings.content,
    1 - (doc_embeddings.embedding <=> query_embedding) AS similarity
  FROM doc_embeddings
  WHERE 1 - (doc_embeddings.embedding <=> query_embedding) > match_threshold
  ORDER BY doc_embeddings.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;

Prompt Engineering Guidelines

Prompt drift occurs when slight adjustments to user input cause the model to deviate from the expected output format. To prevent formatting issues in frontend components, system prompts must be strictly defined and validated.

Isolate Roles: Use system messages to define the model's role and rules, reserving user messages for input payloads.
Enforce Output Formats: Use strict schema-binding APIs (like OpenAI's structured outputs or Pydantic parsers) to guarantee JSON compatibility.
Provide Examples: Implement few-shot examples within the system prompt to guide response tone and structural formatting.

Conclusion

Building AI-powered web applications requires designing for latency and context. Implementing Server-Sent Events ensures that user interfaces remain responsive, while RAG patterns using vector databases allow models to access private documents securely.