Building Real-Time Systems at Scale: Lessons from the Trenches
I’ve spent a significant portion of my career working with real-time data systems, from building GraphQL subscriptions at Meta to designing event-driven architectures at startups. Along the way I’ve found that real-time systems come with their own set of challenges. Here’s what I’ve learned.
What Makes Real-Time Different?
In traditional request-response systems, the client asks, and the server answers. Real-time systems flip this model — the server needs to push data to clients whenever something interesting happens.
This simple inversion creates a cascade of challenges:
- Connection management: Maintaining millions of persistent connections
- State management: Tracking what each client cares about
- Message routing: Getting the right message to the right clients
- Reliability: Handling disconnects, retries, and ordering
The Architecture Evolution
Stage 1: The Naive Approach
When I built my first real-time feature, I did what everyone does:
// Server pseudo-code: broadcast every database change to every client
const clients = new Map();
socket.on('connection', (client) => {
  clients.set(client.id, client);
  client.on('close', () => clients.delete(client.id));
});
// Register the change listener once, outside the connection handler;
// registering it per connection leaks listeners and duplicates sends
database.onChange((data) => {
  clients.forEach((c) => c.send(JSON.stringify(data)));
});
This works fine for demos. It explodes at scale. Why?
- Every database change goes to every client
- No way to scale horizontally (clients stuck to one server)
- Memory grows linearly with connections
Stage 2: Pub/Sub
The first evolution is adding a pub/sub layer:
Client → Server → Redis Pub/Sub → All Servers → Clients
Now servers can scale horizontally — each server subscribes to Redis and forwards messages to its connected clients. But we still have the “every message to every client” problem.
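A minimal sketch of that fan-out, assuming the node-redis v4 client and a hypothetical localClients set holding the WebSocket connections attached to this particular server:
// Runs on every app server: subscribe to a shared channel and forward
// each message to the clients connected to *this* server only.
import { createClient } from 'redis';

const subscriber = createClient();
const publisher = createClient();
await subscriber.connect();
await publisher.connect();

const localClients = new Set(); // WebSocket connections on this server

await subscriber.subscribe('updates', (message) => {
  localClients.forEach((ws) => ws.send(message));
});

// Any server (or a backend job) can publish; every subscriber gets a copy
await publisher.publish('updates', JSON.stringify({ type: 'order_updated', orderId: '12345' }));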
Stage 3: Topic-Based Routing
The key insight is that clients don’t care about all updates. They care about specific topics:
// Client subscribes to specific resources
subscribe('order:12345');
subscribe('user:prathap:notifications');
// Server only routes relevant messages
const topic = 'order:12345';
pubsub.publish(topic, { status: 'delivered' });
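Server-side, that routing can be as simple as a map from topic to the set of clients subscribed to it. A minimal sketch, with illustrative names rather than a specific library:
// topic -> set of clients currently subscribed to it
const subscribersByTopic = new Map();

function subscribe(client, topic) {
  if (!subscribersByTopic.has(topic)) subscribersByTopic.set(topic, new Set());
  subscribersByTopic.get(topic).add(client);
}

function unsubscribe(client, topic) {
  subscribersByTopic.get(topic)?.delete(client);
}

// Only clients that asked for this topic receive the message
function publish(topic, payload) {
  const subscribers = subscribersByTopic.get(topic);
  if (!subscribers) return;
  const message = JSON.stringify({ topic, payload });
  subscribers.forEach((client) => client.send(message));
}

publish('order:12345', { status: 'delivered' });
In a multi-server deployment, each server keeps its own map, and the pub/sub layer from Stage 2 delivers the message to every server that has at least one subscriber for the topic.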
This is where GraphQL subscriptions shine. The subscription query itself defines exactly what the client wants:
subscription OrderUpdates($orderId: ID!) {
  orderStatusChanged(orderId: $orderId) {
    status
    updatedAt
    estimatedDelivery
  }
}
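On the server, that subscription usually resolves to a filtered topic stream. A sketch assuming the graphql-subscriptions package (its PubSub, withFilter, and asyncIterator helpers; exact names vary a little between versions):
import { PubSub, withFilter } from 'graphql-subscriptions';

const pubsub = new PubSub();

const resolvers = {
  Subscription: {
    orderStatusChanged: {
      // Every connected client listens to the same topic...
      subscribe: withFilter(
        () => pubsub.asyncIterator('ORDER_STATUS_CHANGED'),
        // ...but only receives events for the order it subscribed to
        (payload, variables) =>
          payload.orderStatusChanged.orderId === variables.orderId
      ),
    },
  },
};

// In the order service, after a status change is persisted:
pubsub.publish('ORDER_STATUS_CHANGED', {
  orderStatusChanged: {
    orderId: '12345',
    status: 'delivered',
    updatedAt: new Date().toISOString(),
    estimatedDelivery: null,
  },
});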
Lessons Learned
1. Connections Are Expensive
Each persistent connection consumes:
- A file descriptor on the server
- Memory for buffers and state
- Keep-alive overhead
At scale, this adds up fast. Techniques that help:
- Connection pooling at the edge (CDN/load balancer level)
- Aggressive timeouts for idle connections
- Connection coalescing: multiplexing multiple subscriptions over one connection (see the sketch below)
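A client-side sketch of that multiplexing: one socket, many logical subscriptions, each tagged with an id so incoming messages can be demultiplexed. The envelope format and endpoint are illustrative:
const ws = new WebSocket('wss://example.com/realtime'); // hypothetical endpoint
const handlers = new Map(); // subscriptionId -> callback
let nextId = 1;

function subscribe(topic, onMessage) {
  const id = String(nextId++);
  handlers.set(id, onMessage);
  ws.send(JSON.stringify({ type: 'subscribe', id, topic }));
  // Return an unsubscribe function so callers can clean up
  return () => {
    handlers.delete(id);
    ws.send(JSON.stringify({ type: 'unsubscribe', id }));
  };
}

// The server tags every outgoing message with the subscription id it belongs to
ws.onmessage = (event) => {
  const { id, payload } = JSON.parse(event.data);
  handlers.get(id)?.(payload);
};

ws.addEventListener('open', () => {
  subscribe('order:12345', (update) => console.log('order', update));
  subscribe('user:prathap:notifications', (n) => console.log('notification', n));
});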
2. Ordering Is Harder Than You Think
Imagine this sequence:
- User updates their name to “Prathap”
- The system pushes two events: the profile update itself and a notification saying “Name changed to Prathap”
- Network delay: the notification arrives before the profile update
- The user’s screen shows the old name next to a “Name changed” message
Real-time systems need explicit ordering guarantees:
// Include sequence numbers on every event
{
  seq: 42,
  event: 'name_changed',
  data: { name: 'Prathap' }
}
// The client tracks the last seq it has applied and, on a gap,
// asks the server to replay the events it missed
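A sketch of that client-side bookkeeping; requestReplay and applyEvent are hypothetical hooks into your transport and state layers:
// Hypothetical hooks: wire these to your transport and UI/state code
function requestReplay(fromSeq, toSeq) { /* ask the server to resend events fromSeq..toSeq */ }
function applyEvent(event) { /* hand the event to the UI / state layer */ }

let lastSeq = 0;
const pending = new Map(); // seq -> event that arrived ahead of order

function handleEvent(event) {
  if (event.seq <= lastSeq) return; // duplicate or stale: drop it
  pending.set(event.seq, event);
  // Apply events strictly in order; stop at the first gap
  while (pending.has(lastSeq + 1)) {
    applyEvent(pending.get(lastSeq + 1));
    pending.delete(lastSeq + 1);
    lastSeq += 1;
  }
  // Still a gap? Ask the server to resend what we're missing
  // (a real client would also de-duplicate replay requests)
  if (pending.size > 0) {
    requestReplay(lastSeq + 1, event.seq - 1);
  }
}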
3. Exactly-Once Is a Lie
In distributed systems, message delivery is either at-most-once or at-least-once; exactly-once delivery isn’t achievable in practice. Design your handlers for idempotency so that retries and duplicates are harmless:
// Bad: Side effects on every message
onMessage((event) => {
  balance += event.amount; // Double-credited on retry!
});

// Good: Idempotent handlers
onMessage((event) => {
  if (processedEvents.has(event.id)) return;
  processedEvents.add(event.id);
  balance = recalculateFromSource();
});
4. Graceful Degradation Saves Lives
Real-time features should fail gracefully, not catastrophically. When your WebSocket server goes down:
- Fall back to polling — worse experience, but works
- Queue messages — deliver when reconnected
- Show stale data with “Last updated X minutes ago”
Users understand temporary delays. They don’t understand broken apps.
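A minimal client-side sketch of the polling fallback; the REST endpoint, WebSocket URL, and render() are all placeholders for whatever the app actually uses:
function render(data) { /* update the UI */ }

const POLL_INTERVAL_MS = 15000;
let pollTimer = null;

function startPolling() {
  if (pollTimer) return;
  pollTimer = setInterval(async () => {
    // Worse latency than a push, but the feature keeps working
    const res = await fetch('/api/orders/12345'); // hypothetical REST fallback
    if (res.ok) render(await res.json());
  }, POLL_INTERVAL_MS);
}

function stopPolling() {
  clearInterval(pollTimer);
  pollTimer = null;
}

const ws = new WebSocket('wss://example.com/realtime'); // hypothetical endpoint
ws.addEventListener('open', stopPolling);
ws.addEventListener('message', (event) => render(JSON.parse(event.data)));
// Any disconnect degrades to polling instead of a broken screen
ws.addEventListener('close', startPolling);
ws.addEventListener('error', startPolling);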
5. Rate Limiting Is Non-Negotiable
One misbehaving client can take down your entire real-time infrastructure. Implement limits at every layer:
// Per-client message rate limit
const limiter = new RateLimiter({
  tokensPerSecond: 10,
  bucketSize: 100
});

// Per-topic subscription limit
const MAX_SUBSCRIPTIONS_PER_CLIENT = 100;

// Backpressure: slow down if client can't keep up
if (client.bufferSize > THRESHOLD) {
  client.pause();
}
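The RateLimiter above is pseudo-code; a token bucket is one common way to implement it (class name and numbers are illustrative):
class TokenBucket {
  constructor({ tokensPerSecond, bucketSize }) {
    this.tokensPerSecond = tokensPerSecond;
    this.bucketSize = bucketSize;
    this.tokens = bucketSize; // start full
    this.lastRefill = Date.now();
  }

  // Returns true if the message is allowed, false if the client should be throttled
  tryConsume(cost = 1) {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.bucketSize, this.tokens + elapsedSeconds * this.tokensPerSecond);
    this.lastRefill = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

const limiter = new TokenBucket({ tokensPerSecond: 10, bucketSize: 100 });
if (!limiter.tryConsume()) {
  // Drop the message, or disconnect clients that keep exceeding the limit
}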
The Tools That Helped
Over the years, these technologies have been invaluable:
- Redis Pub/Sub — Great for getting started
- Kafka — When you need durability and replay
- GraphQL Subscriptions — Perfect for web/mobile clients
- gRPC streams — For server-to-server real-time
- Socket.io / ws — Battle-tested WebSocket libraries
Monitoring Real-Time Systems
You can’t fix what you can’t measure. Key metrics to track:
| Metric | Why It Matters |
|---|---|
| Active connections | Capacity planning |
| Message latency (p50, p99) | User experience |
| Messages per second | Throughput limits |
| Subscription count per topic | Hot spots |
| Reconnection rate | Client stability |
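A sketch of exporting the first two metrics with prom-client and the ws library; the library choice and metric names are my assumptions, not necessarily what your stack uses:
import client from 'prom-client';
import { WebSocketServer } from 'ws';

const activeConnections = new client.Gauge({
  name: 'realtime_active_connections',
  help: 'Currently open WebSocket connections',
});

const messageLatency = new client.Histogram({
  name: 'realtime_message_latency_seconds',
  help: 'Time from event creation to delivery on the socket',
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  activeConnections.inc();
  ws.on('close', () => activeConnections.dec());
});

// Call right before ws.send(), using a timestamp stamped on the event at its source
function recordLatency(eventCreatedAtMs) {
  messageLatency.observe((Date.now() - eventCreatedAtMs) / 1000);
}
The p50 and p99 latencies then fall out of the histogram on the Prometheus side.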
Final Thoughts
Real-time systems are incredibly rewarding to build. There’s something magical about seeing data flow instantly from server to client. But they require careful thought about failure modes, scaling characteristics, and operational complexity.
Start simple, add complexity only when needed, and always have a fallback plan. Your future self, debugging at 3 AM, will thank you.
Have you built real-time systems? I’d love to hear about your experiences and lessons learned.