Building Real-Time Systems at Scale: Lessons from the Trenches
I’ve spent a significant portion of my career working with real-time data systems, from building GraphQL subscriptions at Meta to designing event-driven architectures at startups. Along the way I’ve found that real-time systems come with their own set of challenges. Here’s what I’ve learned.
What Makes Real-Time Different?
In traditional request-response systems, the client asks, and the server answers. Real-time systems flip this model — the server needs to push data to clients whenever something interesting happens.
This simple inversion creates a cascade of challenges:
- Connection management: Maintaining millions of persistent connections
- State management: Tracking what each client cares about
- Message routing: Getting the right message to the right clients
- Reliability: Handling disconnects, retries, and ordering
The Architecture Evolution
Stage 1: The Naive Approach
When I built my first real-time feature, I did what everyone does:
// Server pseudo-code: broadcast every database change to every client
const clients = new Map();
socket.on('connection', (client) => {
  clients.set(client.id, client);
  client.on('close', () => clients.delete(client.id));
});
// Register the change listener once, outside the connection handler;
// registering it per connection leaks listeners and duplicates sends
database.onChange((data) => {
  clients.forEach((c) => c.send(JSON.stringify(data)));
});
This works fine for demos. It explodes at scale. Why?
- Every database change goes to every client
- No way to scale horizontally (clients stuck to one server)
- Memory grows linearly with connections
Stage 2: Pub/Sub
The first evolution is adding a pub/sub layer:
Client → Server → Redis Pub/Sub → All Servers → Clients
Now servers can scale horizontally — each server subscribes to Redis and forwards messages to its connected clients. But we still have the “every message to every client” problem.
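A minimal sketch of that fan-out, assuming the node-redis v4 client and a hypothetical localClients set holding the WebSocket connections attached to this particular server:
// Runs on every app server: subscribe to a shared channel and forward
// each message to the clients connected to *this* server only.
import { createClient } from 'redis';

const subscriber = createClient();
const publisher = createClient();
await subscriber.connect();
await publisher.connect();

const localClients = new Set(); // WebSocket connections on this server

await subscriber.subscribe('updates', (message) => {
  localClients.forEach((ws) => ws.send(message));
});

// Any server (or a backend job) can publish; every subscriber gets a copy
await publisher.publish('updates', JSON.stringify({ type: 'order_updated', orderId: '12345' }));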
Stage 3: Topic-Based Routing
The key insight is that clients don’t care about all updates. They care about specific topics:
// Client subscribes to specific resources
subscribe('order:12345');
subscribe('user:prathap:notifications');
// Server only routes relevant messages
const topic = 'order:12345';
pubsub.publish(topic, { status: 'delivered' });
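Server-side, that routing can be as simple as a map from topic to the set of clients subscribed to it. A minimal sketch, with illustrative names rather than a specific library:
// topic -> set of clients currently subscribed to it
const subscribersByTopic = new Map();

function subscribe(client, topic) {
  if (!subscribersByTopic.has(topic)) subscribersByTopic.set(topic, new Set());
  subscribersByTopic.get(topic).add(client);
}

function unsubscribe(client, topic) {
  subscribersByTopic.get(topic)?.delete(client);
}

// Only clients that asked for this topic receive the message
function publish(topic, payload) {
  const subscribers = subscribersByTopic.get(topic);
  if (!subscribers) return;
  const message = JSON.stringify({ topic, payload });
  subscribers.forEach((client) => client.send(message));
}

publish('order:12345', { status: 'delivered' });
In a multi-server deployment, each server keeps its own map, and the pub/sub layer from Stage 2 delivers the message to every server that has at least one subscriber for the topic.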
This is where GraphQL subscriptions shine. The subscription query itself defines exactly what the client wants:
subscription OrderUpdates($orderId: ID!) {
  orderStatusChanged(orderId: $orderId) {
    status
    updatedAt
    estimatedDelivery
  }
}
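On the server, that subscription usually resolves to a filtered topic stream. A sketch assuming the graphql-subscriptions package (its PubSub, withFilter, and asyncIterator helpers; exact names vary a little between versions):
import { PubSub, withFilter } from 'graphql-subscriptions';

const pubsub = new PubSub();

const resolvers = {
  Subscription: {
    orderStatusChanged: {
      // Every connected client listens to the same topic...
      subscribe: withFilter(
        () => pubsub.asyncIterator('ORDER_STATUS_CHANGED'),
        // ...but only receives events for the order it subscribed to
        (payload, variables) =>
          payload.orderStatusChanged.orderId === variables.orderId
      ),
    },
  },
};

// In the order service, after a status change is persisted:
pubsub.publish('ORDER_STATUS_CHANGED', {
  orderStatusChanged: {
    orderId: '12345',
    status: 'delivered',
    updatedAt: new Date().toISOString(),
    estimatedDelivery: null,
  },
});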
Lessons Learned
1. Connections Are Expensive
Each persistent connection consumes:
- A file descriptor on the server
- Memory for buffers and state
- Keep-alive overhead
At scale, this adds up fast. Techniques that help:
- Connection pooling at the edge (CDN/load balancer level)
- Aggressive timeouts for idle connections
- Connection coalescing: multiplexing multiple subscriptions over one connection (see the sketch below)
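A client-side sketch of that multiplexing: one socket, many logical subscriptions, each tagged with an id so incoming messages can be demultiplexed. The envelope format and endpoint are illustrative:
const ws = new WebSocket('wss://example.com/realtime'); // hypothetical endpoint
const handlers = new Map(); // subscriptionId -> callback
let nextId = 1;

function subscribe(topic, onMessage) {
  const id = String(nextId++);
  handlers.set(id, onMessage);
  ws.send(JSON.stringify({ type: 'subscribe', id, topic }));
  // Return an unsubscribe function so callers can clean up
  return () => {
    handlers.delete(id);
    ws.send(JSON.stringify({ type: 'unsubscribe', id }));
  };
}

// The server tags every outgoing message with the subscription id it belongs to
ws.onmessage = (event) => {
  const { id, payload } = JSON.parse(event.data);
  handlers.get(id)?.(payload);
};

ws.addEventListener('open', () => {
  subscribe('order:12345', (update) => console.log('order', update));
  subscribe('user:prathap:notifications', (n) => console.log('notification', n));
});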
2. Ordering Is Harder Than You Think
Imagine this sequence:
- User updates their name to “Prathap”
- The system pushes two events: the profile update itself and a notification saying “Name changed to Prathap”
- Network delay: the notification arrives before the profile update
- The user’s screen shows the old name next to a “Name changed” message
Real-time systems need explicit ordering guarantees:
// Include sequence numbers on every event
{
  seq: 42,
  event: 'name_changed',
  data: { name: 'Prathap' }
}
// The client tracks the last seq it has applied and, on a gap,
// asks the server to replay the events it missed
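A sketch of that client-side bookkeeping; requestReplay and applyEvent are hypothetical hooks into your transport and state layers:
// Hypothetical hooks: wire these to your transport and UI/state code
function requestReplay(fromSeq, toSeq) { /* ask the server to resend events fromSeq..toSeq */ }
function applyEvent(event) { /* hand the event to the UI / state layer */ }

let lastSeq = 0;
const pending = new Map(); // seq -> event that arrived ahead of order

function handleEvent(event) {
  if (event.seq <= lastSeq) return; // duplicate or stale: drop it
  pending.set(event.seq, event);
  // Apply events strictly in order; stop at the first gap
  while (pending.has(lastSeq + 1)) {
    applyEvent(pending.get(lastSeq + 1));
    pending.delete(lastSeq + 1);
    lastSeq += 1;
  }
  // Still a gap? Ask the server to resend what we're missing
  // (a real client would also de-duplicate replay requests)
  if (pending.size > 0) {
    requestReplay(lastSeq + 1, event.seq - 1);
  }
}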
3. Exactly-Once Is a Lie
In distributed systems, message delivery is either at-most-once or at-least-once; exactly-once delivery isn’t achievable in practice. Design your handlers for idempotency so that retries and duplicates are harmless:
// Bad: Side effects on every message
onMessage((event) => {
  balance += event.amount; // Double-credited on retry!
});

// Good: Idempotent handlers
onMessage((event) => {
  if (processedEvents.has(event.id)) return;
  processedEvents.add(event.id);
  balance = recalculateFromSource();
});
4. Graceful Degradation Saves Lives
Real-time features should fail gracefully, not catastrophically. When your WebSocket server goes down:
- Fall back to polling — worse experience, but works
- Queue messages — deliver when reconnected
- Show stale data with “Last updated X minutes ago”
Users understand temporary delays. They don’t understand broken apps.
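A minimal client-side sketch of the polling fallback; the REST endpoint, WebSocket URL, and render() are all placeholders for whatever the app actually uses:
function render(data) { /* update the UI */ }

const POLL_INTERVAL_MS = 15000;
let pollTimer = null;

function startPolling() {
  if (pollTimer) return;
  pollTimer = setInterval(async () => {
    // Worse latency than a push, but the feature keeps working
    const res = await fetch('/api/orders/12345'); // hypothetical REST fallback
    if (res.ok) render(await res.json());
  }, POLL_INTERVAL_MS);
}

function stopPolling() {
  clearInterval(pollTimer);
  pollTimer = null;
}

const ws = new WebSocket('wss://example.com/realtime'); // hypothetical endpoint
ws.addEventListener('open', stopPolling);
ws.addEventListener('message', (event) => render(JSON.parse(event.data)));
// Any disconnect degrades to polling instead of a broken screen
ws.addEventListener('close', startPolling);
ws.addEventListener('error', startPolling);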
5. Rate Limiting Is Non-Negotiable
One misbehaving client can take down your entire real-time infrastructure. Implement limits at every layer:
// Per-client message rate limit
const limiter = new RateLimiter({
  tokensPerSecond: 10,
  bucketSize: 100
});

// Per-topic subscription limit
const MAX_SUBSCRIPTIONS_PER_CLIENT = 100;

// Backpressure: slow down if client can't keep up
if (client.bufferSize > THRESHOLD) {
  client.pause();
}
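The RateLimiter above is pseudo-code; a token bucket is one common way to implement it (class name and numbers are illustrative):
class TokenBucket {
  constructor({ tokensPerSecond, bucketSize }) {
    this.tokensPerSecond = tokensPerSecond;
    this.bucketSize = bucketSize;
    this.tokens = bucketSize; // start full
    this.lastRefill = Date.now();
  }

  // Returns true if the message is allowed, false if the client should be throttled
  tryConsume(cost = 1) {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.bucketSize, this.tokens + elapsedSeconds * this.tokensPerSecond);
    this.lastRefill = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

const limiter = new TokenBucket({ tokensPerSecond: 10, bucketSize: 100 });
if (!limiter.tryConsume()) {
  // Drop the message, or disconnect clients that keep exceeding the limit
}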
The Tools That Helped
Over the years, these technologies have been invaluable:
- Redis Pub/Sub — Great for getting started
- Kafka — When you need durability and replay
- GraphQL Subscriptions — Perfect for web/mobile clients
- gRPC streams — For server-to-server real-time
- Socket.io / ws — Battle-tested WebSocket libraries
Monitoring Real-Time Systems
You can’t fix what you can’t measure. Key metrics to track:
| Metric | Why It Matters |
|---|---|
| Active connections | Capacity planning |
| Message latency (p50, p99) | User experience |
| Messages per second | Throughput limits |
| Subscription count per topic | Hot spots |
| Reconnection rate | Client stability |
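A sketch of exporting the first two metrics with prom-client and the ws library; the library choice and metric names are my assumptions, not necessarily what your stack uses:
import client from 'prom-client';
import { WebSocketServer } from 'ws';

const activeConnections = new client.Gauge({
  name: 'realtime_active_connections',
  help: 'Currently open WebSocket connections',
});

const messageLatency = new client.Histogram({
  name: 'realtime_message_latency_seconds',
  help: 'Time from event creation to delivery on the socket',
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  activeConnections.inc();
  ws.on('close', () => activeConnections.dec());
});

// Call right before ws.send(), using a timestamp stamped on the event at its source
function recordLatency(eventCreatedAtMs) {
  messageLatency.observe((Date.now() - eventCreatedAtMs) / 1000);
}
The p50 and p99 latencies then fall out of the histogram on the Prometheus side.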
Final Thoughts
Real-time systems are incredibly rewarding to build. There’s something magical about seeing data flow instantly from server to client. But they require careful thought about failure modes, scaling characteristics, and operational complexity.
Start simple, add complexity only when needed, and always have a fallback plan. Your future self, debugging at 3 AM, will thank you.
Have you built real-time systems? I’d love to hear about your experiences and lessons learned.