⚡ Lesson 8: Performance and Scalability Considerations

Building a system without considering performance is like designing a sports car with a lawnmower engine — it might look great, but it won't get you very far! Your SDD needs to document how your system handles load today and how it will grow tomorrow. Performance is a feature, not a fix.

🎯 Learning Objectives

By the end of this lesson, you will be able to:

Explain the difference between vertical and horizontal scaling and when to use each
Identify and document key performance metrics (latency, throughput, error rate, utilization)
Design a multi-layer caching strategy and document it in your SDD
Apply database optimization techniques and explain their trade-offs
Create a performance budget with measurable targets
Choose the right scalability pattern for your architecture (microservices, event-driven, CQRS)
Plan and document a load testing strategy

Estimated Time: 30 minutes

The Highway Analogy

Imagine two roads connecting the same two cities. One is a single-lane country road — every car has to wait behind the one in front of it. The other is a multi-lane highway where traffic flows in parallel. That's the difference between a system that processes requests one at a time and one designed for concurrency. Your SDD should document expected traffic patterns and how the system handles them.

⚠️ Concurrency ≠ Parallelism

Concurrency means dealing with multiple things at once (like a single chef juggling multiple orders). Parallelism means doing multiple things at once (like multiple chefs each working an order). Your SDD should specify which approach your system uses and why — async I/O (concurrency) and multi-threaded workers (parallelism) have very different resource profiles and failure modes.
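To make the single-chef picture concrete, here is a minimal sketch using Python's asyncio. The `cook` function and the order names are hypothetical, and the short sleep is only a stand-in for an I/O wait such as a database call. One thread runs everything; the only difference between the two versions is whether it waits for each order to finish before starting the next.

```python
import asyncio


async def cook(order: str, log: list[str]) -> None:
    """One order: start it, wait on the oven (I/O), then finish it."""
    log.append(f"start {order}")
    await asyncio.sleep(0.01)  # stand-in for an I/O wait; lets other tasks run
    log.append(f"done {order}")


async def one_at_a_time(orders: list[str]) -> list[str]:
    """Sequential: each order finishes before the next one starts."""
    log: list[str] = []
    for order in orders:
        await cook(order, log)
    return log


async def concurrently(orders: list[str]) -> list[str]:
    """Concurrent: a single thread interleaves all orders while they wait."""
    log: list[str] = []
    await asyncio.gather(*(cook(order, log) for order in orders))
    return log


sequential_log = asyncio.run(one_at_a_time(["A", "B", "C"]))
concurrent_log = asyncio.run(concurrently(["A", "B", "C"]))
print(sequential_log)
print(concurrent_log)
```

In the concurrent log all three orders are started before any of them finishes, even though nothing runs in parallel. True parallelism would need multiple worker processes or threads on multiple cores, which is a separate decision to record in the SDD.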

📖 The Numbers That Matter

A server that handles one request at a time and blocks on I/O typically tops out around 100–500 requests per second, depending on how long each response takes. A well-designed async or multi-threaded system can often handle 10,000+ requests per second on the same hardware, because it keeps working while individual requests wait. The difference isn't luck; it's architecture, and architecture starts in the SDD.
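A back-of-the-envelope check shows where numbers like these come from. The sketch below assumes each request occupies a worker for 2 to 10 ms and that an async server keeps about 50 requests in flight; both figures are illustrative assumptions, not measurements, and `max_throughput` is a hypothetical helper.

```python
def max_throughput(in_flight: int, service_time_s: float) -> float:
    """Upper bound on requests/second when `in_flight` requests are
    handled at once and each one takes `service_time_s` seconds."""
    return in_flight / service_time_s


# One blocking worker: only one request is ever in flight.
blocking_slow = max_throughput(1, 0.010)  # about 100 requests/second
blocking_fast = max_throughput(1, 0.002)  # about 500 requests/second

# Async server overlapping 50 requests that each take 5 ms end to end.
overlapped = max_throughput(50, 0.005)    # about 10,000 requests/second

print(round(blocking_slow), round(blocking_fast), round(overlapped))
```

The same arithmetic works in reverse for an SDD: pick a throughput target and a latency target, and the number of requests the system must hold in flight follows.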

Vertical vs Horizontal Scaling: Growing Up vs Growing Out

Vertical scaling means adding more power to a single machine: more CPU, more RAM, faster disks. It's simple, but it has a hard ceiling: eventually you hit hardware limits, and well before that each step up costs disproportionately more than the last. Horizontal scaling means adding more machines behind a load balancer. It's more complex, but capacity keeps growing with the number of machines as long as the workload can be spread across them. Your SDD should document which strategy you're using and at what point you'd switch.

graph TB
    subgraph "Vertical Scaling (Scale Up)"
        S1["Small Server<br/>2 CPU, 4GB RAM"] --> S2["Medium Server<br/>8 CPU, 16GB RAM"]
        S2 --> S3["Large Server<br/>32 CPU, 64GB RAM"]
        S3 --> S4["Monster Server<br/>128 CPU, 256GB RAM"]
        S4 --> LIMIT1["Hardware Limits!"]
    end
    subgraph "Horizontal Scaling (Scale Out)"
        H1["Server 1"] --> LB["Load Balancer"]
        H2["Server 2"] --> LB
        H3["Server 3"] --> LB
        H4["Server N..."] --> LB
        LB --> SCALE["Infinite Scale!"]
    end
    style S1 fill:#ffccbc
    style S2 fill:#ffab91
    style S3 fill:#ff8a65
    style S4 fill:#ff7043
    style LIMIT1 fill:#ff5252
    style H1 fill:#c5e1a5
    style H2 fill:#c5e1a5
    style H3 fill:#c5e1a5
    style H4 fill:#c5e1a5
    style SCALE fill:#4caf50

💡 When to Scale Up vs Scale Out

Scale Up when: your bottleneck is CPU-bound computation (video encoding, ML inference), you need strong consistency guarantees, or your application isn't designed for distributed state. It's the right first move for many early-stage systems.

Scale Out when: you need high availability (surviving server failures), your workload is stateless or can use shared state (Redis, distributed DB), or your traffic is unpredictable and you need elastic scaling. This is the long-term strategy for most web applications.

🚫 The "We'll Scale Later" Trap

Scaling out requires stateless application design — no sticky sessions, no local file storage, no in-memory caches that can't be shared. If you design with local state and try to add horizontal scaling later, you'll end up rewriting core infrastructure under pressure. Document your scaling strategy in the SDD from day one, even if you start with a single server.

Performance Metrics: What Gets Measured Gets Managed

You can't optimize what you don't measure. Your SDD should define specific, measurable performance targets for each metric — not vague aspirations like "fast" or "reliable," but concrete numbers with measurement methodology.

⚠️ Averages Lie — Use Percentiles

An average response time of 200ms might hide the fact that 1% of requests take 5 seconds. Always document p50 (median), p95, and p99 latency targets in your SDD. The p99 is the experience of your slowest 1% of requests, and the users behind them are often your most active or most important, because heavy users make the most requests. Services at Netflix's scale look even further out, to p99.9, because 0.1% of a user base in the hundreds of millions is still hundreds of thousands of people.
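A small worked example makes the gap between the average and the tail visible. The sketch below uses the nearest-rank definition of a percentile (monitoring tools may interpolate differently) and a made-up sample of 100 requests in which 2 are slow.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative data: 98 requests at 100 ms, 2 requests at 5 seconds.
latencies_ms = [100.0] * 98 + [5000.0] * 2

report = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(report)
```

The mean comes out at 198 ms, which looks healthy, while p50 and p95 are both 100 ms and p99 is 5000 ms. Only the p99 figure tells you that some requests are 50 times slower than the rest.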

💡 The Four Golden Signals

Google's Site Reliability Engineering book identifies four golden signals that every SDD should define targets for:

Latency: How long requests take (separate successful from failed — a fast error is still a problem).

Traffic: How much demand your system is handling (requests/sec, concurrent connections, data volume).

Errors: The rate of requests that fail (HTTP 5xx, timeouts, incorrect results).

Saturation: How "full" your system is (CPU, memory, disk, network). At what utilization does performance degrade?

Caching Strategy: The Memory Game

Caching is the single most impactful performance optimization you can make. The idea is simple: store frequently accessed data closer to where it's needed so you don't repeat expensive computations or database queries. Your SDD should document each cache layer, its TTL, its invalidation strategy, and what happens on a cache miss.

graph LR
    Client["Client Request"] --> Cache{"Cache Check"}
    Cache -->|Hit| CacheData["Return Cached Data<br/>1ms"]
    Cache -->|Miss| App["Application Server"]
    App --> DB[("Database<br/>50ms")]
    DB --> App
    App --> UpdateCache["Update Cache"]
    UpdateCache --> Response["Return to Client"]
    subgraph "Cache Layers"
        L1["Browser Cache<br/>0ms"]
        L2["CDN Cache<br/>10ms"]
        L3["Redis Cache<br/>1ms"]
        L4["Application Cache<br/>0.1ms"]
    end
    style CacheData fill:#4caf50
    style Cache fill:#2196f3
    style L1 fill:#e8f5e9
    style L2 fill:#e3f2fd
    style L3 fill:#fff3e0
    style L4 fill:#fce4ec
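The hit and miss paths can be sketched as a small in-process cache with a TTL. `TTLCache`, `load_product`, and the key format are hypothetical names for illustration; a shared cache such as Redis would follow the same flow with network calls in place of the dictionary. The clock is injectable so expiry can be demonstrated without waiting.

```python
import time
from typing import Callable


class TTLCache:
    """Check the cache first; on a miss, load from the source and store
    the value with an expiry time."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[str, tuple[float, object]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, load: Callable[[], object]) -> object:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]          # hit: the source is never touched
        self.misses += 1
        value = load()               # miss or expired: go to the source
        self._entries[key] = (now + self._ttl_s, value)
        return value


db_calls: list[str] = []


def load_product() -> dict:
    db_calls.append("SELECT")        # stand-in for a real database query
    return {"id": 42, "name": "widget"}


fake_now = [0.0]
cache = TTLCache(ttl_s=60, clock=lambda: fake_now[0])
cache.get_or_load("product:42", load_product)   # miss, loads from the source
cache.get_or_load("product:42", load_product)   # hit
fake_now[0] = 61.0                              # TTL has passed
cache.get_or_load("product:42", load_product)   # expired, loads again
print(cache.hits, cache.misses, len(db_calls))
```

The hit and miss counters are worth keeping in a real implementation too, since the hit ratio is the number that tells you whether a cache layer is earning its place.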

💡 Cache Invalidation: The Hard Problem

There's a famous joke: "There are only two hard things in computer science — cache invalidation and naming things." The joke persists because it's true. Every caching decision involves a trade-off between freshness and speed. Document these patterns in your SDD:

TTL-based: Set an expiration time. Simple but may serve stale data. Good for: product catalogs, blog posts, configuration.

Write-through: Update the cache every time you update the database. Consistent but slower writes. Good for: user profiles, account settings.

Write-behind: Update the cache immediately, write to DB asynchronously. Fast but risky — data loss if the queue fails. Good for: analytics events, view counts.

Event-driven: Publish an event when data changes; subscribers invalidate their caches. Best consistency but requires event infrastructure. Good for: inventory, pricing.
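Two of these patterns, write-through and event-driven invalidation, can be sketched together. Everything here is a hypothetical stand-in: the dictionaries play the roles of a database and two caches, and `EventBus` is a synchronous in-process substitute for a real message broker.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Minimal in-process publish/subscribe."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, key: str) -> None:
        for handler in self._handlers[topic]:
            handler(key)


database: dict[str, int] = {}        # source of truth
price_cache: dict[str, int] = {}     # cache owned by the pricing code
listing_cache: dict[str, str] = {}   # cache owned by another component

bus = EventBus()
# Event-driven: the listing component drops its stale entry when told.
bus.subscribe("price_changed", lambda sku: listing_cache.pop(sku, None))


def set_price(sku: str, cents: int) -> None:
    database[sku] = cents
    price_cache[sku] = cents             # write-through: cache updated with the write
    bus.publish("price_changed", sku)    # let other caches invalidate themselves


listing_cache["sku-1"] = "Widget, $9.99"
set_price("sku-1", 1299)
print(price_cache, listing_cache)
```

The writer does not need to know which caches exist; it only announces the change. That decoupling is the benefit, and the event infrastructure is the cost.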

🚫 The "Cache Everything" Anti-Pattern

Caching data that changes frequently or is unique per user (like personalized search results) can actually hurt performance: you pay the cost of cache writes and lookups but almost never get a hit. Only cache data with a high read-to-write ratio. A good rule of thumb: if the data changes more often than it's read, don't cache it.

Real-World Example: Video Streaming Platform

Video streaming platforms like Netflix represent one of the most demanding performance challenges in software — serving terabytes of data per second to millions of concurrent users with near-zero buffering. Let's see how a Netflix-style architecture uses every performance technique we've discussed.

Scaling Netflix-Style Architecture

graph TB
    subgraph "Edge Layer"
        CDN1["CDN POP 1"]
        CDN2["CDN POP 2"]
        CDN3["CDN POP N..."]
    end
    subgraph "API Gateway"
        GW["API Gateway<br/>Rate Limiting"]
    end
    subgraph "Microservices"
        USER["User Service"]
        VIDEO["Video Service"]
        REC["Recommendation Service"]
        TRANS["Transcoding Service"]
    end
    subgraph "Data Layer"
        CACHE[("Redis Cluster")]
        DB[("Cassandra")]
        S3["Object Storage"]
    end
    CDN1 --> GW
    CDN2 --> GW
    CDN3 --> GW
    GW --> USER
    GW --> VIDEO
    GW --> REC
    USER --> CACHE
    VIDEO --> CACHE
    REC --> DB
    TRANS --> S3
    style CDN1 fill:#4caf50
    style CDN2 fill:#4caf50
    style CDN3 fill:#4caf50
    style CACHE fill:#ff9800

✅ Performance Strategies Used

Edge Caching (CDN POPs): Content is served from the nearest geographic location. A user in Tokyo gets video from a Tokyo data center, not from Virginia. Document your CDN strategy and which content is cached at the edge.

Adaptive Bitrate Streaming: The client dynamically adjusts video quality based on available bandwidth, trading resolution for uninterrupted playback so users rarely see buffering and never have to switch quality manually. This is a design decision that belongs in the SDD.

Predictive Caching: Machine learning pre-loads episodes you're likely to watch next. The content is already on your device before you click "Next Episode."

Async Processing: Video transcoding (converting raw uploads into multiple formats and resolutions) happens in background workers, not in the request path. Users don't wait for transcoding to complete.

📖 Why This Matters for Your SDD

You're not building Netflix, but the principles apply at every scale. Even a small application benefits from CDN caching, async background jobs, and geographic awareness. Your SDD should document these decisions explicitly — future developers need to know why the architecture looks the way it does, not just what it looks like.

Database Optimization: The Art of Speed

Database queries are very often the bottleneck in web applications. A single unoptimized query can bring an entire system to its knees under load. Your SDD should document your indexing strategy, query patterns, and optimization targets.

💡 The Optimization Checklist

Indexing: Add B-tree indexes on columns used in WHERE, JOIN, and ORDER BY clauses. But don't over-index — every index slows down writes and consumes disk space. Document which indexes exist and why.

Select Only What You Need: SELECT * is one of the most common performance mistakes. It fetches every column, even if you only need two. Use explicit column lists.

Avoid N+1 Queries: Fetching a list of 100 items, then making a separate query for each item's details, adds up to 101 queries. Use JOINs or batch fetching. This pattern is a leading performance killer in ORM-based applications, where lazy loading makes it easy to trigger by accident.

Use EXPLAIN: Every major database has a query execution plan tool. Use it to verify your queries are using indexes, not doing full table scans. Document expected query plans for critical paths.
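Two items from the checklist, N+1 queries and query plans, can be demonstrated with an in-memory SQLite database. The schema, table names, and the `run` helper are made up for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(i, f"author-{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                 [(i, i, f"post-{i}") for i in range(1, 101)])

stats = {"round_trips": 0}


def run(sql: str, params: tuple = ()) -> list[tuple]:
    """Execute one statement and count it as one database round trip."""
    stats["round_trips"] += 1
    return conn.execute(sql, params).fetchall()


def posts_n_plus_one() -> list[tuple]:
    posts = run("SELECT author_id, title FROM posts ORDER BY id")   # 1 query
    return [(title, run("SELECT name FROM authors WHERE id = ?",    # + 1 per row
                        (author_id,))[0][0])
            for author_id, title in posts]


def posts_joined() -> list[tuple]:
    return run("SELECT p.title, a.name FROM posts p "
               "JOIN authors a ON a.id = p.author_id ORDER BY p.id")


stats["round_trips"] = 0
slow = posts_n_plus_one()
n_plus_one_trips = stats["round_trips"]

stats["round_trips"] = 0
fast = posts_joined()
joined_trips = stats["round_trips"]


def plan(sql: str) -> str:
    """Return SQLite's query plan; the fourth column holds the description."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


lookup = "SELECT title FROM posts WHERE author_id = 7"
plan_before = plan(lookup)   # full scan of posts
conn.execute("CREATE INDEX idx_posts_author ON posts (author_id)")
plan_after = plan(lookup)    # search using idx_posts_author

print(n_plus_one_trips, joined_trips)
print(plan_before)
print(plan_after)
```

Both functions return the same rows, but the first makes 101 round trips and the second makes 1. The before and after plans are the kind of evidence worth pasting into the SDD for critical queries.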

⚠️ The Read Replica Trap

Read replicas are a powerful tool — send reads to replicas, writes to the primary. But replicas have replication lag: a user who just updated their profile might see stale data if the next page load hits a replica that hasn't caught up. Document your consistency model in the SDD: is it eventual consistency? If so, what's the maximum acceptable lag? What operations require strong consistency (reading from the primary)?
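One common way to handle the stale-profile case is to pin a user's reads to the primary for a short window after that user writes. The sketch below is one possible approach, not the only one; `ReplicaRouter` and its parameters are hypothetical, and it assumes the documented maximum replication lag is known.

```python
import time
from typing import Callable


class ReplicaRouter:
    """Routes a user's reads to the primary for `max_lag_s` seconds after
    that user's last write, and to a replica otherwise."""

    def __init__(self, max_lag_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._max_lag_s = max_lag_s
        self._clock = clock
        self._last_write_at: dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self._last_write_at[user_id] = self._clock()

    def target_for_read(self, user_id: str, require_strong: bool = False) -> str:
        if require_strong:
            return "primary"     # operations the SDD marks as strongly consistent
        last_write = self._last_write_at.get(user_id)
        if last_write is not None and self._clock() - last_write < self._max_lag_s:
            return "primary"     # the replica may not have this user's write yet
        return "replica"


fake_now = [0.0]
router = ReplicaRouter(max_lag_s=2.0, clock=lambda: fake_now[0])

before_write = router.target_for_read("alice")
router.record_write("alice")
right_after_write = router.target_for_read("alice")
other_user = router.target_for_read("bob")
fake_now[0] = 5.0
after_lag_window = router.target_for_read("alice")
print(before_write, right_after_write, other_user, after_lag_window)
```

In a horizontally scaled application the last-write timestamps would need to live somewhere every instance can see, such as a session cookie or a shared cache, rather than in one process's memory.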

🚫 The "We'll Optimize When It's Slow" Fallacy

By the time your database is slow in production, you're in crisis mode. The time to optimize is during design. Set query time budgets (no query over 100ms), identify which tables will grow fastest, and plan your sharding or partitioning strategy before you have a million rows. An ounce of SDD planning prevents a pound of 3 AM pages.

Load Testing: Stress Testing Your System

Load testing tells you where your system breaks before your users find out. Your SDD should document what type of load testing you perform, how often, and what the acceptance criteria are.

graph TD
    A["Load Testing Strategy"] --> B["Baseline Test"]
    A --> C["Stress Test"]
    A --> D["Spike Test"]
    A --> E["Endurance Test"]
    B --> B1["Normal load<br/>100 users"]
    C --> C1["Breaking point<br/>10,000 users"]
    D --> D1["Sudden surge<br/>100 → 5000 users"]
    E --> E1["Extended period<br/>1000 users/24hrs"]
    B1 --> M1["Response time?"]
    C1 --> M2["Max capacity?"]
    D1 --> M3["Recovery time?"]
    E1 --> M4["Memory leaks?"]
    style B fill:#4caf50
    style C fill:#ff9800
    style D fill:#f44336
    style E fill:#2196f3

💡 What Each Test Type Reveals

Baseline Test: Establishes normal performance — response times, throughput, and resource usage at expected daily traffic. This is your "everything is fine" benchmark.

Stress Test: Gradually increases load until the system breaks. The goal isn't to prevent breaking — it's to know where it breaks so you can set autoscaling triggers and capacity alerts.

Spike Test: Simulates sudden traffic bursts (a viral tweet, a flash sale). Tests autoscaling speed and whether the system degrades gracefully or crashes completely.

Endurance Test (Soak Test): Runs sustained load for hours or days. Catches memory leaks, connection pool exhaustion, and disk space accumulation that only appear over time.
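All four test types share the same core loop: simulate some number of users, time every request, and compare the results with acceptance criteria. The sketch below shows that loop using only the standard library. `run_load` and `flaky_endpoint` are hypothetical, the target is an in-process function rather than a real HTTP endpoint, and a dedicated load testing tool would replace all of this in practice.

```python
import itertools
import math
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_load(target: Callable[[], None], users: int, requests_per_user: int) -> dict[str, float]:
    """Run `users` simulated users in parallel threads and summarize the results."""

    def one_user() -> list[tuple[float, bool]]:
        results = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            try:
                target()
                ok = True
            except Exception:
                ok = False
            results.append((time.perf_counter() - started, ok))
        return results

    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(one_user) for _ in range(users)]
        samples = [sample for future in futures for sample in future.result()]

    durations = sorted(duration for duration, _ in samples)
    p95 = durations[math.ceil(95 * len(durations) / 100) - 1]   # nearest rank
    errors = sum(1 for _, ok in samples if not ok)
    return {"requests": len(samples), "p95_s": p95, "error_rate": errors / len(samples)}


_calls = itertools.count(1)
_lock = threading.Lock()


def flaky_endpoint() -> None:
    """Stand-in for a real request: every 10th call fails."""
    with _lock:
        call_number = next(_calls)
    if call_number % 10 == 0:
        raise RuntimeError("simulated HTTP 500")


baseline = run_load(flaky_endpoint, users=5, requests_per_user=20)
meets_criteria = baseline["p95_s"] < 0.5 and baseline["error_rate"] < 0.01
print(baseline, meets_criteria)
```

With a 10% simulated error rate the run fails the example criteria, which is the point: the acceptance thresholds, not the raw numbers, are what the SDD should pin down.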

📖 Tools to Document in Your SDD

Popular load testing tools include k6 (scriptable, developer-friendly), Locust (Python-based, distributed), Gatling (Scala-based, detailed reports), and Apache JMeter (GUI-based, legacy standard). Your SDD should specify which tool, how tests are configured, and how results are stored for comparison across releases.

Performance Budget: Speed as a Feature

A performance budget is a set of hard limits on metrics that matter to users. If a change would break the budget, it doesn't ship. This turns performance from a vague goal into an enforceable quality gate.

Performance Budget Example

Metric	Budget	Current	Status
Page Load Time	< 3s	2.1s	✅ Pass
Time to Interactive	< 5s	4.2s	✅ Pass
Bundle Size	< 200KB	245KB	❌ Fail
API Response (p95)	< 100ms	87ms	✅ Pass

Action Required: Reduce JavaScript bundle size by more than 45KB (from 245KB to under 200KB) through code splitting and tree shaking. Budget violations block deployment until resolved.

🚫 Budget Without Enforcement Is Wishful Thinking

A performance budget only works if it's automated. Add Lighthouse CI, bundlesize, or similar tools to your CI/CD pipeline so that budget violations fail the build. A performance budget that lives only in a document and is never checked is just decoration.
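As a rough idea of what automated enforcement looks like, the sketch below checks measured values against the example budget and produces a failing exit code on any violation. The metric names and the `violations` helper are hypothetical; in a real pipeline the measurements would come from your tooling's reports rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetLine:
    metric: str
    limit: float   # the measured value must be strictly below this
    unit: str


BUDGET = [
    BudgetLine("page_load_time", 3.0, "s"),
    BudgetLine("time_to_interactive", 5.0, "s"),
    BudgetLine("bundle_size", 200.0, "KB"),
    BudgetLine("api_response_p95", 100.0, "ms"),
]


def violations(measured: dict[str, float]) -> list[str]:
    """Return one message per budget line that is missing or over its limit."""
    problems = []
    for line in BUDGET:
        value = measured.get(line.metric)
        if value is None:
            problems.append(f"{line.metric}: no measurement")
        elif value >= line.limit:
            problems.append(
                f"{line.metric}: {value:g}{line.unit} is not under {line.limit:g}{line.unit}"
            )
    return problems


current = {
    "page_load_time": 2.1,
    "time_to_interactive": 4.2,
    "bundle_size": 245,
    "api_response_p95": 87,
}
failed = violations(current)
exit_code = 1 if failed else 0   # a CI job would exit with this code
print(failed, exit_code)
```

Treating a missing measurement as a violation is a deliberate choice in this sketch: a budget line that silently stops being measured is as bad as one that is never checked.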

Scalability Patterns: Building for Growth

Scalability patterns are architectural recipes for handling growth. Each pattern solves a different scaling problem, and your SDD should document which patterns you're using, why, and where the boundaries are.

💡 Choosing the Right Pattern

Microservices: Each service scales independently based on its own load. The user service can have 10 instances while the payment service only needs 2. Best when teams are also independent and deploy separately. Overhead: service discovery, network latency, distributed tracing.

Event-Driven: Producers publish events; consumers process them asynchronously. Decouples components so a slow consumer doesn't block a fast producer. Best for workflows where order processing, notifications, and analytics don't all need to happen in the same request.

CQRS (Command Query Responsibility Segregation): Separate the write model (commands) from the read model (queries). Allows each side to be optimized and scaled independently — reads often outnumber writes 100:1, so the read side can use denormalized views and aggressive caching.
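The event-driven and CQRS ideas fit together naturally, and a toy version shows the moving parts. All names below are hypothetical, the deque stands in for a message queue, and the projector is called by hand where a real system would run it as a background consumer.

```python
from collections import defaultdict, deque

events: deque = deque()                                     # stand-in for a message queue
orders: dict[int, dict] = {}                                # write model: source of truth
orders_per_customer: dict[str, int] = defaultdict(int)      # read model: denormalized view


def place_order(order_id: int, customer: str, total_cents: int) -> None:
    """Command: validate and record the write, then announce it."""
    if order_id in orders:
        raise ValueError(f"order {order_id} already exists")
    orders[order_id] = {"customer": customer, "total_cents": total_cents}
    events.append(("order_placed", order_id, customer))


def project_pending_events() -> int:
    """Consumer: apply queued events to the read model."""
    processed = 0
    while events:
        kind, _order_id, customer = events.popleft()
        if kind == "order_placed":
            orders_per_customer[customer] += 1
        processed += 1
    return processed


def order_count(customer: str) -> int:
    """Query: answered from the read model only, never from `orders`."""
    return orders_per_customer.get(customer, 0)


place_order(1, "alice", 1999)
place_order(2, "alice", 500)
count_before_projection = order_count("alice")   # read model has not caught up yet
processed = project_pending_events()
count_after_projection = order_count("alice")
print(count_before_projection, processed, count_after_projection)
```

The count is 0 until the projector runs and 2 afterwards. That window of staleness is the price of scaling reads and writes separately, and the SDD should state how long it is allowed to be.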

⚠️ Don't Over-Architect

Microservices, event-driven architecture, and CQRS are powerful, but they add significant complexity. A monolith that handles your current load with room to grow is better than a distributed system that's harder to debug than the problems it solves. Document your scaling triggers in the SDD: "We'll extract the recommendation engine into a separate service when recommendation queries exceed 30% of database load."

The Golden Rules of Performance

✅ Performance Commandments for Your SDD

Measure first, optimize second. Don't guess where the bottlenecks are — profile, trace, and instrument. Developers are notoriously bad at predicting what's slow.

Cache everything reasonable. But document your invalidation strategy and accept that cache invalidation is genuinely hard.

Database queries are expensive. Optimize queries, add indexes wisely, and batch where possible. Every round trip to the database costs milliseconds.

Async when possible. Don't block the request thread on I/O operations. Email sending, image processing, report generation — all of these belong in background workers.

Minimize payload size. Compress responses (gzip/brotli), paginate results, and send only the fields the client actually needs.

Set performance budgets. Make speed a feature, not an afterthought. Automate enforcement in CI/CD.

Plan for 10× growth. Design systems that can scale beyond current needs without a rewrite. Document the path from 1× to 10× to 100× in your SDD.

Monitor production. Real users do unexpected things. Synthetic tests are necessary but not sufficient; use monitoring and APM tools (Datadog, New Relic, Grafana) to observe real behavior.

Fail fast. Quick timeouts prevent cascade failures. A circuit breaker that returns an error in 50ms is better than a hanging request that ties up resources for 30 seconds.

Optimize the critical path. Focus on what users experience most. The login flow and main dashboard matter more than the admin settings page.
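The "fail fast" rule is the one most easily shown in code. Below is a minimal circuit breaker sketch under simplifying assumptions: it is single-threaded, counts consecutive failures only, and allows one trial call after a cool-down. `CircuitBreaker` and its parameters are hypothetical names, and production libraries offer considerably more.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("failing fast: dependency marked unhealthy")
            # Cool-down is over: allow one trial call. A single failure
            # here re-opens the breaker straight away.
            self._opened_at = None
            self._failures = self._failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result


def broken_dependency() -> str:
    raise TimeoutError("upstream timed out")


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30, clock=lambda: fake_now[0])

outcomes = []
for _ in range(3):
    try:
        breaker.call(broken_dependency)
    except CircuitOpenError:
        outcomes.append("rejected")   # no call was made to the dependency
    except TimeoutError:
        outcomes.append("failed")     # the dependency was called and failed

fake_now[0] = 31.0                    # cool-down has passed
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered)
```

The third call is rejected without touching the dependency at all, which is what keeps a slow downstream service from tying up every request thread upstream.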

In the next lesson, we'll explore testing strategies and quality assurance — because fast systems that don't work correctly are just very efficient at failing!

💡 Key Takeaway

Performance is not something you bolt on after launch — it's a design discipline that starts in the SDD. Document your scaling strategy, your performance targets (with percentiles, not averages), your caching layers, and your database optimization plan. A system that's fast at 100 users but collapses at 10,000 is a system that was never designed for success.