🗄️ Lesson 5: Data Design and Modeling

If software architecture is the skeleton of your system, then data design is its lifeblood. Just as a doctor needs to understand how blood flows through the body, developers need to understand how data flows through their applications. Get data design wrong and every feature you build on top of it will feel like swimming upstream.

🎯 Learning Objectives

By the end of this lesson, you will be able to:

Explain how databases organize data using the library catalog metaphor
Read and create Entity-Relationship (ER) diagrams for real-world domains
Apply the three normal forms to eliminate data redundancy
Trace data flows through a system using Data Flow Diagrams (DFDs)
Choose the right database type for a given use case
Plan a safe, validated data migration strategy

Estimated Time: 30 minutes

📑 In This Lesson

The Library Catalog Metaphor

Think of your database like a well-organized library. Books (data) are arranged in sections (tables), with catalog cards (indexes) helping you find what you need quickly. Without proper organization, you'd spend hours searching for a single book!

💡 Why Organization Matters

A library with 10 books doesn't need a card catalog. A library with 10 million books can't function without one. The same is true for databases — organization is cheap up front and invaluable at scale. If your queries are slow, the first question to ask isn't "do we need a faster server?" but "do we need a better index?"

Entity-Relationship Diagrams: The Family Tree of Data

ER diagrams show how your data entities relate to each other, like a family tree shows relationships between people. Each box is an entity (a "thing" you store), each line is a relationship, and the symbols at the ends of the lines tell you the cardinality — how many of one thing connect to another.

Let's build one for a social media platform:

erDiagram USER ||--o{ POST : creates USER ||--o{ COMMENT : writes USER ||--o{ FRIEND_REQUEST : sends USER ||--o{ FRIEND_REQUEST : receives POST ||--o{ COMMENT : has POST ||--o{ LIKE : receives USER ||--o{ LIKE : gives USER { int user_id PK string username UK string email UK string password_hash datetime created_at datetime last_login boolean is_active } POST { int post_id PK int user_id FK text content string image_url datetime posted_at int like_count int comment_count } COMMENT { int comment_id PK int post_id FK int user_id FK text content datetime commented_at } LIKE { int like_id PK int user_id FK int post_id FK datetime liked_at } FRIEND_REQUEST { int request_id PK int sender_id FK int receiver_id FK string status datetime sent_at datetime responded_at }

📖 Reading the Crow's Foot Notation

The symbols at each end of a relationship line tell a story. || means "exactly one," o{ means "zero or many." So USER ||--o{ POST reads: "one user creates zero or many posts." Learning to read these symbols is like learning to read musical notation — clumsy at first, fluent with practice.

⚠️ Watch for Denormalized Counters

Notice like_count and comment_count on the POST entity? These are denormalized — you could compute them by counting rows in the LIKE and COMMENT tables, but storing them directly avoids an expensive COUNT query on every page load. This is a deliberate trade-off: faster reads in exchange for extra bookkeeping on writes. Always document why you denormalized.

Normalization: The Marie Kondo of Databases

Just as Marie Kondo teaches us to organize our homes by keeping only what "sparks joy" and avoiding duplication, database normalization teaches us to organize data efficiently. The goal is a single source of truth for every piece of information.

💡 The Three Normal Forms (Quick Version)

1NF — No repeating groups: Every cell holds a single value. No comma-separated lists stuffed into one column.

2NF — No partial dependencies: Every non-key column depends on the entire primary key, not just part of it.

3NF — No transitive dependencies: Non-key columns depend on the key and nothing but the key. If column A determines column B, and B determines column C, then C should live in its own table.

⚠️ When to Stop Normalizing

Over-normalization is a real pitfall. If every query requires five JOINs, your read performance will suffer and your SQL will be painful to maintain. Normalize until it hurts, then denormalize where you can measure a concrete performance gain — and document every denormalization decision in your SDD.

Data Flow Diagrams: Following the River

Data flow diagrams (DFDs) show how information moves through your system, like following a river from its source to the ocean. External entities (the squares) push data in, processes (the circles) transform it, and data stores (the open rectangles) hold it at rest. Let's trace the journey of an online order:

📖 DFD Levels of Detail

Like the C4 model from Lesson 4, DFDs support progressive zoom. A Level 0 (context) diagram shows the whole system as a single process. Level 1 breaks that process into sub-processes (like the diagram above). Level 2 expands any sub-process that's still complex. Start at Level 0 for stakeholders, zoom in for developers.

🚫 Don't Mix Data Flow with Control Flow

A common mistake is drawing DFDs that show when things happen (sequence, conditionals, loops). That's a flowchart, not a DFD. A DFD only shows what data moves where. If you catch yourself writing "if" on a DFD arrow, stop — you've crossed into control flow territory.

Choosing the Right Database: The Tool for the Job

Like choosing between a hammer and a screwdriver, different databases serve different purposes. The worst mistake isn't picking the "wrong" database — it's using one database for everything because it's the only one you know.

✅ The Decision Framework

Need ACID compliance and complex queries? → Relational (PostgreSQL, MySQL). This is the safe default for most applications.

Flexible, nested, or document-shaped data? → Document (MongoDB, CouchDB). Great for content management and catalogs.

Sub-millisecond key lookups or caching? → Key-Value (Redis, DynamoDB). Perfect as a complementary store alongside a primary database.

Highly connected data with traversal queries? → Graph (Neo4j). Social networks, recommendation engines, fraud detection.

⚠️ Polyglot Persistence

Many production systems use multiple database types — a relational DB for transactional data, Redis for caching and sessions, and maybe Elasticsearch for full-text search. This is called polyglot persistence. It's powerful but adds operational complexity. Document each data store, its purpose, and its sync strategy in your SDD.

Data Migration: Moving Day for Your Data

Data migration is like moving to a new house — you need to plan carefully, pack properly, and make sure nothing gets lost along the way. The golden rule: never do a big-bang migration on a Friday afternoon.

graph TD A[Migration Planning] --> B[Data Analysis] B --> C[Schema Mapping] C --> D[Test Migration] D --> E{Validation OK?} E -->|No| C E -->|Yes| F[Production Migration] F --> G[Post-Migration Validation] G --> H[Cleanup & Optimization] B --> I[Identify data types] B --> J[Check constraints] B --> K[Document relationships] C --> L[Field mapping] C --> M[Transform rules] C --> N[Default values]

💡 Migration Safety Checklist

Before you start: Take a full backup. Then verify the backup actually restores. Then take another backup.

During migration: Run in a staging environment first. Compare row counts, checksums, and sample records between old and new. Automate validation — eyeballing 10 rows out of 10 million proves nothing.

After migration: Keep the old system read-only for a rollback window (typically 1–2 weeks). Monitor error rates, query performance, and data freshness in the new system before decommissioning the old one.

Best Practices: The Data Commandments

📏 The Sacred Rules of Data Design

1. Name Things Clearly: Use descriptive names. user_email beats ue every time. A good column name tells you what it holds without reading any documentation.

2. Be Consistent: If you use snake_case in one table, don't switch to camelCase in another. Pick a convention, document it, and enforce it in code review.

3. Plan for Growth: That phone_number field? Make it big enough for international numbers with country codes. That INT primary key? Consider whether you'll ever need more than 2.1 billion rows.

4. Index Wisely: Like bookmarks in a book — useful, but too many slow down writes and bloat storage. Index columns you filter, sort, or join on. Profile your actual queries before adding speculative indexes.

5. Document Everything: Future you (and your teammates) will thank present you. Every table, every column, every constraint should have a comment explaining why it exists.

6. Think About Deletion: Soft deletes (a deleted_at timestamp) let you recover mistakes and satisfy audit trails. Hard deletes save space but are permanent. Choose based on your compliance and audit needs.

7. Version Your Schema: Track schema changes with migration files — just like you track code changes with Git. Tools like Flyway, Alembic, or Knex migrations make this painless.

🚫 The "We'll Fix It Later" Trap

Schema decisions are the hardest to change after launch. Renaming a column is easy. Splitting a table, migrating millions of rows, and updating every query in the codebase? That's a multi-sprint project. Invest the time to get data design right in the SDD before the first migration file is written.

Real-World Example: Healthcare System Data Model

Let's apply everything we've covered to a domain where data design has real-world consequences — a healthcare scheduling system. Notice the use of UUIDs instead of auto-incrementing integers (better for distributed systems), encrypted fields for sensitive data, and JSON columns for semi-structured data like contact info and vital signs.

erDiagram PATIENT ||--o{ APPOINTMENT : schedules DOCTOR ||--o{ APPOINTMENT : conducts APPOINTMENT ||--|| MEDICAL_RECORD : generates PATIENT ||--o{ MEDICAL_RECORD : has MEDICAL_RECORD ||--o{ PRESCRIPTION : contains DOCTOR ||--o{ PRESCRIPTION : writes PATIENT ||--o{ INSURANCE_CLAIM : submits PATIENT { uuid patient_id PK string first_name string last_name date birth_date string ssn_encrypted json contact_info timestamp created_at } DOCTOR { uuid doctor_id PK string license_number UK string specialization json schedule boolean is_active } APPOINTMENT { uuid appointment_id PK uuid patient_id FK uuid doctor_id FK timestamp scheduled_time int duration_minutes string status text notes } MEDICAL_RECORD { uuid record_id PK uuid patient_id FK uuid appointment_id FK text diagnosis text treatment_plan json vital_signs timestamp created_at }

📖 Design Decisions Worth Documenting

In a healthcare SDD, you'd call out choices like: why ssn_encrypted uses application-level encryption (HIPAA compliance), why contact_info is JSON rather than separate columns (patients may have multiple phones, emails, and addresses that change frequently), and why APPOINTMENT to MEDICAL_RECORD is one-to-one (business rule: every visit produces exactly one record). These decisions look obvious now but will be mysteries to the next developer if left undocumented.

In the next lesson, we'll explore API and interface design — the doors and windows through which your system communicates with the world. You'll learn how to document endpoints, versioning strategies, authentication flows, and error contracts.

💡 Key Takeaway

Data outlives code. The application that reads your tables today will probably be rewritten in five years, but the data will still be there. Design your schemas for the long haul: name things clearly, normalize until it hurts, document every denormalization, and version every migration.