Basics of designing a database

Peter Brawley, Arthur Fuller

A database design specifies ...

what data needs to be stored, how, and where
relationships amongst elements of such data
how such data are to be input, organised, accessed, reported, secured

Think of it as a mapping of such needs to full specifications of database tables, columns, procedures, events and actions. When it's done, you'll have produced ...

documentation of all elements of the requirement in exhaustive detail
mappings of all such elements to database objects and actions in database definition language (DDL) and database manipulation language (DML)
reports of tests of the accuracy of those mappings
optionally, entity-relationship-diagrams (ERDs) for all database objects
the database

Getting there is a 10-step ...

1 The Raw Requirement

List everything the application has to do, down to the last detail. Note that this must include not just specifications of data; it also musty specify who will be accessing what data and who is allowed to do what with such data. If multiple apps will use the database, include everything that all the apps have to do. This first list needn't yet be organised by sequence or topic. It just has to be complete.

2 The Elements

Make a working copy of the Requirement. In it underline or otherwise mark every noun relevant to the requirement.

3 The Entities

Group the Elements from Step 2 into lists of attributes that belong together, such that ...

each list defines exactly one kind of thing that the app needs to deal with— e.g., a customer, a book, a widget,
no element is itself a list.

Here you might notice that you're building a set of trees, each tree branch ending with nodes that have no more branches. A customer has a first name, last name, SSN, home phone and so on. An order has a customer reference, a date, a shipping method, perhaps a memo text field. An order item has an order reference, an inventory reference, a quantity and a base price. A book has a title, ISBN number, publisher, publication date &c.; A widget has a name, description, supplier, perhaps a patent number.

4 The Tables

Turn the Entities from Step 3 into table specifications:

If you are using an ERD tool, how much table design you can get done in that tool depends on the its capabilities, but there'll always be much doubling back and forth amongst requirements and mapping documentation, database code, and the ERD tool.

4a: Remove as much redundancy as possible. If you are writing an order entry app, the core data will include ...

customers
  orders
    orderitems

If your app is to manage classes and instructors across multiple colleges in multiple counties, the core data will fall into a tree like this:

county
  college
    school
      department
        course
          class section
            class datetime
              instructor

If you look at such a tree top-down, each entity that isn't a leaf has a parent-child or has-a relationship to the next entity below it: a county has colleges, which have schools, etc. That's a 1:many relationship. If you look at the tree bottom-up, each entity except the root has a child-parent or belongs-to relationship to the entity above it: a course belongs to a department, which belongs to a school, etc. This is a many:1 relationship.

4b: Turn each entity into a table definition with a table name, a list of required entity attributes as columns, and a primary key definition that guarantees row uniqueness and if possible is world-independent (e.g., not a value generated by a government agency). No, a person's or firm's name won't cut it as a primary key—how many Jane Smiths are there in the world? If the key is (lastname,firstname,SSN), Jane's primary key may change when she marries, messing up table relationships that depend on that key value. Note too that in some countries including Canada, certain external unique identifiers such as Social Insurance Number legally can be used only for purposes relating to the government.

Sometimes there's a combination of unique never-changing real-world column values that will serve as an adequate primary key. More often the safest bet is a "surrogate" key auto_increment integer ID column. Using such a unique auto-generated arbitrary number to identify all table rows guarantees row uniqueness and removes the need ever to edit such primary key values. In MySQL, declare such keys integer_type unsigned primary key auto_increment where integer_type is one of ...

tinyint where there can be no more than 2⁸-1 (255) values
smallint where there can be no more than 2¹⁶-1 (~65 M) values
mediumint where there can be no more than 2²⁴-1 (~16 MM) values
int where there can be no more than 2³²-1 (~4 bln) values
bigint where there can be no more than 2⁶⁴-1 (~18 bln bln) values

4c: Make each table self-consistent. For example it seems obvious that a customer has an address. Unless the app will have homeless customers, that's usually the case. It seems natural then to give the customer table some address columns and leave it at that.

Combining addresses with other identifying info can be problematic. Changing an address doesn't change the customer; they're two different sorts of things. Many people and organisations have mutliple addresses (branch offices; school; summer, a particular job or project &c;), not to mention the very affluent who have homes in several countries.

In database-speak, the relationship from customers to addresses is 1:many, so in many cases the customer table with address columns needs to become two tables: a customer table and an address table with a customer foreign key. Likewise an orders table with order item columns becomes an orders table and an order items table with an orders foreign key. Simpler, less redundant, more reportable, more maintainable.

4d: Give each child table a column that references its parent table, to automate the database logic that keeps these table relationships intact. Such a column is a foreign key. Do not carry IDs up to grandparents; that would introduce redundancy, complicating data management and queries.

4e: Configure lookup tables not covered by steps 4c and 4d. A child can have multiple parents. Any table in the chain may also belong to a table that's not in a main chain. For example counties, schools and departments may each have a computerservicesID column that points to a computer services table, and a staffID column pointing to a staff table:

CREATE TABLE school ( ID int auto_increment primary key, name varchar(50) not null Default 'Not known', universityID int not null references university(ID), computerservicesID int not null references computerservices(ID), deanstaffID int not null references staff(ID), ... )
That's how we model schools that share a computer service, staff who work in multiple schools &c.; The diagram begins to get more interesting ...

  state <------------------+
     |                     |
    university <----- computerservices
        |                    |    |
      school <---------------+    |
          |                       |
        department <--------------+
           |
          etc

4f: Add bridge tables for many-many relationships: Students take multiple tests and each test is taken by multiple students; that's a many:many relationship. To implement it, you need a bridge table, a students_testresults table that has its own primary key column, plus a foreign key column pointing to a students table row, and a foreign key column pointing to a particular testresults table row.

Steps 4a through 4f bring your database to Third Normal Form, roughly. We can get rid of yet more redundancy, e.g., by creating a schoolnames table, replacing the name column in schools with a schoolnameID column that points to the schoolnames table row for that name. Now no name is stored twice. Whether to take normalisation that far depends on how important the redundancy is for queries, and for available storage space. Third Normal Form usually suffices.

5 The Actions

In the raw requirement (Step 1), underline or otherwise mark all the verbs to get a first outline of what actions will have to be performed on the Tables. Then organise those actions into query specifications (e.g., a shipping clerk begins by looking up an order by date, customer or order number...).

6 Build the database from the normalised model

If you are using the ERD tool in MySQL WorkBench, a button click gets it done.

It's the first test of your model: does it parse, can your RDBMS actually buold the database you defined? Inevitably, you'll find errors, inconsistencies, omissions. Rinse and repeat.

7 Seed the database with data

Populate your lookup tables, then insert test data into the database.

8 The Queries

This is the second real test of your database. Translate every Action from Step 5 into a query. Notice that this includes all anticipated inputs and reports!

9 Test loop and revise as need be

You'll find that Step 8 uncovers some aspects of the database model that need touching up. Loop back as far as need be to plug holes and correct errors.

10 Done

Write no application code till you arrive here.

For more, see ...

Database design
Normalisation
Database design guide
Database resources on the web
Relational databases
Ten common database design mistakes
Writing a specification

Back to Artful ...

Last updated 28 Aug 2024