Database Normalization
May 21, 2008 by Nathan Gardner
Database normalization is the practice of designing table structures for relational databases that prevent duplicate data and structual problems. A general rule of thumb is - whenever there are going to be multiple copies of data, they should be stored as Ids that point to a single copy of that bit of data.
It should be noted that there are several levels of normalization. Check out Wikipedia’s article on database normalization for more information on the different levels of normalization.
For example, here is an un-normalized database structure for storing employees.
| Employees | |||
| id | name | position | manager |
|---|---|---|---|
| 1 | Frank | General Manager | |
| 2 | Sam | Sales | Frank |
| 3 | Tom | Sales | Frank |
| 4 | Lucy | Receptionist | Frank |
There are 2 columns of data that have repetitive information. The Manager column, and the Position column. Here are the normalized table structures.
| Employees | |||
| id | name | positionId | managerId |
|---|---|---|---|
| 1 | Frank | 1 | |
| 2 | Sam | 2 | 1 |
| 3 | Tom | 2 | 1 |
| 4 | Lucy | 3 | 1 |
| Positions | |||
| id | name | description | |
|---|---|---|---|
| 1 | General Manager | These are general managers | |
| 2 | Sales | Sales staff | |
| 3 | Receptionist | Customer support and front desk | |
| 4 | Customer Service | Customer service reps | |
As you can see, we now have 2 tables. One that just stores Positions, and one that stores Employees. We also modified the Employees table so instead of storing the managers name, we store his employeeId. Also, instead of storing the position title, we store the positions Id in the Employees table, and the position details in the Positions table. This new table structure has several advantages.
- We can now store additional details about the position in the position table (the description for example)
- If franks name changes, it wont cause problems like the unnormalized table structure would.
- If you were to create a admin interface for this database, the form that allows you to create or edit employees could easily have a pull down for possible positions. It would also allow you to add Positions even if there are no employees in that positions.
However, with all good things there are some drawbacks. The major drawback is the table structure is slightly more complex and requires JOINS to select all data. Also, selects will be slightly slower (milliseconds) - however with the proper use of keys, queries will be just as quick - and depending on the data, possibly faster.
To better represent the advantages of normalized tables, consider the following.
Your company hires 5 temps to put 500 employees into your database. With the un-normalized table It is very possible (and likely) that receptionist would be put into the database with slightly different position titles. For example, “reception”, “Receptionists”, “Front desk”, etc. With the normalized tables, it forces the temps to assign employees to a position in the position table, and your data will be more reliable.
Now your boss asks your to write a report that shows him how many employees are in each position. With the un-normalized table, your query would return “reception” “Receptionists” and “front desk” as different positions and you will be forced to do data cleanup.
Now also consider employees can be in multiple positions. With our un-normalized table, we would have to add additional columns like so.
| Employees | |||||
| id | name | position1 | position2 | position3 | manager |
|---|---|---|---|---|---|
| 1 | Frank | General Manager | Customer Service | ||
| 2 | Sam | Sales | Customer Service | Frank | |
| 3 | Tom | Sales | Frank | ||
| 4 | Lucy | Receptionist | Frank | ||
This presents several problems. First off, employees can only be in 3 positions. Also, most employees will only be in 1 position - so there will be several rows with empty data which is a waste of space. So, how do we put employees in multiple positions in a normalized structure? Linking tables.
We modify the Employees table so it does not have a Positions column, and create a new linking table like the one below.
| EmployeesPositions | |||||
| id | employeeId | positionId | |||
|---|---|---|---|---|---|
| 1 | 1 | 1 | |||
| 2 | 1 | 4 | |||
| 3 | 2 | 2 | |||
| 4 | 2 | 4 | |||
| 5 | 3 | 2 | |||
| 6 | 4 | 3 | |||
This tells us that employee 1 (Frank) is in position 1 (General Manager) and 4 (Customer Service).
If your boss asks you to write a report that counted the number of employees in each position - it would be near impossible with the un-normalized table structure. However, with the normalized table structure - its as simple as this…
SELECT p.`name` as 'position', count(ep.`id`) as 'count' FROM EmployeesPositions ep JOIN Positions p ON ep.`positionId` = p.`id` GROUP BY ep.`positionId`



Comments
Feel free to leave a comment...
and oh, if you want a pic to show with your comment, go get a gravatar!