GCSE Link: None
In any database, we want to avoid unnecessary data duplication at all costs. If we don't, it can lead to data redundancy (which can increase the amount of storage space required) and data inconsistency (where the same piece of data apparently has different values).
Database normalisation is the process of organising a database in order to eliminate data redundancy and inconsistency.
Let's take an example database and work through the steps of normalising it:
Orders
| OrderID | OrderDate | Products | CustomerName | CustomerAddress |
|---|---|---|---|---|
| 17342 | 15/07/2023 | 3 x Red T-Shirt, 2 x Blue Jeans | John Doe | 153 Crown St, Reading |
| 17343 | 21/07/2023 | 5 x Black Socks | Sarah Brown | 26 West Rd, Bracknell |
| 17344 | 23/07/2023 | 2 x Blue Jumper, 1 x White Trainers, 1 x Maroon Tracksuit | John Doe | 153 Crown St, Reading |
| 17345 | 01/08/2023 | 3 x Orange Cap | Richard Jones | 36 Squirrels Way, Earley |
| 17346 | 01/08/2023 | 1 x Green Hoodie | Richard Jones | 36 Squirrels Way, Earley |
The first thing we do is make sure all data is atomic (contains only one value). Here, the
Products field is not atomic, so we must create another table for it. Also, the
CustomerName and CustomerAddress fields can be decomposed further as well.
Orders
| OrderID | OrderDate | CustomerForename | CustomerSurname | AddressNumber | AddressRoad | AddressTown |
|---|---|---|---|---|---|---|
| 17342 | 15/07/2023 | John | Doe | 153 | Crown St | Reading |
| 17343 | 21/07/2023 | Sarah | Brown | 26 | West Rd | Bracknell |
| 17344 | 23/07/2023 | John | Doe | 153 | Crown St | Reading |
| 17345 | 01/08/2023 | Richard | Jones | 36 | Squirrels Way | Earley |
| 17346 | 01/08/2023 | Richard | Jones | 36 | Squirrels Way | Earley |
ProductInOrder
| OrderID | ProductType | ProductColour | Quantity |
|---|---|---|---|
| 17342 | T-Shirt | Red | 3 |
| 17342 | Jeans | Blue | 2 |
| 17343 | Socks | Black | 5 |
| 17344 | Jumper | Blue | 2 |
| 17344 | Trainers | White | 1 |
| 17344 | Tracksuit | Maroon | 1 |
| 17345 | Cap | Orange | 3 |
| 17346 | Hoodie | Green | 1 |
However, we still have a problem with possible data duplication in the Orders table,
with all of the Customer fields. These belong in their own table.
Orders
| OrderID | OrderDate | CustomerID |
|---|---|---|
| 17342 | 15/07/2023 | 7086 |
| 17343 | 21/07/2023 | 6227 |
| 17344 | 23/07/2023 | 7086 |
| 17345 | 01/08/2023 | 7139 |
| 17346 | 01/08/2023 | 7139 |
ProductInOrder
| OrderID | ProductType | ProductColour | Quantity |
|---|---|---|---|
| 17342 | T-Shirt | Red | 3 |
| 17342 | Jeans | Blue | 2 |
| 17343 | Socks | Black | 5 |
| 17344 | Jumper | Blue | 2 |
| 17344 | Trainers | White | 1 |
| 17344 | Tracksuit | Maroon | 1 |
| 17345 | Cap | Orange | 3 |
| 17346 | Hoodie | Green | 1 |
Customer
| CustomerID | Forename | Surname | AddressNumber | AddressRoad | AddressTown |
|---|---|---|---|---|---|
| 6227 | Sarah | Brown | 26 | West Rd | Bracknell |
| 7086 | John | Doe | 153 | Crown St | Reading |
| 7139 | Richard | Jones | 36 | Squirrels Way | Earley |
We have one final problem. Ideally, the products would be in their own table as well.
Orders
| OrderID | OrderDate | CustomerID |
|---|---|---|
| 17342 | 15/07/2023 | 7086 |
| 17343 | 21/07/2023 | 6227 |
| 17344 | 23/07/2023 | 7086 |
| 17345 | 01/08/2023 | 7139 |
| 17346 | 01/08/2023 | 7139 |
ProductInOrder
| OrderID | ProductID | Quantity |
|---|---|---|
| 17342 | 23 | 3 |
| 17342 | 70 | 2 |
| 17343 | 81 | 5 |
| 17344 | 47 | 2 |
| 17344 | 88 | 1 |
| 17344 | 64 | 1 |
| 17345 | 95 | 3 |
| 17346 | 51 | 1 |
Customer
| CustomerID | Forename | Surname | AddressNumber | AddressRoad | AddressTown |
|---|---|---|---|---|---|
| 6227 | Sarah | Brown | 26 | West Rd | Bracknell |
| 7086 | John | Doe | 153 | Crown St | Reading |
| 7139 | Richard | Jones | 36 | Squirrels Way | Earley |
Product
| ProductID | ProductType | Colour |
|---|---|---|
| 23 | T-Shirt | Red |
| 47 | Jumper | Blue |
| 51 | Hoodie | Green |
| 64 | Tracksuit | Maroon |
| 70 | Jeans | Blue |
| 81 | Socks | Black |
| 88 | Trainers | White |
| 95 | Cap | Orange |
The database is now fully normalised. To verify this, we check if the following statement is true: "all non-key attributes in every table depend only upon the key, the whole key, and nothing but the key".
A data anomaly is an inconsistency in a database. How does normalisation help to avoid data anomalies?
If a database is normalised, each piece of data should be stored in only one place, and therefore only needs to be updated in one place if required. In denormalised databases, data needs to be updated using complex queries, which could lead to data anomalies if not done correctly.