Normalization of Database

Purpose

Normalization is defined as the process of efficiently organizing data in a database. There are ultimately two goals of the normalization process. The first is to eliminate redundant data. Redundant data is defined as storing the same data in more than one table. The second is to ensure that data dependencies make sense by having only related data stored in the same table. Both of these goals are important since they reduce the amount of space a database consumes and ensures that data is logically stored.

The Normal Forms

Formal guidelines have been established by the database community to ensure that databases are normalized. These guidelines are referred to as “normal forms” and are numbered from one (First Normal Form (1NF) or the lowest form of normalization, through five (Fifth Normal Form (5NF) which is the highest form of normalization. In many applications, you’ll often see
1NF,
2NF, and 3NF
along with an occasional 4NF. Fifth Normal Form is seen very rarely and therefore will not be further discussed.

It is extremely important to mention that these are only guidelines and there are always exceptions to the rule. Occasionally, it may become necessary to stray from the guidelines in order to meet practical business requirements. However, when variations take place, it’s extremely important to evaluate any possible effect they could have on your system and account for possible data inconsistencies.

First Normal Form (1NF)

First Normal Form (1NF) sets the very basic rules for an organized database as follows:

  • Eliminate duplicate columns from the same table.
  • Create separate tables for each group of related data and identify each row by using a unique column or set of columns (i.e., primary key).

Second Normal Form (2NF)

Second Normal Form (2NF) further addresses the concept of removing duplicate data as follows:

  • Meet all the requirements of the First Normal Form.
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys

Third Normal Form (3NF)

Third normal form (3NF) goes one step further as follows:

  • Meet all the requirements of the Second Normal Form.
  • Remove columns that are not dependent upon the primary key.

First Normal Form (1NF)

First Normal Form (1NF) sets the very basic rules for an organized database as follows:

  • Eliminate duplicate columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column or set of columns (i.e., primary key).

The first rule requires that data must not be duplicated within the same row of a table. This concept is often referred to as the atomicity of a table and tables that comply with this rule are said to be atomic. For example, let’s closely examine a table within a human resources database that stores the manager-employee relationship. Let’s assume the business rule that each manager may have one or more employees while each employee may have only one manager.

When we are creating a list or spreadsheet to track this information, we might create a table with the following fields:

  • Manager
  • Employee1
  • Employee2
  • Employee3
  • Employee4

However, recall that the first rule imposed by 1NF is to eliminate duplicate columns from the same table. Clearly, the Employee1-Employee4 columns are duplicates. Take a moment and think about the potential problems raised by this scenario. If a manager only has one employee – the Employee2-Employee4 columns are simply wasted storage space (a precious database commodity). Furthermore, imagine the case where a manager already has 4 employees – what happens if he or she takes on another employee? The whole table structure would require modification.

It is at this point that a second idea usually occurs to database novices: We don’t want to have more than one column and we want to allow for a flexible amount of data storage.

We could try something like this:

  • Manager
  • Employees (Employees field contains multiple entries in the form “Mary, Bill, Joe”, for example.)

This solution is closer, but it is not quite correct. The Employees column is still duplicative and non-atomic. What happens when we need to add or remove an employee? We need to read and write the entire contents of the table. That’s not a big deal in this situation, but what if one manager had over a hundred employees? Also, it complicates the process of selecting data from the database in future queries.

The following is a table that correctly satisfies the first rule of 1NF:

  • Manager
  • Employee

In this case, each employee has a single entry, but managers may have multiple entries.

We must also identify each row with a unique column or set of columns (the primary key). You might take a look at the table above and suggest the use of the employee column as a primary key. In fact, the employee column is a good candidate for a primary key due to the fact that our business rules specified that each employee may have only one manager. However, the data that we’ve chosen to store in our table makes this a less than ideal solution. What happens if we hire another employee named Jim? How do we store his manager-employee relationship in the database?

It’s best to use a truly unique identifier (such as an employee ID) as a primary key.

Our final table would look like this:

  • Manager ID
  • Employee ID

Second Normal Form (2NF)

Second Normal Form (2NF) further addresses the concept of removing duplicate data as follows:

  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.

These rules can be summarized in a simple statement as follows:

2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s) and creating relationships between those tables.

For example, imagine a company that maintains customer information in a database. They might have a single table called Customers with the following elements:

  • CustNum
  • FirstName
  • LastName
  • Address
  • City
  • State
  • ZIP
  • Phone
  • E-mail

Let’s assume that we are storing the “Chicago, IL 60604” and “Orland Park, IL 60462” entries twice each. This may not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Chicago were to change, we’d need to make that change in many places throughout the database.

In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let’s call it ZIPs) might have the following fields:

  • ZIP
  • City
  • State

If we want to be super-efficient, we can even fill this table in advance — the post office provides a directory of all valid ZIP codes and their city/state relationships. Surely, you’ve encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then knew the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency.

Now that we’ve removed the duplicate data from the Customers table, we’ve satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We’ll use the ZIP code (the primary key from the ZIPs table) to create that relationship.

Here’s our new Customers table:

  • CustNum
  • FirstName
  • LastName
  • Address
  • ZIP
  • Phone
  • E-mail

We’ve now minimized the amount of redundant information stored within the database and our structure is in Second Normal Form!

Third Normal Form (3NF)

There are two basic requirements for a database to be in Third Normal Form:

The requirements of both 1NF and 2NF have been met
Remove columns that are not fully dependent upon the primary key.

Let’s assume that we have a table of orders that contains the following attributes:

  • Order Number
  • Customer Number
  • Unit Price
  • Quantity
  • Total

Our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicate columns? No. Is there a primary key? Yes, the order number. Therefore, the requirements of 1NF have been satisfied. Are there any subsets of data that apply to multiple rows? No, so the requirements of 2NF have also been satisfied.

Are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn’t appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we’re OK there.

What about the total? It looks like there might be some trouble here. The total can be derived by multiplying the unit price by the quantity; therefore it’s not fully dependent upon the primary key. It must be removed from the table to comply with the third normal form.

We will use the following attributes:

  • Order Number
  • Customer Number
  • Unit Price
  • Quantity

Now our table is in 3NF. But, you might ask, what about the total? This is a derived field and its best not to store it in the database at all. We can simply compute it “on the fly” when performing database queries. For example, we might have previously used this query to retrieve order numbers and totals:

SELECT OrderNumber, Total
FROM Orders

We can now use the following query:

SELECT OrderNumber, UnitPrice * Quantity AS Total
FROM Orders

to achieve the same results without violating normalization rules.

Leave a Reply

Your email address will not be published. Required fields are marked *