What is Normal?
Feb
24
Written by:
Charles Flock
2/24/2009 7:41 PM
One of the big problems that you may have in moving from the spreadsheet paradigm to the database paradigm is in ‘normalizing’ your data. In this article, I will attempt to explain what that means.
If you are like me and have been using spreadsheets since the days of Visicalc (and Lotus 1-2-3 and QuattroPro), it is quite normal to visualize data in terms of rows (horizontal slices of data) and columns (vertical slices of data). Just imagine financial data (or if you can’t, go to some financial site like Moneycentral or the Wall Street Journal). In some cases, you will find the column headings to be a date, usually representing the end of a reporting period, and the rows will be some piece of financial information to be reported on; sales (or revenue), cost of sales, operating expenses, net income, etc. Here’s a fairly standard representation of a cash flow statement:
|
2008
|
2007
|
2006
|
2005
|
2004
|
Period End Date
|
06/30/2008
|
06/30/2007
|
06/30/2006
|
06/30/2005
|
06/30/2004
|
Period Length
|
12 Months
|
12 Months
|
12 Months
|
12 Months
|
12 Months
|
Stmt Source
|
10-K
|
10-K
|
10-K
|
10-K
|
10-K
|
Stmt Source Date
|
07/31/2008
|
08/03/2007
|
08/25/2006
|
08/26/2005
|
08/26/2005
|
Stmt Update Type
|
Updated
|
Updated
|
Updated
|
Updated
|
Restated
|
|
|
|
|
|
|
Net Income/Starting Line
|
17,681.0
|
14,065.0
|
12,599.0
|
12,254.0
|
8,168.0
|
Depreciation/Depletion
|
2,056.0
|
1,440.0
|
903.0
|
855.0
|
1,186.0
|
Amortization
|
0.0
|
0.0
|
0.0
|
0.0
|
0.0
|
Deferred Taxes
|
935.0
|
421.0
|
219.0
|
-179.0
|
-1,479.0
|
Non-Cash Items
|
787.0
|
1,181.0
|
1,356.0
|
2,589.0
|
5,538.0
|
Unusual Items
|
Other Non-Cash Items
|
|
Changes in Working Capital
|
153.0
|
689.0
|
-673.0
|
1,086.0
|
1,213.0
|
Accounts Receivable
|
Other Assets
|
Other Liabilities
|
|
Cash from Operating Activities
|
21,612.0
|
17,796.0
|
14,404.0
|
16,605.0
|
14,626.0
|
|
|
|
|
|
|
Capital Expenditures
|
-3,182.0
|
-2,264.0
|
-1,578.0
|
-812.0
|
-1,109.0
|
|
Other Investing Cash Flow Items, Total
|
-1,405.0
|
8,353.0
|
9,581.0
|
15,839.0
|
-2,233.0
|
Acquisition of Business
|
Sale/Maturity of Investment
|
Purchase of Investments
|
Other Investing Cash Flow
|
|
Cash from Investing Activities
|
-4,587.0
|
6,089.0
|
8,003.0
|
15,027.0
|
-3,342.0
|
|
|
|
|
|
|
Financing Cash Flow Items
|
120.0
|
54.0
|
89.0
|
-18.0
|
0.0
|
Other Financing Cash Flow
|
|
Total Cash Dividends Paid
|
-4,015.0
|
-3,805.0
|
-3,545.0
|
-36,112.0
|
-1,729.0
|
Issuance (Retirement) of Stock, Net
|
-9,039.0
|
-20,793.0
|
-17,106.0
|
-4,948.0
|
-635.0
|
Issuance (Retirement) of Debt, Net
|
0.0
|
0.0
|
0.0
|
0.0
|
0.0
|
Cash from Financing Activities
|
-12,934.0
|
-24,544.0
|
-20,562.0
|
-41,078.0
|
-2,364.0
|
|
|
|
|
|
|
Foreign Exchange Effects
|
137.0
|
56.0
|
18.0
|
-7.0
|
27.0
|
Net Change in Cash
|
4,228.0
|
-603.0
|
1,863.0
|
-9,453.0
|
8,947.0
|
|
|
|
|
|
|
|
|
|
|
|
|
Net Cash - Beginning Balance
|
6,111.0
|
6,714.0
|
4,851.0
|
14,304.0
|
5,357.0
|
Net Cash - Ending Balance
|
10,339.0
|
6,111.0
|
6,714.0
|
4,851.0
|
14,304.0
|
It turns out, though, from a data base perspective, this starts to present some serious challenges. To overcome those challenges, the concept of normalization was developed.
The simplest explanation of normalization that I can offer is the following. A table is normalized when you can find the data that you want using the key, the whole key, and nothing but the key. If you use this simple formula, I can guarantee that your table will be sufficiently normalized.
In this case, if we want to put this data in database to do some further analysis, its structure presents us with some problems. Why put it in a database? What if I want to compare 20 companies? What about 200 or 2,000 or 20,000? At some point, it just becomes impractical to try and manage that kind of volume in a spreadsheet.
The first thing that we have to realize is that we want to represent all companies in the same way on the database. This allows us to calculate and compare information in a wide variety of ways, with the assurance that the data has been consistently rendered.
Secondly, we then want to store data in such a way, that we will not have to change the database structure as new data is provided. In the above example, when the 2009 results are recorded, we need add another column to store those results. While you can do this in a spreadsheet, doing it in a database is problematic; none of your existing queries, views, stored procedures, etc. will recognize this column, or, perhaps the column will turn up in unexpected places, if you have used ‘SELECT *’ anywhere.
Let’s get right to the issue, then, of how to normalize this data. First, as I pointed out previously, we need to store information about the company. We could do that any number of ways, by using the company name, by using a number, or by using a symbol of some type. It is important, though, that whatever you pick uniquely identifies the company, as you would not want to mix company results together. Just to keep things simple, we will use the company’s stock symbol, though in real life, it might make more sense to use a number and associate the number with a symbol or the company name.
Next, look at the column headings and see if there is any way to generalize the headings. To me, it seems straightforward that the columns represent fiscal years.
Now I look at the rows. It seems to me that there are 2 distinct types of data. At the top of the table, there is some information about the FY, consisting of Period End Date, Period Length, Stmt Source, and Stmt Update Type. Then there is actual financial data, which consist of a description of the financial activity (net income, depreciation, changes in working capital, etc.). For purposes of this exercise, let’s focus on the financial data.
I think that at this point we can safely say that we can uniquely the financial information that we are interested in as consisting of the symbol, fiscal year, description of the financial activity, and the amount. We can then implement a simple table which consists of 4 columns:
o SYMBOL
o FY
o FINANCIAL ACTIVITY
o AMOUNT
with first three fields being the key. This means that these fields uniquely identify the remaining field in this table; the key, the whole key and nothing but the key.
Our data would then go into the database looking like this (and I will only do the Cash from Operating Activities):
SYMBOL
|
FY
|
FINANCIAL ACTIVITY
|
AMOUNT
|
XYZ
|
2008
|
Net Income
|
17681.00
|
XYZ
|
2007
|
Net Income
|
14065.00
|
XYZ
|
2006
|
Net Income
|
12599.00
|
XYZ
|
2005
|
Net Income
|
12254.00
|
XYZ
|
2004
|
Net Income
|
8168.00
|
XYZ
|
2008
|
Depreciation
|
2056.00
|
XYZ
|
2007
|
Depreciation
|
1440.00
|
XYZ
|
2006
|
Depreciation
|
903.00
|
XYZ
|
2005
|
Depreciation
|
855.00
|
XYZ
|
2004
|
Depreciation
|
1186.00
|
XYZ
|
2008
|
Amortization
|
0.00
|
XYZ
|
2007
|
Amortization
|
0.00
|
XYZ
|
2006
|
Amortization
|
0.00
|
XYZ
|
2005
|
Amortization
|
0.00
|
XYZ
|
2004
|
Amortization
|
0.00
|
XYZ
|
2008
|
Deferred Taxes
|
935.00
|
XYZ
|
2007
|
Deferred Taxes
|
421.00
|
XYZ
|
2006
|
Deferred Taxes
|
219.00
|
XYZ
|
2005
|
Deferred Taxes
|
-179.00
|
XYZ
|
2004
|
Deferred Taxes
|
-1479.00
|
XYZ
|
2008
|
Non-Cash Items
|
787.00
|
XYZ
|
2007
|
Non-Cash Items
|
1181.00
|
XYZ
|
2006
|
Non-Cash Items
|
1356.00
|
XYZ
|
2005
|
Non-Cash Items
|
2589.00
|
XYZ
|
2004
|
Non-Cash Items
|
5538.00
|
XYZ
|
2008
|
Changes in Working Capital
|
153.00
|
XYZ
|
2007
|
Changes in Working Capital
|
689.00
|
XYZ
|
2006
|
Changes in Working Capital
|
-673.00
|
XYZ
|
2005
|
Changes in Working Capital
|
1086.00
|
XYZ
|
2004
|
Changes in Working Capital
|
1213.00
|
When data is ready for 2009 or beyond, there is no need to change the layout of the table. I can have as money companies as I want, for as many fiscal years as I want. If I want to compare results across many companies, it becomes very straightforward and easy. It now becomes possible, for example, to use the FORECAST function on the database to predict 2009 Revenue (for example), for all companies, selected companies, any combination of companies that you might think of, and get the result in less than a second.
Take the time to normalize when you move your data from your spreadsheet and into the database. It will pay big dividends in terms of consistency, reliability, and speed of processing.