Archive for April, 2008

UNPIVOT: Normalizing data on the fly

Everyone seems to want to “pivot” or “cross tab” data, but knowing how to do the opposite is equally important.  In fact, I would argue that the skill of “unpivoting” data is more useful and more important and more relevant to a SQL programmer, since pivoting results in denormalized data, while unpivoting can transform non-normalized data into a normalized result set.  We all know that there’s lots of bad databases designs out there, so this can be a handy technique to know. 

Of course, even a well designed, fully normalized database can still benefit from “unpivoting” from time to time, so let’s take a look at some common situations and some of the options we have to handle this at our disposal.  We will focus on some traditional SQL techniques to do this, and then take a close look at the UNPIVOT operator that was introduced with SQL Server 2005.

Example #1:  A Bad database design

Let’s start with a commonly bad table design, in which someone has decided to relate a client to multiple contacts by designing their client table like this:

create table Clients
(  
    clientID int primary key,
    clientName varchar(100),
    contact1 int,
    contact2 int,
    contact3 int,
    contact4 int
)




insert into Clients

select 1,’ABC Corp’,1,34,2,null union all

select 2,’DEF Foundation’,6,2,8,9 union all

select 3,’GHI Inc.’,5,9,null,null union all

select 4,’XYZ Industries’,24,null,6,null



clientID    clientName           contact1    contact2    contact3    contact4

———– ——————– ———– ———– ———– ———–

1           ABC Corp             1           34          2           NULL

2           DEF Foundation       6           2           8           9

3           GHI Inc.             5           9           NULL        NULL

4           XYZ Industries       24          NULL        6           NULL



(4 row(s) affected)

(Note: For brevity, I am not including the contact table here, nor the foreign key constraints.  Of course, with this table design, it would probably be pretty unlikely to find such constraints in the database anyway)


With this design, it is not very easy or efficient to get a count of all contacts for each client, or to find out which contacts are related to which clients.   One thing we can do, however is to “unpivot” this table in a query that returns 1 row per ClientID/ContactID combination.  With that result set, we can easily now reference the table as if it were normalized and we can get the information we need.

One way to do is to use UNION ALL to return each row in the clients table 4 times, and each time return a different contactID column:

select clientID, contact1 as ContactID

from clients

where contact1 is not null

union all

select clientID, contact2 as ContactID

from clients

where contact2 is not null

union all

select clientID, contact3 as ContactID

from clients

where contact3 is not null

union all

select clientID, contact4 as ContactID

from clients

where contact4 is not null



clientID    ContactID

———– ———–

1           1

2           6

3           5

4           24

1           34

2           2

3           9

1           2

2           8

4           6

2           9



(11 row(s) affected)


Another option is to CROSS JOIN the Clients table with a table or resultset that returns 4 rows, which also effectively returns each row in the clients table 4 times.  For each of the 4 values in the table we are cross joining, we grab a different contact column:

select *

from

(

    select c.clientID,

       case n.n when 1 then c.contact1

            when 2 then c.contact2

            when 3 then c.contact3

            when 4 then c.contact4 end as ContactID

    from

        clients c

    cross join

        (select 1 as n union all select 2 union all select 3 union all select 4) n

)   

    x

where

    x.ContactID is not null



clientID    ContactID

———– ———–

1           1

1           34

1           2

2           6

2           2

2           8

2           9

3           5

3           9

4           24

4           6



(11 row(s) affected)


(Note that you can use a permanent table of Numbers in your database instead of generating it on the fly with a UNION, as shown)

Finally, however, there is an even eaiser way to handle this: the UNPIVOT operator, new with SQL 2005.  UNPIVOT works very efficiently and really allows you to handle this exact situation quite easily:

select clientID, Contact.ContactID

from clients

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



clientID    ContactID

———– ———–

1           1

1           34

1           2

2           6

2           2

2           8

2           9

3           5

3           9

4           24

4           6



(11 row(s) affected)

Much shorter to write, and more efficient to execute as well. 

Taking a Closer Look at UNPIVOT

The UNPIVOT operator is tricky to get a feel for, however, so let’s take a look at it.

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



First, the “As Contact” at the end is just labeling the entire unpivot result set with an alias, just as you must alias a derived table.  Each column returned by the pivot operator can be referenced by the alias if necessary.

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact

The “ContactID for” part says that we want to return a column called “ContactID” for each unpivoted row.  The IN() list is the columns that we are unpivoting; the values in the 4 columns listed here will be assigned to the ContactID column in the result.  So, the first time a particular row is unpivoted, the value of the ‘contact1′ column is assigned to ContactID, the next time it is the ‘contact2′ column, then ‘contact3′, and then finally ‘contact4′.  Then, the next row is processed and it all begins again. 

Thus, because we are unpivoting 4 values, we know that the result of the unpivot will have 4 times as many rows as the source data.

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact

UNPIVOT returns an additional column as well, which contains the name of the column that was used to produce each unpivoted row.   Here, we have specified that to be called ContactNumber.  Note that we actually did not return ContactNumber in our example, be we can easily add that in so you can see how it works:

select clientID, Contact.ContactNumber, Contact.ContactID

from clients

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



clientID    ContactNumber           ContactID

———– ———————– ———–

1           contact1                1

1           contact2                34

1           contact3                2

2           contact1                6

2           contact2                2

2           contact3                8

2           contact4                9

3           contact1                5

3           contact2                9

4           contact1                24

4           contact3                6



(11 row(s) affected)

So, you can see that the code to write is very short, but a little difficult to grasp at first.  In the end, though, we are able to take a bad table design and easily “fix it”, at least temporarily, so that we can query it using simple and standard SQL statements to get what we need.

Example #2:  Normalizing a Transaction Table

Here’s another common example:

create table Transactions
(


    TranDate datetime,

    Account varchar(10),

    BudgetAmount money,

    ActualAmount money,

    ProjectionAmount money,

    primary key (TranDate, Account)

)

go

insert into Transactions

select ‘2008-01-01′,’0001′,354,65,58 union all

select ‘2008-01-02′,’0001′,14,65,34 union all

select ‘2008-01-03′,’0001′,0,65,622 union all

select ‘2008-01-04′,’0001′,9,32,84

go



TranDate                Account    BudgetAmount          ActualAmount          ProjectionAmount

———————– ———- ——————— ——————— ———————

2008-01-01 00:00:00.000 0001       354.00                65.00                 58.00

2008-01-02 00:00:00.000 0001       14.00                 65.00                 34.00

2008-01-03 00:00:00.000 0001       0.00                  65.00                 622.00

2008-01-04 00:00:00.000 0001       9.00                  32.00                 84.00



(4 row(s) affected)

Notice that we have different columns for Budget, Actual and Projection, which is not really a great database design.  Much better would be to break this data out so that we have a single ‘Amount’ column and a ‘TransactionType’ column that specifies the type of each transaction.  We can transform our Transactions tables into this format using UNION ALL:

select TranDate, Account, ‘BudgetAmount’ as Type, BudgetAmount as Amount from transactions

union all

select TranDate, Account, ‘ActualAmount’ as Type, ActualAmount as Amount from transactions

union all

select TranDate, Account, ‘ProjectionAmount’ as Type, ProjectionAmount as Amount from transactions



TranDate                Account    Type             Amount

———————– ———- —————- ———————

2008-01-01 00:00:00.000 0001       BudgetAmount     354.00

2008-01-02 00:00:00.000 0001       BudgetAmount     14.00

2008-01-03 00:00:00.000 0001       BudgetAmount     0.00

2008-01-04 00:00:00.000 0001       BudgetAmount     9.00

2008-01-01 00:00:00.000 0001       ActualAmount     65.00

2008-01-02 00:00:00.000 0001       ActualAmount     65.00

2008-01-03 00:00:00.000 0001       ActualAmount     65.00

2008-01-04 00:00:00.000 0001       ActualAmount     32.00

2008-01-01 00:00:00.000 0001       ProjectionAmount 58.00

2008-01-02 00:00:00.000 0001       ProjectionAmount 34.00

2008-01-03 00:00:00.000 0001       ProjectionAmount 622.00

2008-01-04 00:00:00.000 0001       ProjectionAmount 84.00



(12 row(s) affected)

Or, we can use the UNPIVOT operator to do the same much easier:

select TranDate, Account, Type, Amount

from Transactions

unpivot (Amount for Type in (BudgetAmount, ActualAmount, ProjectionAmount)) as Amount


TranDate                Account   Type                     Amount

———————– ——— ———————— ———————

2008-01-01 00:00:00.000 0001      BudgetAmount             354.00

2008-01-01 00:00:00.000 0001      ActualAmount             65.00

2008-01-01 00:00:00.000 0001      ProjectionAmount         58.00

2008-01-02 00:00:00.000 0001      BudgetAmount             14.00

2008-01-02 00:00:00.000 0001      ActualAmount             65.00

2008-01-02 00:00:00.000 0001      ProjectionAmount         34.00

2008-01-03 00:00:00.000 0001      BudgetAmount             0.00

2008-01-03 00:00:00.000 0001      ActualAmount             65.00

2008-01-03 00:00:00.000 0001      ProjectionAmount         622.00

2008-01-04 00:00:00.000 0001      BudgetAmount             9.00

2008-01-04 00:00:00.000 0001      ActualAmount             32.00

2008-01-04 00:00:00.000 0001      ProjectionAmount         84.00



(12 row(s) affected)

Example #3: “Unsummarizing” Data

For those who work with accounting systems, this example may be familiar to you.   Many times, Accounting systems have “summary” tables that roll up transactional data into a structure like this:

create table AccountBalances

(

    CompanyID int,

    AccountID int,

    TransactionTypeID int,

    Year int,

    Period1 money,

    Period2 money,

    Period3 money,

    Period4 money,

    Period5 money,

    Period6 money,

    Period7 money,

    Period8 money,

    Period9 money,

    Period10 money,

    Period11 money,

    Period12 money

)



insert into AccountBalances

select 1,1,1,2008,200,300,400,500,400,0,0,0,0,0,0,0 union all

select 1,2,1,2008,100,100,100,100,100,100,100,0,0,0,0,0 union all

select 1,3,1,2008,150,0,50,10,10,200,400,45,0,0,0,0

(As before, let’s not worry about those foreign key constraints)

These tables are often calculated when transactions are posted, or periods are closed.   Typically, many reports pull from these tables because it is much more efficient than summarizing thousands or millions of transactions, and the data is already “cross-tabbed” the way most reporting tools would like to display it.

We can take this summarized data and “unpivot” it so we can still access the summarized data, but now it will be in a normalized structure.  All it takes is a simple UNPIVOT like this:

select
       CompanyID,
       AccountID,
       TransactionTypeID,
       Year,
       substring(Period,7,2) as PeriodNo,
       Amount


from
       AccountBalances


unpivot
  (Amount for Period in (Period1,Period2,Period3,Period4,Period5,Period6,
                         Period7,Period8,Period9,Period10,Period11,Period12)
   ) as Amount




CompanyID   AccountID   TransactionTypeID Year        PeriodNo Amount

———– ———– —————– ———– ——– ———————

1           1           1                 2008        1        200.00

1           1           1                 2008        2        300.00

1           1           1                 2008        3        400.00

1           1           1                 2008        4        500.00

1           1           1                 2008        5        400.00

1           1           1                 2008        6        0.00

1           1           1                 2008        7        0.00

1           1           1                 2008        8        0.00

1           1           1                 2008        9        0.00

1           1           1                 2008        10       0.00

1           1           1                 2008        11       0.00

1           1           1                 2008        12       0.00

1           2           1                 2008        1        100.00

1           2           1                 2008        2        100.00

1           2           1                 2008        3        100.00

1           2           1                 2008        4        100.00

1           2           1                 2008        5        100.00

1           2           1                 2008        6        100.00

1           2           1                 2008        7        100.00

1           2           1                 2008        8        0.00

1           2           1                 2008        9        0.00

1           2           1                 2008        10       0.00

1           2           1                 2008        11       0.00

1           2           1                 2008        12       0.00

1           3           1                 2008        1        150.00

1           3           1                 2008        2        0.00

1           3           1                 2008        3        50.00

1           3           1                 2008        4        10.00

1           3           1                 2008        5        10.00

1           3           1                 2008        6        200.00

1           3           1                 2008        7        400.00

1           3           1                 2008        8        45.00

1           3           1                 2008        9        0.00

1           3           1                 2008        10       0.00

1           3           1                 2008        11       0.00

1           3           1                 2008        12       0.00



(36 row(s) affected)

We can filter so that only periods with a non-zero amount are included, and we can SELECT FROM this result set and get the exact data we need for whatever date range we want without worrying which column the actual data is in.

This can also be done with a CROSS JOIN or a UNION ALL, but with 12 values to pivot, those options become much longer to write and UNPIVOT appears to be the way to go in this case.

Example #4: Multiple unpivots

Finally, let’s consider a more complicated example.  Here, we have a table that stores games played between two teams, where one team is the HomeTeam and the other is the AwayTeam:

create table Teams

(

    TeamCode char(3) primary key not null,

    TeamName varchar(100) not null

)



create table Games

(

    GameDate datetime,

    HomeTeam char(3) references Teams(TeamCode),

    AwayTeam char(3) references Teams(TeamCode),

    HomeScore int,

    AwayScore int,

    primary key (GameDate, HomeTeam),

    constraint pk2 unique (GameDate, AwayTeam),

    check (HomeTeam <> AwayTeam)

)



insert into Teams

select ‘BOS’,'Boston Red Sox’ union all

select ‘NYY’,'New York Yankees’



insert into Games

select ‘2008-04-01′,’BOS’,'NYY’,3,1 union all

select ‘2008-04-02′,’BOS’,'NYY’,6,4 union all

select ‘2008-04-03′,’BOS’,'NYY’,2,3 union all

select ‘2008-04-08′,’NYY’,'BOS’,6,0 union all

select ‘2008-04-09′,’NYY’,'BOS’,2,6 union all

select ‘2008-04-10′,’NYY’,'BOS’,1,10



GameDate                HomeTeam AwayTeam HomeScore   AwayScore

———————– ——– ——– ———– ———–

2008-04-01 00:00:00.000 BOS      NYY      3           1

2008-04-02 00:00:00.000 BOS      NYY      6           4

2008-04-03 00:00:00.000 BOS      NYY      2           3

2008-04-08 00:00:00.000 NYY      BOS      6           0

2008-04-09 00:00:00.000 NYY      BOS      2           6

2008-04-10 00:00:00.000 NYY      BOS      1           10



(6 row(s) affected)

That may or may not be the best possible design for the Games table, but it is a common way to do it.  With data in that form, if we want to get the total runs scored per team across all games, or get each teams won-loss record, we need to make two passes through the table. This can be done fairly easily and efficiently with a union:

select GameDate,

    HomeTeam as TeamCode,

    ‘Home’ as HomeOrAway,

    HomeScore as Score,

    case when HomeScore > AwayScore then 1 else 0 end as Win,

    case when HomeScore < AwayScore then 1 else 0 end as Loss,

    case when HomeScore = AwayScore then 1 else 0 end as Tie

from

    Games

union all

select GameDate,

    AwayTeam as TeamCode,

    ‘Away’ as HomeOrAway,

    AwayScore as Score,

    case when HomeScore < AwayScore then 1 else 0 end as Win,

    case when HomeScore > AwayScore then 1 else 0 end as Loss,

    case when HomeScore = AwayScore then 1 else 0 end as Tie

from

    Games

   

GameDate                TeamCode HomeOrAway Score       Win         Loss        Tie

———————– ——– ———- ———– ———– ———– —–

2008-04-01 00:00:00.000 BOS      Home       3           1           0           0

2008-04-02 00:00:00.000 BOS      Home       6           1           0           0

2008-04-03 00:00:00.000 BOS      Home       2           0           1           0

2008-04-08 00:00:00.000 NYY      Home       6           1           0           0

2008-04-09 00:00:00.000 NYY      Home       2           0           1           0

2008-04-10 00:00:00.000 NYY      Home       1           0           1           0

2008-04-01 00:00:00.000 NYY      Away       1           0           1           0

2008-04-02 00:00:00.000 NYY      Away       4           0           1           0

2008-04-03 00:00:00.000 NYY      Away       3           1           0           0

2008-04-08 00:00:00.000 BOS      Away       0           0           1           0

2008-04-09 00:00:00.000 BOS      Away       6           1           0           0

2008-04-10 00:00:00.000 BOS      Away       10          1           0           0



(12 row(s) affected)

Now, how can we do this with UNPIVOT?  This example is a bit more complicated, because we are “unpivoting” not only the score, but also the TeamCode, and we are also calculating a few extra columns (Win, Loss and Tie).

So, can we UNPIVOT more than 1 column?  Let’s start simply and do things one at a time.  First, let’s UNPIVOT just the TeamCode:

select GameDate, HomeOrAway, Team

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team



GameDate                HomeOrAway     Team

———————– ————– —-

2008-04-01 00:00:00.000 HomeTeam       BOS

2008-04-01 00:00:00.000 AwayTeam       NYY

2008-04-02 00:00:00.000 HomeTeam       BOS

2008-04-02 00:00:00.000 AwayTeam       NYY

2008-04-03 00:00:00.000 HomeTeam       BOS

2008-04-03 00:00:00.000 AwayTeam       NYY

2008-04-08 00:00:00.000 HomeTeam       NYY

2008-04-08 00:00:00.000 AwayTeam       BOS

2008-04-09 00:00:00.000 HomeTeam       NYY

2008-04-09 00:00:00.000 AwayTeam       BOS

2008-04-10 00:00:00.000 HomeTeam       NYY

2008-04-10 00:00:00.000 AwayTeam       BOS



(12 row(s) affected)

OK, so far so good.  Now, how do we get the score for the team as well?  Let’s add another UNPIVOT clause to the SELECT, this time for Score:

select GameDate, HomeOrAway, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway in (HomeScore, AwayScore)) as Score



Msg 265, Level 16, State 1, Line 1

The column name “HomeOrAway” specified in the UNPIVOT operator conflicts with the existing column name in the UNPIVOT argument.

Msg 8156, Level 16, State 1, Line 1

The column ‘HomeOrAway’ was specified multiple times for ‘Score’.

Hmmm. OK, we cannot specify the same column for both pivots.  That is just an alias specification, no big deal, so let’s just alias it as “HomeOrAway2″ :

select GameDate, HomeOrAway, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway2 in (HomeScore, AwayScore)) as Score



GameDate                HomeOrAway Team Score

———————– ———- —- ———–

2008-04-01 00:00:00.000 HomeTeam   BOS  3

2008-04-01 00:00:00.000 HomeTeam   BOS  1

2008-04-01 00:00:00.000 AwayTeam   NYY  3

2008-04-01 00:00:00.000 AwayTeam   NYY  1

2008-04-02 00:00:00.000 HomeTeam   BOS  6

2008-04-02 00:00:00.000 HomeTeam   BOS  4

2008-04-02 00:00:00.000 AwayTeam   NYY  6

2008-04-02 00:00:00.000 AwayTeam   NYY  4

2008-04-03 00:00:00.000 HomeTeam   BOS  2

2008-04-03 00:00:00.000 HomeTeam   BOS  3

2008-04-03 00:00:00.000 AwayTeam   NYY  2

2008-04-03 00:00:00.000 AwayTeam   NYY  3

2008-04-08 00:00:00.000 HomeTeam   NYY  6

2008-04-08 00:00:00.000 HomeTeam   NYY  0

2008-04-08 00:00:00.000 AwayTeam   BOS  6

2008-04-08 00:00:00.000 AwayTeam   BOS  0

2008-04-09 00:00:00.000 HomeTeam   NYY  2

2008-04-09 00:00:00.000 HomeTeam   NYY  6

2008-04-09 00:00:00.000 AwayTeam   BOS  2

2008-04-09 00:00:00.000 AwayTeam   BOS  6

2008-04-10 00:00:00.000 HomeTeam   NYY  1

2008-04-10 00:00:00.000 HomeTeam   NYY  10

2008-04-10 00:00:00.000 AwayTeam   BOS  1

2008-04-10 00:00:00.000 AwayTeam   BOS  10



(24 row(s) affected)

Holy schnikies, it works!  We can specify more than one UNPIVOT clause for the same SQL statement!  Who would have thunk it?

Uh oh — wait a second.  We have 24 results returned.  We should have only 12.  Something is not right here.  Remember when we aliased our second unpivot column as “HomeOrAway2″?  We did not return that anywhere in our results.  Let’s add that in and take a look:

select GameDate, HomeOrAway, HomeOrAway2, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway2 in (HomeScore, AwayScore)) as Score



GameDate                HomeOrAway HomeOrAway2 Team Score

———————– ———- ———– —- ———–

2008-04-01 00:00:00.000 HomeTeam   HomeScore   BOS  3

2008-04-01 00:00:00.000 HomeTeam   AwayScore   BOS  1

2008-04-01 00:00:00.000 AwayTeam   HomeScore   NYY  3

2008-04-01 00:00:00.000 AwayTeam   AwayScore   NYY  1

2008-04-02 00:00:00.000 HomeTeam   HomeScore   BOS  6

2008-04-02 00:00:00.000 HomeTeam   AwayScore   BOS  4

2008-04-02 00:00:00.000 AwayTeam   HomeScore   NYY  6

2008-04-02 00:00:00.000 AwayTeam   AwayScore   NYY  4

2008-04-03 00:00:00.000 HomeTeam   HomeScore   BOS  2

Minimize a DropDownList’s ViewState

Let’s say you have a very large DropDownList with lots of values and text.  We need to maintain ViewState in this DropDownList so that we can retrieve the selected value on a post back.   Of course, this means that now the ViewState contains the data for every single value in the list, both values and text included.   Even though the page itself may be fairly simple and lightweight, the result of having this simple DropDownList on the page is that the page size is quite large and the amount of data passed back and forth on a postback is very large as well.

If you have an efficient database, and/or if you are caching the data anyway, you might not mind re-loading the list items each time the page posts back to eliminate the need for the page itself to hold all of this ViewState data. 

However, if you turn off ViewState on the DropDownList, you will notice that it now does not remember the selected value on post backs.  The solution is very simple — just manually set the DropDownList after re-loading it to the value from the HTML form post.

To do this, first we disable ViewState on the DropDownList (EnableViewState=”False”).  Then, instead of a typical PageLoad() method like this:

 

        protected void Page_Load(object sender, EventArgs e)

        {

            if (!IsPostBack)

            {

                loadList();              

                // other stuff here

            }

        }

      
we would load the list every time the page posts back:

        protected void Page_Load(object sender, EventArgs e)

        {

            loadList();     

 

            if (!IsPostBack)

            {

                 // other stuff here

            }

        }

       
And, in that loadList() method, instead of simply doing this:

        void loadList()

        {

            dlTest.DataSource = // get data from DB

            dlTest.DataBind();

        }

       
you also would check to see if it has a posted value:

        void loadList()

        {

            dlTest.DataSource = // get data from DB

            dlTest.DataBind();

 

            string r = Request[dlTest.UniqueID];

            if (r != null)

                dlTest.SelectedValue = r;

        }

That loads the DropDownList each time and ensures that the value set is what was posted back.  Thus, the control will now “remember” the selected value on each postback, but without requiring any data stored in the ViewState at all. 

It would also be easy to implement this logic in a control that inherits from DropDownList; say, a LightDropDownList.  Techniques like this will of course work on other controls as well, such as a ListBox or a CheckedList.

With fast database connections and performance, as well as data caching, sometimes persisting data on the page is not the optimal way to go.  Like always, there are different ways to skin a cat and sometimes simple little tweaks like this can have huge benefits on performance.

Update:  See the comments for other options are that even easier/better.  Also in the comments, Richard has provided a link to this excellent post on ViewState that I highly recommend checking out.  It’s long, but very informative and quite entertaining as well.   Thanks for the great feedback.