
The MailBag — Super-Sized Edition! String Parsing, Crosstabs, SQL Injection, and more.

OK, boys and girls, it’s time for the mailbag!  There’s lots of stuff to cover, so let’s get to it!

Greg E writes:

Hello Jeff,



I just found your blog and wanted to know if you could point me in the right direction or possibly toss me a solution.



I am looking at a badly formed telephone number column in a MS SQL Server db. Entries contain ‘(555) 555-1212’ or ‘555.555.1212’, etc. Do you know how I would go about stripping out unwanted characters from the telephone number?



Thanks for the brain cycles.

Greg — A simple UDF should do the trick for you.  For example, something like this:

create function NumbersOnly(@txt varchar(1000))
returns varchar(1000)
as
begin
    declare @i int
    declare @ret varchar(1000)

    select @i = 1, @ret = ''

    -- walk the string one character at a time, keeping only the digits
    while (@i <= len(@txt))
        select @ret = @ret + case when substring(@txt,@i,1) like '[0-9]'
                                  then substring(@txt,@i,1) else ''
                             end,
               @i = @i + 1

    return @ret
end

With that, you can write something like this:

select ID, dbo.NumbersOnly(PhoneColumn) as PhoneNumbersOnly

from YourTable

Over at SQLTeam’s script library forum, there is a thread with a bunch of parsing functions that you may find useful if your needs are more complex.

And, in case you missed it, be sure to read this post.

In response to my blog post on passing arrays to stored procedures, Juan writes:

I know is not the right solution, but I have to say it for the sake of completeness of the discussion: if the amount of items in your “array parameter” is limited (say, for example 5 or 10 items), you can always use optional parameters (i.e. assign them to null when declaring them in the SP), then insert them in a temp table or do whatever you want with them, without using dynamic,nor xml, nor string manipulation.

Great point, something I missed in my article entirely.  Sometimes, it may make sense to declare @Val1, @Val2, … @ValN parameters if there aren’t too many and there’s a clearly defined limit.  Thanks for bringing that up, Juan.  The simplest solution is usually the best, and in some cases that’s probably all you need.  You still have clearly defined parameters with strong typing and no parsing, and those are the main issues with CSV parameters that I wanted to avoid.
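Here is a minimal sketch of what Juan describes, assuming a limit of five values.  The procedure name GetOrdersByIDs and the Orders table with its OrderID column are made up for illustration; they are not from the original article.

```sql
-- Hypothetical names: GetOrdersByIDs, Orders, OrderID.
create procedure GetOrdersByIDs
    @ID1 int = null,
    @ID2 int = null,
    @ID3 int = null,
    @ID4 int = null,
    @ID5 int = null
as
begin
    -- Collect the supplied values into a table variable, skipping unused slots.
    -- UNION removes duplicates, so the primary key cannot be violated.
    declare @IDs table (ID int primary key)

    insert into @IDs (ID)
    select ID
    from (
        select @ID1 as ID union
        select @ID2 union
        select @ID3 union
        select @ID4 union
        select @ID5
    ) v
    where ID is not null

    -- From here on, the "array" is just another table to join to.
    select o.*
    from Orders o
    inner join @IDs i on i.ID = o.OrderID
end
```

A caller passes only the values it has, for example: exec GetOrdersByIDs @ID1 = 10, @ID2 = 42.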

Marc writes:

We have three tables.  They all share the same “type” of primary key: let’s say ActivityCode.  I need to pull data using an ActivityCode, but there is a catch.  If table 1 has the data, I want to use it.  If table 2 has the data and Table 1 does not I want to use Table 2.  If table 3 has the data and Table 1 and Table 2 does not, I want to use that.  The ActivityCode can be found in both Table 1 and Table 2.  Once I determine which table i am using I will need to do several other inner and/or outer joins with other tables.  I am using JDBC.  I want to be able to do this using a single SQL statement, but I am willing to use multiple statements if it makes more sense.  I just need to keep it to a single transaction under JDBC.

Marc — I think what you are looking for is described here.  The key is to OUTER JOIN to all of your tables, and then use a CASE expression to determine which of those joined tables has the data you need.
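As a rough sketch of that idea: the table and column names below (Activities, Table1 through Table3, Value) are assumptions, since Marc did not give his actual schema, and the sketch assumes there is some table that lists every ActivityCode to drive the joins.

```sql
-- Hypothetical schema: Activities lists every ActivityCode; Table1..Table3
-- each have an ActivityCode key and a Value column.
declare @ActivityCode int
set @ActivityCode = 1

select a.ActivityCode,
       -- Take the value from the first table that has a matching row.
       case when t1.ActivityCode is not null then t1.Value
            when t2.ActivityCode is not null then t2.Value
            else t3.Value
       end as Value,
       -- Report which table the value came from.
       case when t1.ActivityCode is not null then 'Table1'
            when t2.ActivityCode is not null then 'Table2'
            when t3.ActivityCode is not null then 'Table3'
       end as SourceTable
from Activities a
left outer join Table1 t1 on t1.ActivityCode = a.ActivityCode
left outer join Table2 t2 on t2.ActivityCode = a.ActivityCode
left outer join Table3 t3 on t3.ActivityCode = a.ActivityCode
where a.ActivityCode = @ActivityCode
```

Because this is a single SELECT, it fits into one JDBC statement, and the additional joins to other tables can hang off the same query.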

Mary writes:

I struggled with a thorny SQL problem all day yesterday and found your post on set based thinking very helpful.  I needed to write an update query that updated a table with many records with the same key from a table with the key and the corresponding new value.  The table with new values didn’t exist - I had to derive it from a different table showing the key, new value and date (the new value changed over time.)



Your observations that one needs to break the problem down into its simplest components helped me realize something else.



I made the classic rookie error of grabbing some code that did a similar type of update and try to hack it into my solution.  When I finally realized I was going in the wrong direction (because my solution was getting messier and messier), I went back to the beginning.



I defined the problem in its simplest terms and learned I could do a simple “update  A set A.value = B.value from A join B on B.key = A.key” .



I didn’t realize I could update from but once the problem was simply defined a quick question to one of our senior engineers resulted in a quick answer leading to an elegant solution.  The whole thing was completed in less than half an hour.



The moral of the story:  Define the problem first!  Don’t even think about syntax until you have written a clear, concise spec from the problem just defined.  Then if you find yourself spending an inordinate amount of time and/or the solution seems too messy or seems to run too long - google or talk to your colleagues.



Thanks for a great blog; your post made me realize it’s more about how we think than throwing code at the problem - the code should be the last thing!

Thanks, Mary!  I’m glad I could be of assistance.  The “moral” that you wrote says it all.  90% of programming isn’t writing code at all; it is simply defining what your code will do, and that’s always the hardest part!
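For anyone curious what an update like Mary's might look like, here is a sketch.  The names A, History, KeyCol, NewValue and EffectiveDate are invented for illustration, and it assumes each key has at most one row per date in the history table.

```sql
-- Hypothetical names: A(KeyCol, Value) is the table to update;
-- History(KeyCol, NewValue, EffectiveDate) holds the values as they changed over time.
update A
set A.Value = B.NewValue
from A
inner join
(
    -- Derive one row per key: the value with the most recent date.
    select h.KeyCol, h.NewValue
    from History h
    inner join
    (
        select KeyCol, max(EffectiveDate) as LastDate
        from History
        group by KeyCol
    ) latest on latest.KeyCol = h.KeyCol and latest.LastDate = h.EffectiveDate
) B on B.KeyCol = A.KeyCol
```

The derived table B plays the role of the "table with the key and the corresponding new value" that did not exist on its own.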

In response to my post on grouping by month, Mark writes:

I’m so close! I’ve tried all the things in this article, but can’t seem to do what I want to do. I’ve been tearing my hair out for days! Here’s what I’m trying to do.

Basically I need a sql procedure that looks at an invoicing table that totals amounts by month/year and quarter at the same time. Here’s how my table looks:

Project ID  Date     Amount
1           3/11/08  10.00
1           4/18/08  10.00
1           6/22/08  10.00
2           3/01/08  10.00
2           9/15/08  10.00

I would like the output to have dynamic columns, so an output may look like:
Project ID  Jan’08  Feb’08  Mar’08  Q1’08  Apr’08  May’08  Jun’08  Q2’08  Jul’08  Aug’08  Sep’08  Q3’08
1           0.00    0.00    10.00   10.00  10.00   0.00    10.00   20.00  0.00    0.00    0.00    0.00
2           0.00    0.00    10.00   10.00  0.00    0.00    0.00    0.00   0.00    0.00    10.00   10.00

I would like the query to know if there was no value in Jan & Feb’08, but still list all the months in Q1.

I’m not opposed to using a calendar table, but would like to try to avoid it if possible.

Any help would be greatly appreciated!

Hi Mark — First off, never be afraid to use a calendar table!  There is nothing hacky or unusual or tricky about them, they can make your life much easier, your code much shorter, and everything much more efficient.  If grouping by month or some other time period is important to your reporting,  then defining those months in a permanent, nicely indexed table makes perfect sense.
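To make that concrete, here is a minimal sketch of a month-level calendar table and a query that uses it.  The CalendarMonths table and its columns are assumptions for illustration, and YourTable stands in for the invoicing table with ProjectID, [Date] and Amount columns.

```sql
-- Hypothetical calendar table with one row per month.
create table CalendarMonths
(
    MonthStart datetime not null primary key,  -- first day of the month
    [Year]     int not null,
    [Month]    int not null,
    [Quarter]  int not null
)

declare @StartDate datetime, @EndDate datetime
select @StartDate = '20080101', @EndDate = '20090101'

-- Every project/month combination appears, with 0 for months that have no invoices.
select p.ProjectID, c.[Year], c.[Month], c.[Quarter],
       coalesce(sum(i.Amount), 0) as Amount
from (select distinct ProjectID from YourTable) p
cross join CalendarMonths c
left outer join YourTable i
    on i.ProjectID = p.ProjectID
   and i.[Date] >= c.MonthStart
   and i.[Date] <  dateadd(month, 1, c.MonthStart)
where c.MonthStart >= @StartDate and c.MonthStart < @EndDate
group by p.ProjectID, c.[Year], c.[Month], c.[Quarter]
order by p.ProjectID, c.[Year], c.[Month]
```

The calendar table has to be filled once with the months you care about; after that, months with no invoices show up on their own because the rows come from the calendar, not from the data.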

In this case, though, since you are outputting one column per month for a single year, I recommend simply using CASE expressions to “cross tab” your data.  You can alias your columns as M1,M2,M3…M12 and Q1-Q4 so that no matter what year you are running the report for, your columns will be consistently named, and you can let your presentation layer handle outputting nice column headers with the current year/month for each one.

So, all you really need is something like this:

select projectID, Y as [Year],
  sum(case when m=1 then amount else 0 end) as M1,
  sum(case when m=2 then amount else 0 end) as M2,
  sum(case when m=3 then amount else 0 end) as M3,
  sum(case when m in (1,2,3) then amount else 0 end) as Q1,
  …
  sum(case when m=12 then amount else 0 end) as M12,
  sum(amount) as Total
from
(
  select projectID, Amount, DatePart(Month, [Date]) as M, DatePart(Year, [Date]) as Y
  from YourTable
  where [Date] >= @StartDate and [Date] < @EndDate
) x
group by projectID, Y

Of course, you’d define @StartDate and @EndDate as ‘01-01-2008’ and ‘01-01-2009’, respectively.

In this comment, Stewy writes:

I have an issue with both DISTINCT and GROUP BY.

The issue is that using either one, the results comes back ordered as if using order by.

I need the unique results in the order they are in the database. How can I do this? Thanks

Stewy: Relational databases have no obligation to store data in any specific order, to keep track of the order in which rows were entered, or to return rows “as they are in the database.”  There is no such thing as getting data out “the way it is stored,” because the database may move or re-order data to execute a query efficiently, depending on the indexes available.  DISTINCT and GROUP BY often come back looking sorted simply because sorting is one way the engine finds duplicates; that order is a side effect, not a guarantee.  You must always explicitly specify how you want your results using an ORDER BY clause.  If you want to keep track of the order in which you added data to a table, you should have a datetime column (not SQL Server’s “timestamp” data type, which is a row-version number rather than a date) that records the moment each row was added via a DEFAULT value or a trigger.  Or, at the very least, you can use an IDENTITY.  Then, you can simply order by that column.  This is a very important concept to understand when working with relational databases.  Things are returned based on the data itself, not based on physical storage characteristics.  I hope this helps.
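Here is a small sketch of that approach.  The Visits table and its columns are made up for illustration.

```sql
-- Hypothetical table: the ID and CreatedAt columns record insertion order.
create table Visits
(
    ID        int identity(1,1) primary key,
    City      varchar(50) not null,
    CreatedAt datetime not null default getdate()
)

-- Unique cities, ordered by when each one was first added.
select City
from Visits
group by City
order by min(ID)
```

GROUP BY is used here rather than DISTINCT because it lets you order by an aggregate such as min(ID); with SELECT DISTINCT, SQL Server requires the ORDER BY columns to appear in the select list.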

In response to Always Use Parameters, Karuna writes:

Hi Jeff,
Just wondering if I build the Sql in Stored Procedure (Dynamic Sql) based on the parameters passed to stored proc, will it still be a possible candidate for Sql Injection? Basically I want to build the Sql in the stored procedure instead of doing it in .Net code as displayed in the article.

Dim cm As New SqlCommand("", YourConnection)
cm.CommandText = "DELETE FROM YourTable WHERE ID=@ID "
cm.Parameters.Add("@ID", SqlDbType.Int).Value = ID

If Name <> "" Then
    cm.CommandText &= " And Name=@Name"
    cm.Parameters.Add("@Name", SqlDbType.VarChar).Value = Name
End If
If TranDate <> DateTime.MinValue Then
    cm.CommandText &= " And TranDate = @TranDate"
    cm.Parameters.Add("@TranDate", SqlDbType.DateTime).Value = TranDate
End If

Hi Karuna: you are absolutely 100% safe from SQL Injection by doing this, provided the dynamic SQL in your stored procedure also passes user input as parameters rather than concatenating it into the string.  Remember, SQL Injection is not about general SQL concatenation or about building a SQL statement dynamically.  It can only happen when you concatenate user input into a SQL string and execute it.  If you put together a big SQL statement via concatenation but you only incorporate user input via parameters, there’s no need for scrubbing data or worrying in any way about SQL Injection; it will never happen, under any circumstance.
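Inside a stored procedure, the equivalent of the .NET code above might look like the sketch below.  The procedure name DeleteRows and the parameter sizes are assumptions; sp_executesql is the standard SQL Server system procedure for running a dynamic statement with typed parameters.

```sql
-- Hypothetical procedure; YourTable, Name and TranDate follow the example above.
create procedure DeleteRows
    @ID int,
    @Name varchar(100) = null,
    @TranDate datetime = null
as
begin
    declare @sql nvarchar(1000)

    -- Only fixed SQL text is concatenated; user values never become part of the string.
    set @sql = N'DELETE FROM YourTable WHERE ID = @ID'

    if @Name is not null
        set @sql = @sql + N' AND Name = @Name'

    if @TranDate is not null
        set @sql = @sql + N' AND TranDate = @TranDate'

    -- The values travel separately as typed parameters.
    exec sp_executesql @sql,
        N'@ID int, @Name varchar(100), @TranDate datetime',
        @ID = @ID, @Name = @Name, @TranDate = @TranDate
end
```

The version to avoid is the one that splices the value itself into the string, such as set @sql = @sql + ' AND Name = ''' + @Name + '''', because then whatever the user typed is executed as SQL.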

Avoiding SQL Injection is the easiest thing in the world: simply do things the easy and correct way and you’ll never need to worry about it.  Imagine a big controversy in the news about thousands of people crashing their cars because they drive with decorative tin foil covering their windshields, with everyone asking the experts “how can we solve this crisis?”  Should we cut holes in the tin foil, or add mirrors, or incorporate a camera and a TV monitor?  Uh ... no.  You should just take the tin foil off of your windshield, do things the easy, simple and correct way, and not overcomplicate things.  That’s basically what this whole SQL Injection thing is about: bad programmers doing stupid things when all they need to do is write decent code the easy way, simply by using parameters.

Gocs writes:

I have tried to compute the number of hours based on the datetime in MS SQL 2005.  However, I am not sure the hours is correct.  Do you have any idea on how to do it correctly?

Gocs — I think you really need to read this very carefully.  I’ll be waiting!

The Golden Rule of Data Manipulation

Introduction

There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved.  In case you haven’t noticed, that’s a common theme I discuss here on this blog — different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way.  This is yet another example of that situation.

The “Golden Rule of Data Manipulation” is a simple, but important rule that you should always follow when designing a database,  writing database code, or really writing any application code at all for that matter:

“It is always easier and more flexible to combine data elements rather than to break them apart”

In other words: Concatenation is easy. Parsing is hard.  Often, very hard — or even impossible depending on the data.

Problems with Parsing

It is amazing how often I see people struggling with “difficult SQL problems” such as:

  • Working with CSV lists of values in a single column, such as “1,3,56,2”
  • Breaking out a FirstName/MiddleName/LastName/Suffix from a single “Name” column
  • Parsing address strings into City/State/ZIP, or Number/Street/Unit
  • Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
  • Figuring out how to calculate the Day, Month, and/or Year from different string values such as “23-Jan-08”, “2008-02”, “20070303”, “03032007”

And on and on it goes….

Now, sometimes you inherit or import data that needs to be parsed — that’s a fact of life.  You’ve got to figure out how to do it, and the key in those cases is to accept that because the data itself is essentially random, nothing you can write will perfectly work 100% of the time on all of it.  Often, the best you can do is handle most of the data, and then do some manual clean up. 

Parsing strings can be a very difficult task for any programmer, and the challenge isn’t writing the code, it’s coming up with the algorithm (another common theme on this blog).  Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems:

    Oscar De La Hoya

How would your algorithm parse that one?  Never mind prefixes such as “Dr.” and suffixes such as “Jr.”! 
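To see the problem in action, here is a deliberately naive sketch of a parser that treats the first word as the first name, the last word as the last name, and everything in between as the middle name.  It is an illustration of the pitfall, not a recommended routine.

```sql
-- A naive split: first word = first name, last word = last name, the rest = middle.
declare @Name varchar(100)
set @Name = 'Oscar De La Hoya'

declare @FirstSpace int, @LastSpace int
set @FirstSpace = charindex(' ', @Name)
-- Position of the last space, found by searching the reversed string.
set @LastSpace  = len(@Name) - charindex(' ', reverse(@Name)) + 1

select left(@Name, @FirstSpace - 1)                                    as FirstName,
       substring(@Name, @FirstSpace + 1, @LastSpace - @FirstSpace - 1) as MiddleName,
       substring(@Name, @LastSpace + 1, len(@Name))                    as LastName
```

For this input the query returns FirstName “Oscar”, MiddleName “De La” and LastName “Hoya”, which is wrong: the last name is “De La Hoya” and there is no middle name.  Nothing in the string itself tells the code that.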

Please don’t interpret what I am saying as a programming challenge — I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases.  The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random.  It’s just like the old saying: “garbage in, garbage out”.  Still one of my favorites, after all these years, and it still applies!

A Data Model that requires Parsing = A Poor Data Model

So, we need to accept that sometimes you’ve got to parse data like this.  And that’s OK; it happens, it can be done, even if some manual work is often involved.

However, there’s no excuse when you design your database, your SQL code, or applications so that free-form data must be parsed, when you can simply design it correctly in the first place and store your data already broken out into the smallest possible units with the correct data types.

If breaking out a contact’s name into First, Last, Middle, etc. is important to your application, then you should force the point of data entry to accept input broken out into those columns.  The same goes for phone numbers, addresses, and so on.  Any time you have the option of accepting input as clean, short, raw segments of data, you should always do it.  Once you have data at that smaller resolution, it is trivial to combine it any way that you want for presentation, formatting, filtering, and so on.

It may seem like overkill to break out a phone number into 4 columns:

AreaCode
Exchange
Number
Extension

   
And, in fact, it might be more complicated than that if you need to deal with international phone numbers.  You may look at your tables, and your code, and even the UI that accepts these fields and think “that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!”

But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:

  1. Sort these numbers any way you want, without worrying about extra characters like parentheses or dashes, or leading 1s, messing things up
  2. Filter quickly on an area code without the need to use LIKE, and again without worrying about extra characters getting in the way
  3. Present the phone number quickly and easily any way you want without any parsing, be it as “123.123.1345 x123” or “(123) 123-1345 extension 123”, or anything you want.
  4. Validate your phone numbers, ensuring you have all the necessary parts and they are the proper length, without worrying about parsing strings
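A minimal sketch of what such a table and a query against it might look like follows.  The ContactPhones table name, the column sizes, and the CHECK rules are assumptions, and they only fit North American numbers.

```sql
-- Hypothetical table; column sizes assume North American numbers.
create table ContactPhones
(
    ContactID int not null,
    AreaCode  char(3) not null check (AreaCode like '[0-9][0-9][0-9]'),
    Exchange  char(3) not null check (Exchange like '[0-9][0-9][0-9]'),
    Number    char(4) not null check (Number like '[0-9][0-9][0-9][0-9]'),
    Extension varchar(6) null check (Extension not like '%[^0-9]%')
)

-- Filter on an area code with a plain equality test, and format on the way out.
select ContactID,
       '(' + AreaCode + ') ' + Exchange + '-' + Number
           + coalesce(' x' + Extension, '') as Phone
from ContactPhones
where AreaCode = '123'
order by AreaCode, Exchange, Number
```

The validation lives in the table definition, the filter is a simple comparison, and the display format is one concatenation that any client could just as easily do differently.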

Consider doing any of those things if your data is stored in random strings like:

1-123-124-1234
(123) 124-1234
123-124-1234, ex. 123
123.124.1234 x123
(123)124.1234 ext. 123
1 123 124 1234 123

   
and so on …  Not so easy in that case, just as parsing simple “Name” columns into First/Last, or addresses into Number/Street/Unit, is not so easy either.

Again, this is not a programming challenge; I am sure it can be done.  (In fact, phone numbers are generally the easiest because you can usually just ignore anything other than digits.)  Most of us have done it before.  But designing something in such a way that parsing is required to do simple filtering, sorting, or formatting is bad design.

It’s Not About the UI

As I wrote here, you should never think “I want to display phone numbers like 123.123.1234, so I should store them and return them that way.”  You should always think “How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?”

So, what if you need fine detail when storing addresses, but you don’t want your UI to present Street Number, Street Name, Unit Type, Unit Number as different data entry fields for usability or aesthetic reasons?  That’s fine, but that doesn’t mean you should not set up your database properly.  Your UI can certainly still present that one single “Address” text box for the user to fill out, parse that text at data entry, show the user the parsed result in multiple fields, and ask “Please verify your address” or something along those lines.  If the result is not right, the user can tweak it and then save it.  If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient.

All of this applies not only to data storage, but to how data is returned and passed between tiers as well.  Again, if you just return separate columns to your client application, instead of focusing on making them “look nice” in your database code by returning nothing but long, “pre-formatted” strings, your client can simply concatenate and format those columns any way it needs.  And, different clients can format that same database output in different ways — all without ever altering any database code! 

Conclusion

In short, remember that writing concatenation is easy, efficient, and exact.  Writing a parsing routine, on the other hand, is often none of those things.   You may not always be able to control the design of the data you are working with, but be sure that when you can, you do it right.  If you find yourself using lots of LIKE expressions, or string parsing for simple data retrieval operations, something is wrong.   Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone.

Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always applies.  Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn’t quite as hard as you thought it was.

The Joy of Blog Feedback

Introduction

I have been writing my little blog here for some time now, and my favorite part of doing this is of course the feedback.  It’s always great to hear from the readers, to have mistakes corrected, to debate various topics and techniques, and to learn a lot about SQL and the various topics I discuss here. 

At this point, I have received over 1,700 comments over the years, and while all of them are truly appreciated, I have noticed that unfortunately many of the, uh, less helpful comments do seem to consistently fall neatly into various categories. 

Let’s take a look at an example of a simple, typical blog post and some of the responses that often come back.  If you write a blog of your own, or often read the feedback from other blogs, many of these may seem familiar to you. 

A Typical Blog Post


Today, I have a simple tip for beginner SQL programmers.  When writing a SELECT, you can add a WHERE clause to filter the results that are returned.

For example, to only return rows for CustomerID 345, you can write:

SELECT …
FROM YourTable
WHERE CustomerID = 345

As you can see, it is very simple. You can use any boolean expression to filter the results as needed. Try it out!  If you have any questions, let me know.

Some Typical Responses

The subtle blog spammer (that you initially mistake for a nice compliment):

Very helpful site! Good advice!  From Joe at www.some-random-site-that-has-nothing-to-do-with-sql.com

The person who doesn’t seem to get it:

Ummm … what good does this do when I want to sort? You should fix the code.

The person that really doesn’t get it:

The problem with that is it will only return results for one customer.

The person that somehow takes away the exact opposite of what you wrote:

I disagree, this will not return all customers other than 345 and this is definitely not something for advanced SQL Programmers, it’s probably better for beginners.

The script kiddie (who just wants to cut and paste your code, not read or learn anything):

LOL, that doesnt even run 4 me!  I get errorz that sez “YourTable” does not exist!  Plz help!!  thnx!

The very clichéd, mindless “anti-Microsoft ranter”:

You only have to use WHERE clauses because Bill Gates wants more $$, you are a shill!! Micro$oft sucks, you should use an iPhone for this!  MySQL automagically filters results for you!

The “skimmer” (who just skims the post missing most of it):

Nice, but is there any way to filter for just one customer?

The “repeater” (who just repeats what you’ve already written):

A better solution is to write WHERE CustomerID = 345, it works better.  It is also fast because less rows are returned.  Using WHERE is a good way to filter a SELECT.

The “know-it-all complainer”:

That is the stupidest advice I ever read, why would you want to ever do this? Just use a parameter, or an ORM tool– this will not scale!  I sure hope CustomerID isn’t a VARCHAR — then you have an implicit conversion happening, your indexes are shot, your server will overheat, and your wife will leave you for your mechanic.  Also, 345 is too large if CustomerID is a tinyint.

The random, unrelated question asker:

Good advice. Thnx.  How to insert into the table?

The “misunderstander”:

If I add this to all of my scripts, only data for one customer will ever be returned.  I am not sure this is a good idea. Also, this code will not work in Java and doesn’t follow the HTML 4.0 specification. 

The very rare polite and helpful typo alerter:

Hey there, you have a typo in the first sentence — should be “filter”, not “fitler”! Just letting you know, thanks for a great post!

The much more common typo alerter:

You wrote FITLER not FILTER, your an idiot!! if you cannot write English how can you write SQL ???   Learn to spell!

Summary

Please, don’t misunderstand, I mean this all in good fun.  I love feedback, and please, keep it coming.  It’s what makes this and every other blog a fun place to visit.

In fact, I realize that I left out the most annoying feedback of all!  That’s right, the Thin-Skinned, Overly-Defensive Blog Author Who Feels the Need to Respond to Everything:

Did you even read what I wrote? I did not say that.  And, yes, I did spell “monkey” wrong, so sue me!  Remind me to fire my editor…. or maybe I should refund your subscription fee?  Oh, wait, this blog is free!  So what the heck are you complaining about?  Why don’t you go bother some MySQL blogger?  I hear they usually write at a 5th grade level which is probably more appropriate for your intellect. Jerk!

Yeah, comments like those are definitely the worst of all!  Thank you for putting up with my feedback, now that I think of it!

The Truth about “Cursor Busting” in SQL

Let’s say you are called in to troubleshoot a stored procedure that is performing poorly.

You dive in to investigate and this is what you find:

create procedure ProcessProducts

as

    declare @Products cursor, @ProductID int

    set @Products = cursor for select ProductID from Products order by ProductID

    open @Products



    fetch next from @Products into @ProductID



    while (@@FETCH_STATUS=0)

        begin

        exec DoSomething @ProductID
        fetch next from @Products into @ProductID

        end



    deallocate @Products

Ah ha! A cursor!  It seems we have identified the bottleneck: Clearly, the performance problems are because the code is not doing things in a set-based manner, but rather by processing rows one at a time using a dreaded cursor.  This cursor is opening up the Products table, looping through the rows one at a time, and calling the “DoSomething” stored procedure for each ProductID.  As we all know, cursors are not the way to go when writing SQL code; this cursor should be eliminated and replaced with a cleaner, more efficient (and more socially acceptable!) solution.

So, how do we optimize this?  Well, a commonly suggested approach is to eliminate the CURSOR by replacing it with a WHILE loop:

    declare @ProductID int
    set @ProductID = -99999

    while (@ProductID is not null)
        begin
        set @ProductID = (select top 1 ProductID
                          from Products
                          where ProductID > @ProductID
                          order by ProductID asc)

        -- skip the call once we have run past the last product
        if (@ProductID is not null)
            exec DoSomething @ProductID
        end

Instead of declaring a CURSOR to loop through the table, we now are using “set-based” code and our problems seem to be solved.  The cursor is gone, our code looks much cleaner, we’ve tested it and it works properly, so off to production it goes.  Another cursor has been busted!

Right?

Actually … no.

You see, eliminating cursors is not about syntax.  It is not about searching for the word “cursor” in your code and just replacing it with a WHILE loop that does the same thing.  Optimizing and replacing cursors involves much more.  We can never optimize any cursor code until we look deeper into what exactly is happening when we “process” each of those rows.  In this case, we need to find out what that “DoSomething” procedure is actually doing. 

Suppose the DoSomething procedure is generating a report and sending an email to the “Product Manager” for each product that contains status information, and then logging this email message into a table somewhere.

If that is the case, what have we just gained by replacing our CURSOR?  

Honestly — not much,  if anything at all.  Because of the task at hand, we may very well need to process rows in the Product table one-by-one to send our emails and generate the report, and the bottleneck here is not the cursor code at all, but rather the report generation and maybe sending the email.   Eliminating the cursor code probably gains us nothing here.  If you need to process rows one at a time, go ahead and use a cursor — that’s what they are there for!   Replacing a perfectly fine, simple cursor with a WHILE loop might even make your code longer, or more confusing, or even less efficient depending on circumstances. 

For example, what if we need to process the Products ordered by Region, then Product Name, for whatever reason.  Our cursor code is simple:

set @Products = cursor for

    select ProductID

    from Products

    order by Region, ProductName

All that we needed to change was our ORDER BY clause.  Now, how would we write this as a WHILE loop?  Is it possible?  Sure.  Will it be as simple and clean as using a cursor?  No, it won’t.  (Though ROW_NUMBER() makes this much easier than it used to be.)
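For comparison, here is one way the WHILE loop version could be written with ROW_NUMBER().  This is only a sketch: the @Work table variable is an invented helper, and ProductID is added to the ordering as a tie-breaker, which the cursor version did not need.

```sql
-- Number the products in the required processing order.
declare @Work table (RowNum int primary key, ProductID int)

insert into @Work (RowNum, ProductID)
select row_number() over (order by Region, ProductName, ProductID), ProductID
from Products

declare @i int, @max int, @ProductID int
select @i = 1, @max = max(RowNum) from @Work

-- Still one row at a time; only the looping mechanism has changed.
while (@i <= @max)
begin
    select @ProductID = ProductID from @Work where RowNum = @i
    exec DoSomething @ProductID
    set @i = @i + 1
end
```

It works, but it is longer than the cursor, it copies the keys into a second table first, and it still calls DoSomething once per row.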

Now, I am not here to say that cursors are “good”, but if you really need to process rows one by one, go ahead and proudly use a cursor.   Replacing cursors isn’t about processing rows one-by-one in a different way (i.e., using a WHILE loop instead), it is about not processing rows one-by-one at all!   

Let’s consider another scenario: What if the DoSomething stored procedure checks whether the Product’s ExpireDate is greater than today’s date and, if so, updates the Status column for that Product to ‘X’?
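In that scenario, the body of DoSomething might look something like this.  This is a guess at the procedure for the sake of the example, using the Status and ExpireDate columns just described.

```sql
-- Hypothetical body for DoSomething in this scenario.
create procedure DoSomething
    @ProductID int
as
    -- Touches exactly one row per call.
    update Products
    set Status = 'X'
    where ProductID = @ProductID
      and ExpireDate > getdate()
```

Each call is a separate single-row UPDATE, so a table with thousands of products means thousands of separate statements, however the calling loop is written.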

In that situation, what have we gained by rewriting ProcessProducts without a cursor, and using a WHILE loop instead?   The answer is, once again: nothing!  In fact, we potentially have once again made our code more confusing or even less efficient than a cursor might be!  Remember, the bottleneck isn’t the cursor syntax — it is the fact that we are processing rows one at a time.  Replacing the cursor with the WHILE loop didn’t solve this problem, did it?  

So, looking now at both of the scenarios I presented for the DoSomething stored procedure, it should be clear that we did not fix anything by replacing the cursor in either case simply by writing a WHILE loop.  If that’s all you are doing, don’t bother replacing the cursor at all.  You haven’t optimized anything.

As I said before, the art of replacing a cursor is not a find-and-replace syntax change operation; it is a fundamental change in how you process your data.  As in the Product report generation and email example, it may be that we simply need to process rows one by one, and thus no further optimization is possible from a SQL point of view.  In situations like updating the Product table, however, we do not need to process the rows individually; we can do everything in one single UPDATE statement.  Thus, in order to determine how to optimize the ProcessProducts stored procedure, we needed to dig deeper into the entire process as a whole, which included examining the DoSomething stored procedure and determining the full scope of exactly what this “ProcessProducts” stored procedure is doing.

So, if “DoSomething” is updating the Products table as specified, we now know that a good replacement for our cursor code doesn’t result in a WHILE loop and calling a separate stored procedure over and over at all; it results in a true, set-based solution:

create procedure ProcessProducts
as
    Update Products set Status='X' where ExpireDate > getdate()

   

And THAT is how you optimize a cursor! No loops, no calling of another stored procedure for each row in a table, no “find-and-replace” cursor code removal.  We examined the entire process, and rewrote the entire process, to get it done faster and with less code, without cursors or loops.

Always remember: Replacing a cursor isn’t about rewriting your syntax, it is about redesigning your algorithm.

Log Buffer #98

Hello and welcome to the 98th edition of Log Buffer. My name is Jeff Smith and I will be hosting this week’s exciting episode. If, for some reason, you are not completely satisfied with this edition, simply write in and complain to Dave over at The Pythian Group and you will receive Log Buffer #99 absolutely free! Now that is a guarantee you can feel good about. OK, let’s get to work.

I have only limited exposure to both PostgreSQL and MySQL, but I have often wondered why MySQL is so popular while it seems that PostgreSQL has the superior features.  Over at Xaprb, they attempt to answer that very question.  Be sure to read the comments from that post, and check out the big discussion from that article over at reddit as well.  The theory I like the best?  MySQL is easier to pronounce!  (How do you pronounce “PostgreSQL” anyway?)

Speaking of MySQL, Sheeri Cabral points out that MySQL’s website certainly doesn’t do the product any favors, and there’s also a good discussion at Xaprb on why MySQL is Free Software but not Open Source.  If you ever wanted to add a new Unicode collation to MySQL, Alexander Barkov and Peter Gulutzan provide all the information you’ll need.  Peter at the MySQL Performance Blog tells us that MySQL lacks a good memory profiling tool, and based on his feedback, others seem to agree.  (No, not those Others!)  Speaking of MySQL feature requests, Justin Swanhart asks “Why does INFORMATION_SCHEMA fail to show information about TEMPORARY tables?” and also lets us know that his materialized view stored procedures for MySQL have been OKed for release.  Sunny Walia (what a great name; is it possible to not be a fun person with a name like that?) tells us how to install innotop to monitor InnoDB information in real time and wonders “Oh dear MySQL slave, where did you put those rows?”  Going back to the MySQL Performance Blog, Vadim warns us of a dangerous MySQL command; be sure to keep that one locked safely away from the kids.

Regarding a product I actually know a little about, Kalen Delaney has a nice list of Free SQL Server Troubleshooting Tools to check out.  If you haven’t seen it yet, my co-blogger here at SQLTeam, Mladen, has an amazingly popular list of Free SQL Server Tools that might make your life a little easier; it was published a while back, but it is always worth mentioning.  While you are visiting Mladen’s blog, don’t miss his latest post on getting immediate deadlock notifications for SQL Server 2005.  Also, be sure to leave him lots of comments telling him that his blog is great but that he is your second favorite SQL Server blogger (after me, of course!).

Still on the topic of SQL Server, Jamie Thomson provides us with a tip for ensuring that your root folder is valid when using SSIS.  Denis Gobo asks: What did you do to master SQL?   (Interestingly enough, for me it was by learning MS Access first!)  Tony Rogerson warns us of the performance implications of using Row_Number() in non-recursive CTEs.  And Paul S. Randal describes a CHECKDB bug that people are hitting; thankfully, he says that “you can only hit this bug if you ALREADY have corruption, that it’s quite rare, and that there is a workaround.”

Everyone enjoys a good analogy, right?  After all, a good analogy is like an ice cream cone: they both are … hmmm … OK, well, that’s not a good analogy at all.  Never mind. Speaking of bad analogies, I bet that unlike Peter Gulutzan you never really thought about the expression “half baked” before and how it relates to MySQL features.  Well, now’s your chance!  (Of course, a pessimist would prefer “half un-baked”, but that’s a discussion for another time.)

A big topic lately has been SQL Injection attacks.  I always find this funny because this is the easiest problem to avoid in the history of programming; as CodeAssembly tells us, “Never concatenate user input to your queries, without exceptions.”  That’s really all there is to it — do that, and you are good to go.  As I’ve written before, using parameters is not only safer, but your code is much shorter and simpler than if you concatenate strings all day long. 
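To make the difference concrete, here is a minimal T-SQL sketch of both approaches.  The Users table, its columns and the @name value are made up purely for illustration; sp_executesql is SQL Server's built-in procedure for running a dynamic statement with real parameters.

```sql
-- Hypothetical input, standing in for something a user typed into a form.
declare @name nvarchar(100)
set @name = N'O''Brien'

-- Risky: the input becomes part of the SQL text itself, so a stray quote
-- (or something far worse) changes the statement that gets executed.
declare @unsafe nvarchar(400)
set @unsafe = N'select UserID, LastName from Users where LastName = '''
              + @name + N''''
-- exec (@unsafe)   -- left commented out on purpose

-- Safer: the SQL text is fixed and the input travels as a typed parameter.
declare @sql nvarchar(400)
set @sql = N'select UserID, LastName from Users where LastName = @LastName'

exec sp_executesql @sql, N'@LastName nvarchar(100)', @LastName = @name
```

Inside a stored procedure you usually don't even need dynamic SQL: referencing the procedure's own parameter directly in the WHERE clause gives you the same protection with even less code.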

Federico Cargnelutti gives us an introduction on managing and applying database changes with LiquiBase, an “open source, DBMS-independent library for tracking, managing and applying database changes.”  I have never used LiquiBase, but it sounds like something worth looking into.

While reading Magnus Hagander’s PostgreSQL Blog, I found out that Yahoo claims it has the largest SQL database in a production environment, and they use PostgreSQL.  Impressive!  Peter Eisentraut checks in from PGCon Day One, which included a presentation of his on porting Oracle Applications to PostgreSQL.   For those of you out there using Mac OS X, Perldiver has summarized instructions on building PostgreSQL on Mac OS X.  Going off on a tangent, they just opened a new Apple store here in Boston on Boylston Street.  I visited it this weekend after getting my usual bad haircut next door.  My verdict on the store?  It sure looks nice, but I had no luck finding a new 5 1/4″ floppy drive for my Apple II.  Try to do better next time, Apple!

James McGovern offers some praise for Mark Wilcox of Oracle.  Why? Because Mark has been doing some must-read blogging over at the Oracle.com blogs.  Getting back to my favorite topic, which is coding SQL, Michael Armstrong-Smith instructs the Oracle crowd on using CASE to solve Outer Join issues.  Shay Shmeltzer provides some tips on creating a master with two details on the same page when using ADF.

Sticking with Oracle links, Pete Finnigan ponders read only tables or read only users, and notes that in Oracle a read-only user “has approximately 27,000 other privileges because of grants to PUBLIC. This is the killer issue as because of this it is in fact not possible to create a read-only user.” Hmm … only 27,000?  Come on, that doesn’t seem that bad to me!  Eddie Awad tells us about the Lazy Developer’s way to populate a Surrogate Key and over at the Oracle Scratchpad, Jonathan Lewis provides some helpful links on Index Efficiency.  Finally, if you are looking to install Oracle Database 11g Release 1 on Fedora 9 (and who isn’t?), everything you need to know is covered over at Oracle-Base.

Now, if you’re like me, you hate DBAs.  Ah, just kidding, of course we all love our Database Administration Overlords (and I’m not just saying that because most of the people reading this probably are DBAs).  However, even the best DBAs out there occasionally make mistakes.  If you have some horror stories of your own to share, or if you simply want to take pleasure in the misfortune of others, be sure to check out Kalen Delaney’s call for DBA Blunders.  (Of course, to be fair, even us developers occasionally make mistakes.) To help avoid future blunders, consider this advice: Do not use Windows System Restore as a backout plan for SQL Server Service Packs, Cumulative Updates, or HotFixes.  Also, Tara here at SQLTeam reminds us to optimize your tempdb and even provides a helpful script.  I’d like to add my own helpful tip for DBAs:  Schedule regular database backups!   Remember, you read it here first.

Previewing the upcoming SQL Server 2008, SQLTeam’s Derek discusses the Data Profiling Utility with SQL Server 2008.  It sure seems like a nice tool, but that still may not make people any less nervous about SQL 2008.  (Heck, my team is still nervous about SQL Server 2005!)   Aaron Bertrand urges people to vote if you want IntelliSense in SSMS 2008 to also support SQL Server 2005, which seems like a great idea to me, and Linchi Shea has a quick analysis of SQL Server 2008 Page Compression and its performance impact on table scans.  Finally, Jamie Thomson dissects the fuzzyness of SQL Server 2008.   To me, “fuzziness” is what happens to my vision after drinking too many mojitos, but Jamie is discussing a new feature in SSIS 2008 so give it a read.

I’ve always believed that you don’t truly know all there is to know about databases until you understand the raw data structures of tables and indexes and so on.  Over at MSDN Channel 9, there’s a new series of videos on Data Structures and Algorithms, so be sure to watch if you want to know how database engines really work “under the hood”.  My enjoyment of the video was unfortunately interrupted by horrible flashbacks from my CS310 days.

Moving away from relational databases, Jim Wilson helps us to understand HBase and BigTable.  Apparently, HBase is the open source implementation of Google’s BigTable database, which is described as a “sparse, distributed, persistent multidimensional sorted map.”  In layman’s terms, that means “a database with wicked huge tables.”

For those looking for a laugh, be sure to read Andrew Calvett’s MS SQL Server Book of Wisdom.  It reminds me quite a bit of my infamous and widely misinterpreted Top 10 Things I Hate About SQL Server post from way back in the olden days.  Be careful, Andrew: sometimes folks don’t get it if your jokes are too subtle!  (Of course, in my case, it could be that my jokes just weren’t that funny.)

Well, that’s all for this week.  Thanks, Dave, for giving me an opportunity to write this week’s Log Buffer.  It was a lot of fun and a welcome opportunity for me to spend more time than I usually do reading lots of great blog posts from around the internet.  Have a great weekend everyone!

Implementing “Interfaces” in SQL

My latest article has just been published over at SQLTeam:

    Implementing Table Interfaces

When I wrote a Table Inheritance article a few months back, the technique shown was pretty standard and straightforward.  As I was writing it, I thought it would be an interesting challenge to figure out a way to implement table interfaces as well, where different tables don’t inherit from the same base class, but they still “implement” the same relations.  That definitely was not as easy, and the end result isn’t as clean and direct, but I hope this at least provides some ideas and at the very least it should provoke interesting comments and alternative approaches.

Need an Answer? Actually, No … You Need a Question

Welcome!

The reason you were directed here is because you need assistance, and I am here to help.  I am not, however, here to provide you with any answers!  You see, it looks like the assistance you need is not finding an answer; it is rather that you need assistance finding a question.

As you know, there are all kinds of questions.  Questions that test memory recall.  Questions that test logic skills.  Brain-teasers and mathematical questions and so on.  But there is one requirement that all good questions must have in common before they can be answered:

A proper question MUST provide ALL of the information necessary in order for an answer to be given. 

In other words, if you omit important information from a question, it doesn’t matter how simple or easy that question is:  It suddenly becomes very difficult, or even impossible, to answer. 

For example, consider the following question:

“Am I wearing a hat?”

Seems pretty easy, right? No logic, no memorization, no trivia, no knowledge of any specific topic is required. 

So … what’s the answer?  Take a few minutes, think about it, write it down on a piece of scrap paper.  I can wait, take your time …

What’s that, you say? You can’t answer that simple question!?  Why not?  I stated it very clearly, it requires a simple YES or NO response, there’s nothing tricky there.  So, why would anyone have any trouble giving an answer to something so basic and simple?

The reason, of course, is because you can’t see me.  You have no way of knowing what I am wearing because I did not provide you enough information!  As simple as it is, it cannot be answered; therefore, it is not a proper question!

Suppose, instead, I provided a picture of myself and asked “In this picture, am I wearing a hat?”  And, in the picture, my head is clearly visible and the fact that I am wearing a Boston Red Sox cap is very clear.  Would you be able to answer the question in that scenario?  Of course!  Suddenly, what was an impossible question to answer became very simple! 

How did that happen?

It happened, of course, because I provided you enough information to answer the question!  And that is often the problem with many of the questions we see day to day in forums asking for help.  You cannot expect an answer unless you provide a proper question with all the necessary information.  The majority of the effort by those helping others in these forums is not spent answering questions, it is spent trying to figure out what the heck the question actually is!  And that is the problem; people don’t seem to realize that they can’t just randomly cut and paste code or ask vague questions without any context and expect to receive help!

Yet, requesting more information and details doesn’t always go over so well … Those looking for help seem to often have trouble understanding why the helpers need more info, why they are so “anal” and “demanding” about minor things like database schemas, or sample data, or code samples.  “How is that important?” they wonder. “Just answer the question and provide me with some help, please!  An expert would know the answer!”

My goal today is to help you understand why information and context are so important for even simple, basic questions, and how providing that information suddenly transforms a poor question into a very good one that can be quickly and accurately answered.

Let’s try another example:

“How do I get from work to Fenway Park?”

So, what do you think?  Is that a good question that provides all of the information?  It seems simple, right? It is just asking for basic directions. Yet, where is “work”?  Is it asking for driving directions, or walking directions, or maybe which subway lines to take?  Who knows!!?

As stated, this question simply cannot be answered!   Yet, if I just  thought about it a little and made sure to provide all of the necessary information, this “impossible” question with no answer suddenly becomes a very simple one:

“How do I get from 125 High Street, Boston, MA to Fenway Park via the subway?”

See the difference? Instead of just assuming that everyone knows where I work and what mode of transportation I am looking for, if I make sure to simply tell them, there is no uncertainty, no confusion, no guesswork, and the question can be answered.  This isn’t rocket science, right?  Yet, these common-sense basics seem to elude many, many, people!

Would you drop your car off at a mechanic with a note on it that says:

“Car doesn’t work.  Plz fix.  It is urgent! Thx!!”

I sure hope not.  You’d explain what’s wrong, right?  It has trouble starting, it has a flat tire, there’s smoke coming from beneath the hood, the steering wheel fell off, and so on. It’s basic common sense that you would do everything you can to be sure that the mechanic has the information he needs to fix your car correctly and promptly, right?  Shouldn’t that same logic also apply when asking for help in forums?

Finally, let’s try a SQL Server question:

“What is wrong with teh codez? it does not work!  Plz Help!



select SaleID, Customer, Qty, Price

from SalesNumbesr

Thnx!”

Take a look at that question.  It is a very simple SQL statement, right?  There’s nothing there that a beginner could not understand.   Can you “spot” the problem and fix it? 

Hmmm … maybe you can, maybe you can’t.  You can’t really be sure, can you?  At this point, we can all try to guess what the problem is.  What does “it does not work” even mean? Is “SalesNumbesr” a typo?  Should it be “SalesNumbers” ?  Is it returning too much data?  not enough data?  Incorrect data?  Is it generating an error?  And so on. 

We could spend all day trying to guess what the question is and provide answers to those guesses, but if the guesses are wrong, the answers won’t be so helpful, will they?  On the previous question, what if you guessed that I work in Cleveland and provided directions for me to Fenway Park from there?  Would that be helpful to me? Probably not, right?  Most likely, it just wasted everyone’s time.

So, getting back to the code …. what is wrong with it?  Well, in SQL terms, the answer is: NULL!  It does not exist.  Until we are provided with more information, the question cannot be answered. Thus, it is not a question at all, just an incomplete fragment.  As simple as the question looks, as basic as the T-SQL is, this “question” will stump even the greatest “experts” out there because an answer to this question simply does not exist.

If more information is provided, like this:

“Hi — I currently have the code below:



select SaleID, Customer, Qty, Price

from SalesNumbesr

I would also like to return the total Amount for each Sale, which is the Qty multiplied by the Price.  However, I am not sure how to add this to my current code.  Can anyone please help?”

Suddenly, the question now is very clear and the answer is very simple!  They just want to know how to add an expression to the result set. Just by providing a little more information, and not assuming that everyone knows what is happening outside of the context of what was written, something that was impossible to answer has become very easy. 
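With the question stated that clearly, an answer might look something like the sketch below.  It assumes that Qty and Price are numeric columns and keeps the table name exactly as it was written in the question.

```sql
select SaleID, Customer, Qty, Price,
       Qty * Price as Amount   -- the new expression, calculated for each row
from SalesNumbesr              -- table name as given in the question
```

Notice that nobody had to guess anything to write that.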

I sure hope this is making sense.

In fact, taking this whole article to its logical conclusion, I think we can safely say:

“The more accurate and detailed information a question provides, the more accurate and detailed the answers will be.”

In other words, a vague, incomplete question can only get, at best, vague, incomplete answers.  But a question that spells out the entire situation very clearly will get, at best, a very clear and specific answer that works in that situation. 

So, please, think of this when you ask questions in a forum.  Consider the fact that no one knows your specific environment, or code, or application, or database except for YOU.  And, no one can help you unless you are providing enough information for them to do so. 

Just like the mechanic. Or someone giving driving directions.  Or a doctor when you are sick.   You provide them with the necessary information so they can help you, right?  Consider doing the same to those providing you with (free!) programming advice.

. . .

(Feel free to provide this link to those who seem to have trouble understanding that you cannot read their mind when assisting them with programming help on forums.)

GROUP BY ALL

Here’s an obscure piece of SQL you may not be aware of:  The “ALL” option when using a GROUP BY.

Consider the following table:

Create table Sales

(

    SaleID int identity not null primary key,

    CustomerID int,

    ProductID int,

    SaleDate datetime,

    Qty int,

    Amount money

)



insert into Sales (CustomerID, ProductID, SaleDate, Qty, Amount)
select 1,1,'2008-01-01',12,400 union all
select 1,2,'2008-02-25',6,2300 union all
select 1,1,'2008-03-02',23,610 union all
select 2,4,'2008-01-04',1,75 union all
select 2,2,'2008-02-18',52,5200 union all
select 3,2,'2008-03-09',99,2300 union all
select 3,1,'2008-04-19',3,4890 union all
select 3,1,'2008-04-21',74,2840



SaleID      CustomerID  ProductID   SaleDate                Qty         Amount

----------- ----------- ----------- ----------------------- ----------- ---------------------

9           1           1           2008-01-01 00:00:00.000 12          400.00

10          1           2           2008-02-25 00:00:00.000 6           2300.00

11          1           1           2008-03-02 00:00:00.000 23          610.00

12          2           4           2008-01-04 00:00:00.000 1           75.00

13          2           2           2008-02-18 00:00:00.000 52          5200.00

14          3           2           2008-03-09 00:00:00.000 99          2300.00

15          3           1           2008-04-19 00:00:00.000 3           4890.00

16          3           1           2008-04-21 00:00:00.000 74          2840.00



(8 row(s) affected)

Suppose we’d like to see the customers that were sold Product #1 along with the total amount that they spent.

We would basically write a simple SELECT with a GROUP BY like this:

select CustomerID, sum(Amount) as TotalAmount

from Sales

where ProductID = 1

group by CustomerID

And sure enough, we’d get our answer:

CustomerID  TotalAmount

----------- ---------------------

1           1010.00

3           7730.00



(2 row(s) affected)

Now, let’s say that we’d like to see all customers that have been sold any products, but we still just want to see the “TotalAmount” for ProductID #1.  For customers that have never ordered ProductID #1, it should output a “TotalAmount” value of $0.   One way to do this is with a CASE expression; instead of filtering so that only ProductID #1 is returned, we can conditionally SUM() the Amount only for orders for ProductID #1.  Like this:

select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as TotalAmount

from Sales

group by CustomerID



CustomerID  TotalAmount

----------- ---------------------

1           1010.00

2           0.00

3           7730.00



(3 row(s) affected)

That gives us the results we want.   Because we are not using a WHERE clause to filter the data, we see an entry for CustomerID #2 in the output. 

However, in situations where you have written the above SQL, you could actually replace the SUM(CASE…) expression by using GROUP BY ALL, instead of just a standard GROUP BY, like this:

select CustomerID, sum(Amount) as TotalAmount

from Sales

where ProductID = 1

group by all CustomerID



CustomerID  TotalAmount

----------- ---------------------

1           1010.00

2           NULL

3           7730.00

Warning: Null value is eliminated by an aggregate or other SET operation.



(3 row(s) affected)

Notice that all Customers are now returned, and a NULL is shown as the TotalAmount for Customer #2, who has no orders for ProductID #1 …  even though the WHERE clause seems to indicate that we should not be seeing customer #2 in the results!

The ALL option basically says “ignore the WHERE clause when doing the GROUPING, but still apply it for any aggregate functions”.   So, in this case, the WHERE clause is not considered when generating the population of CustomerID values, but it is applied when calculating the SUM.  This is very much like our first solution, where we removed the WHERE clause completely, and used a SUM(CASE…) expression to conditionally calculate the aggregate. 

Values that are excluded from the aggregation according to the WHERE clause have NULL values returned, as you can see in the result.  A simple ISNULL() or COALESCE() will allow us to return 0 instead of NULL:

select CustomerID, isnull(sum(Amount),0) as TotalAmount

from Sales

where ProductID = 1

group by all CustomerID



CustomerID  TotalAmount

----------- ---------------------

1           1010.00

2           0.00

3           7730.00

Warning: Null value is eliminated by an aggregate or other SET operation.



(3 row(s) affected)

Notice that the warning about NULL being aggregated still displays, since that is the standard behavior in SQL Server when you calculate an aggregate on a NULL value.  You can turn these warnings off if you like by issuing a set ANSI_WARNINGS off command before your SELECT; the setting stays in effect for the rest of your session, or until you turn it back on.
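As a rough sketch, using the same Sales table as above, suppressing the warning around a single query could look like this.  Switching the option back on afterwards is my own suggestion rather than a requirement, since ANSI_WARNINGS controls more than just this one message.

```sql
-- Suppress the "Null value is eliminated by an aggregate" message
set ANSI_WARNINGS off

select CustomerID, isnull(sum(Amount),0) as TotalAmount
from Sales
where ProductID = 1
group by all CustomerID

-- Put things back the way we found them
set ANSI_WARNINGS on
```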

GROUP BY ALL is kind of obscure and neat to know, but not really useful in most situations since there are usually easier or better ways to get this result.  For one thing, this won’t work if we want all Customers to be displayed, since a customer must have at least one sale to show up in the result.  If we want to see all customers, even those that have never ordered, we would need to do a LEFT OUTER JOIN from the Customers table to our Sales aggregate SELECT:

create table Customers (CustomerID int primary key)

insert into Customers

select 1 union all

select 2 union all

select 3 union all

select 4



-- Notice that we have 4 customers, but our Sales data has sales for only 3.



select c.customerID, isnull(s.TotalAmount,0) as TotalAmount

from Customers c

left outer join

    (select customerID, sum(Amount) as TotalAmount

     from Sales

     where ProductID = 1

    group by customerID) s on c.customerID = s.customerID

   

customerID  TotalAmount

----------- ---------------------

1           1010.00

2           0.00

3           7730.00

4           0.00



(4 row(s) affected)

That is typically the standard way to return data for an entire population, regardless of existing transactions.  GROUP BY ALL gets us close, but if a new customer has never made an Order, they will never show up in the results.   Of course, depending on your needs, that may be what you want.

Another limitation is that we cannot use GROUP BY ALL if we want to return a grand total for all orders, along with the total just for ProductID #1.  For example, using the SUM(CASE…) expression along with a regular SUM(), we can do this:

select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as Product1Amount,

    sum(Amount) as TotalAmount

from Sales

group by CustomerID



CustomerID  Product1Amount        TotalAmount

----------- --------------------- ---------------------

1           1010.00               3310.00

2           0.00                  5275.00

3           7730.00               10030.00



(3 row(s) affected)

That lets us calculate two different totals all in one pass through the table.  However, we cannot translate that using GROUP BY ALL, because while we will be able to return the Product1Amount, there would be no easy way to also get the TotalAmount for all products without an additional join or sub-query.
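To illustrate what that additional join might look like, here is one possible sketch against the same Sales table.  The derived table aliases p and t are just names I picked; the first derived table uses GROUP BY ALL for the Product #1 total, and the second is a plain GROUP BY for the grand total.

```sql
select p.CustomerID,
       isnull(p.Product1Amount, 0) as Product1Amount,
       t.TotalAmount
from
    -- Product #1 total; GROUP BY ALL keeps customers with no Product #1 sales
    (select CustomerID, sum(Amount) as Product1Amount
     from Sales
     where ProductID = 1
     group by all CustomerID) p
inner join
    -- grand total across all products
    (select CustomerID, sum(Amount) as TotalAmount
     from Sales
     group by CustomerID) t on t.CustomerID = p.CustomerID
```

This should return the same three rows as the SUM(CASE…) version, but it needs two passes over Sales and a join to get there, which is why the CASE approach is usually the better choice.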

. . .

So, that’s the story with GROUP BY ALL. It is interesting, and not widely known, and may even make for a good interview question if you really want to see how much SQL a candidate knows.  But for practical purposes, it is pretty rarely used and there are generally better ways to get the same results more easily or more efficiently.

Anyone have a good situation or an example of where GROUP BY ALL really worked well for you?  Be sure to share your experiences in the comments.

UNPIVOT: Normalizing data on the fly

Everyone seems to want to “pivot” or “cross tab” data, but knowing how to do the opposite is equally important.  In fact, I would argue that the skill of “unpivoting” data is more useful, more important and more relevant to a SQL programmer, since pivoting results in denormalized data, while unpivoting can transform non-normalized data into a normalized result set.  We all know that there are lots of bad database designs out there, so this can be a handy technique to know.

Of course, even a well designed, fully normalized database can still benefit from “unpivoting” from time to time, so let’s take a look at some common situations and some of the options at our disposal for handling them.  We will focus on some traditional SQL techniques to do this, and then take a close look at the UNPIVOT operator that was introduced with SQL Server 2005.

Example #1:  A Bad database design

Let’s start with a common bad table design, in which someone has decided to relate a client to multiple contacts by designing their client table like this:

create table Clients
(  
    clientID int primary key,
    clientName varchar(100),
    contact1 int,
    contact2 int,
    contact3 int,
    contact4 int
)




insert into Clients
select 1,'ABC Corp',1,34,2,null union all
select 2,'DEF Foundation',6,2,8,9 union all
select 3,'GHI Inc.',5,9,null,null union all
select 4,'XYZ Industries',24,null,6,null



clientID    clientName           contact1    contact2    contact3    contact4

----------- -------------------- ----------- ----------- ----------- -----------

1           ABC Corp             1           34          2           NULL

2           DEF Foundation       6           2           8           9

3           GHI Inc.             5           9           NULL        NULL

4           XYZ Industries       24          NULL        6           NULL



(4 row(s) affected)

(Note: For brevity, I am not including the contact table here, nor the foreign key constraints.  Of course, with this table design, it would probably be pretty unlikely to find such constraints in the database anyway.)


With this design, it is not very easy or efficient to get a count of all contacts for each client, or to find out which contacts are related to which clients.   One thing we can do, however, is to “unpivot” this table in a query that returns 1 row per ClientID/ContactID combination.  With that result set, we can now easily reference the table as if it were normalized and we can get the information we need.

One way to do this is to use UNION ALL to return each row in the clients table 4 times, and each time return a different contactID column:

select clientID, contact1 as ContactID

from clients

where contact1 is not null

union all

select clientID, contact2 as ContactID

from clients

where contact2 is not null

union all

select clientID, contact3 as ContactID

from clients

where contact3 is not null

union all

select clientID, contact4 as ContactID

from clients

where contact4 is not null



clientID    ContactID

----------- -----------

1           1

2           6

3           5

4           24

1           34

2           2

3           9

1           2

2           8

4           6

2           9



(11 row(s) affected)


Another option is to CROSS JOIN the Clients table with a table or resultset that returns 4 rows, which also effectively returns each row in the clients table 4 times.  For each of the 4 values in the table we are cross joining, we grab a different contact column:

select *

from

(

    select c.clientID,

       case n.n when 1 then c.contact1

            when 2 then c.contact2

            when 3 then c.contact3

            when 4 then c.contact4 end as ContactID

    from

        clients c

    cross join

        (select 1 as n union all select 2 union all select 3 union all select 4) n

)   

    x

where

    x.ContactID is not null



clientID    ContactID

----------- -----------

1           1

1           34

1           2

2           6

2           2

2           8

2           9

3           5

3           9

4           24

4           6



(11 row(s) affected)


(Note that you can use a permanent table of Numbers in your database instead of generating it on the fly with a UNION, as shown)
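For example, assuming your database has a permanent helper table (hypothetically named Numbers here, with an int column n holding 1, 2, 3 and so on), the same query could be sketched like this:

```sql
select x.clientID, x.ContactID
from
(
    select c.clientID,
           case n.n when 1 then c.contact1
                    when 2 then c.contact2
                    when 3 then c.contact3
                    when 4 then c.contact4 end as ContactID
    from clients c
    cross join Numbers n          -- Numbers(n) is an assumed helper table
    where n.n between 1 and 4     -- one pass for each contact column
) x
where x.ContactID is not null
```

The only real change is that the four-row UNION has been swapped for a filter on the Numbers table.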

Finally, however, there is an even easier way to handle this: the UNPIVOT operator, new with SQL 2005.  UNPIVOT works very efficiently and really allows you to handle this exact situation quite easily:

select clientID, Contact.ContactID

from clients

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



clientID    ContactID

----------- -----------

1           1

1           34

1           2

2           6

2           2

2           8

2           9

3           5

3           9

4           24

4           6



(11 row(s) affected)

Much shorter to write, and more efficient to execute as well. 

Taking a Closer Look at UNPIVOT

The UNPIVOT operator is tricky to get a feel for, however, so let’s take a look at it.

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



First, the “As Contact” at the end is just labeling the entire unpivot result set with an alias, just as you must alias a derived table.  Each column returned by the unpivot operator can be referenced by the alias if necessary.

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact

The “ContactID for” part says that we want to return a column called “ContactID” for each unpivoted row.  The IN() list is the columns that we are unpivoting; the values in the 4 columns listed here will be assigned to the ContactID column in the result.  So, the first time a particular row is unpivoted, the value of the ‘contact1′ column is assigned to ContactID, the next time it is the ‘contact2′ column, then ‘contact3′, and then finally ‘contact4′.  Then, the next row is processed and it all begins again. 

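Thus, because we are unpivoting 4 columns, the result of the unpivot can have up to 4 times as many rows as the source data.  UNPIVOT leaves out any row where the unpivoted value is NULL, which is why our 4 clients produced 11 rows above rather than 16.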

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact

UNPIVOT returns an additional column as well, which contains the name of the column that was used to produce each unpivoted row.   Here, we have specified that to be called ContactNumber.  Note that we actually did not return ContactNumber in our example, but we can easily add that in so you can see how it works:

select clientID, Contact.ContactNumber, Contact.ContactID

from clients

unpivot (ContactID for ContactNumber in (contact1, contact2,contact3,contact4)) as Contact



clientID    ContactNumber           ContactID

----------- ----------------------- -----------

1           contact1                1

1           contact2                34

1           contact3                2

2           contact1                6

2           contact2                2

2           contact3                8

2           contact4                9

3           contact1                5

3           contact2                9

4           contact1                24

4           contact3                6



(11 row(s) affected)

So, you can see that the code to write is very short, but a little difficult to grasp at first.  In the end, though, we are able to take a bad table design and easily “fix it”, at least temporarily, so that we can query it using simple and standard SQL statements to get what we need.

Example #2:  Normalizing a Transaction Table

Here’s another common example:

create table Transactions
(


    TranDate datetime,

    Account varchar(10),

    BudgetAmount money,

    ActualAmount money,

    ProjectionAmount money,

    primary key (TranDate, Account)

)

go

insert into Transactions

select '2008-01-01','0001',354,65,58 union all

select '2008-01-02','0001',14,65,34 union all

select '2008-01-03','0001',0,65,622 union all

select '2008-01-04','0001',9,32,84

go



TranDate                Account    BudgetAmount          ActualAmount          ProjectionAmount

———————– ———- ——————— ——————— ———————

2008-01-01 00:00:00.000 0001       354.00                65.00                 58.00

2008-01-02 00:00:00.000 0001       14.00                 65.00                 34.00

2008-01-03 00:00:00.000 0001       0.00                  65.00                 622.00

2008-01-04 00:00:00.000 0001       9.00                  32.00                 84.00



(4 row(s) affected)

Notice that we have separate columns for Budget, Actual and Projection, which is not really a great database design.  It would be much better to break this data out so that we have a single ‘Amount’ column and a ‘TransactionType’ column (called simply Type in the queries here) that specifies the type of each transaction.  We can transform our Transactions table into this format using UNION ALL:

select TranDate, Account, 'BudgetAmount' as Type, BudgetAmount as Amount from transactions

union all

select TranDate, Account, 'ActualAmount' as Type, ActualAmount as Amount from transactions

union all

select TranDate, Account, 'ProjectionAmount' as Type, ProjectionAmount as Amount from transactions



TranDate                Account    Type             Amount

———————– ———- —————- ———————

2008-01-01 00:00:00.000 0001       BudgetAmount     354.00

2008-01-02 00:00:00.000 0001       BudgetAmount     14.00

2008-01-03 00:00:00.000 0001       BudgetAmount     0.00

2008-01-04 00:00:00.000 0001       BudgetAmount     9.00

2008-01-01 00:00:00.000 0001       ActualAmount     65.00

2008-01-02 00:00:00.000 0001       ActualAmount     65.00

2008-01-03 00:00:00.000 0001       ActualAmount     65.00

2008-01-04 00:00:00.000 0001       ActualAmount     32.00

2008-01-01 00:00:00.000 0001       ProjectionAmount 58.00

2008-01-02 00:00:00.000 0001       ProjectionAmount 34.00

2008-01-03 00:00:00.000 0001       ProjectionAmount 622.00

2008-01-04 00:00:00.000 0001       ProjectionAmount 84.00



(12 row(s) affected)

Or, we can use the UNPIVOT operator to do the same thing much more easily:

select TranDate, Account, Type, Amount

from Transactions

unpivot (Amount for Type in (BudgetAmount, ActualAmount, ProjectionAmount)) as Amount


TranDate                Account   Type                     Amount

———————– ——— ———————— ———————

2008-01-01 00:00:00.000 0001      BudgetAmount             354.00

2008-01-01 00:00:00.000 0001      ActualAmount             65.00

2008-01-01 00:00:00.000 0001      ProjectionAmount         58.00

2008-01-02 00:00:00.000 0001      BudgetAmount             14.00

2008-01-02 00:00:00.000 0001      ActualAmount             65.00

2008-01-02 00:00:00.000 0001      ProjectionAmount         34.00

2008-01-03 00:00:00.000 0001      BudgetAmount             0.00

2008-01-03 00:00:00.000 0001      ActualAmount             65.00

2008-01-03 00:00:00.000 0001      ProjectionAmount         622.00

2008-01-04 00:00:00.000 0001      BudgetAmount             9.00

2008-01-04 00:00:00.000 0001      ActualAmount             32.00

2008-01-04 00:00:00.000 0001      ProjectionAmount         84.00



(12 row(s) affected)

Example #3: “Unsummarizing” Data

If you work with accounting systems, this example may look familiar.  Accounting systems often have “summary” tables that roll up transactional data into a structure like this:

create table AccountBalances

(

    CompanyID int,

    AccountID int,

    TransactionTypeID int,

    Year int,

    Period1 money,

    Period2 money,

    Period3 money,

    Period4 money,

    Period5 money,

    Period6 money,

    Period7 money,

    Period8 money,

    Period9 money,

    Period10 money,

    Period11 money,

    Period12 money

)



insert into AccountBalances

select 1,1,1,2008,200,300,400,500,400,0,0,0,0,0,0,0 union all

select 1,2,1,2008,100,100,100,100,100,100,100,0,0,0,0,0 union all

select 1,3,1,2008,150,0,50,10,10,200,400,45,0,0,0,0

(As before, let’s not worry about those foreign key constraints)

These tables are often calculated when transactions are posted, or periods are closed.   Typically, many reports pull from these tables because it is much more efficient than summarizing thousands or millions of transactions, and the data is already “cross-tabbed” the way most reporting tools would like to display it.

We can take this summarized data and “unpivot” it so we can still access the summarized data, but now it will be in a normalized structure.  All it takes is a simple UNPIVOT like this:

select
       CompanyID,
       AccountID,
       TransactionTypeID,
       Year,
       substring(Period,7,2) as PeriodNo,
       Amount


from
       AccountBalances


unpivot
  (Amount for Period in (Period1,Period2,Period3,Period4,Period5,Period6,
                         Period7,Period8,Period9,Period10,Period11,Period12)
   ) as Amount




CompanyID   AccountID   TransactionTypeID Year        PeriodNo Amount

———– ———– —————– ———– ——– ———————

1           1           1                 2008        1        200.00

1           1           1                 2008        2        300.00

1           1           1                 2008        3        400.00

1           1           1                 2008        4        500.00

1           1           1                 2008        5        400.00

1           1           1                 2008        6        0.00

1           1           1                 2008        7        0.00

1           1           1                 2008        8        0.00

1           1           1                 2008        9        0.00

1           1           1                 2008        10       0.00

1           1           1                 2008        11       0.00

1           1           1                 2008        12       0.00

1           2           1                 2008        1        100.00

1           2           1                 2008        2        100.00

1           2           1                 2008        3        100.00

1           2           1                 2008        4        100.00

1           2           1                 2008        5        100.00

1           2           1                 2008        6        100.00

1           2           1                 2008        7        100.00

1           2           1                 2008        8        0.00

1           2           1                 2008        9        0.00

1           2           1                 2008        10       0.00

1           2           1                 2008        11       0.00

1           2           1                 2008        12       0.00

1           3           1                 2008        1        150.00

1           3           1                 2008        2        0.00

1           3           1                 2008        3        50.00

1           3           1                 2008        4        10.00

1           3           1                 2008        5        10.00

1           3           1                 2008        6        200.00

1           3           1                 2008        7        400.00

1           3           1                 2008        8        45.00

1           3           1                 2008        9        0.00

1           3           1                 2008        10       0.00

1           3           1                 2008        11       0.00

1           3           1                 2008        12       0.00



(36 row(s) affected)

We can filter so that only periods with a non-zero amount are included, and we can SELECT FROM this result set and get the exact data we need for whatever date range we want without worrying which column the actual data is in.
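As a rough sketch of that filtering, we can wrap the UNPIVOT in a derived table and then treat it like any normalized table.  The derived table alias (x), the unpivot alias (u), the cast to int and the “periods 1 through 6” range are just assumptions for illustration; they are not part of the original query.

```sql
-- Sketch: filter the unpivoted balances like any other normalized table.
-- The aliases x and u and the period range 1-6 are illustrative assumptions.
select CompanyID, AccountID, TransactionTypeID, Year, PeriodNo, Amount
from
(
    select CompanyID, AccountID, TransactionTypeID, Year,
           -- 'Period10' -> 10; casting lets us compare periods numerically
           cast(substring(Period, 7, 2) as int) as PeriodNo,
           Amount
    from AccountBalances
    unpivot
      (Amount for Period in (Period1,Period2,Period3,Period4,Period5,Period6,
                             Period7,Period8,Period9,Period10,Period11,Period12)
      ) as u
) as x
where Amount <> 0                -- only periods with a non-zero amount
  and PeriodNo between 1 and 6   -- only the period range we care about
order by CompanyID, AccountID, PeriodNo
```

Notice that the WHERE clause never needs to know which of the 12 period columns the data originally lived in.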

This can also be done with a CROSS JOIN or a UNION ALL, but with 12 columns to unpivot, those options become much longer to write, and UNPIVOT appears to be the way to go in this case.
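For comparison, here is a sketch of what the CROSS JOIN version might look like.  The inline “numbers” derived table (aliased p) and the alias b are assumptions made for this example; a permanent numbers table would work just as well if you have one.

```sql
-- Sketch only: the same result written with a CROSS JOIN instead of UNPIVOT.
-- p is a hypothetical inline numbers table holding the values 1 through 12.
select b.CompanyID, b.AccountID, b.TransactionTypeID, b.Year,
       p.PeriodNo,
       case p.PeriodNo
            when 1  then b.Period1
            when 2  then b.Period2
            when 3  then b.Period3
            when 4  then b.Period4
            when 5  then b.Period5
            when 6  then b.Period6
            when 7  then b.Period7
            when 8  then b.Period8
            when 9  then b.Period9
            when 10 then b.Period10
            when 11 then b.Period11
            when 12 then b.Period12
       end as Amount
from AccountBalances b
cross join
(
    select 1 as PeriodNo union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9 union all
    select 10 union all select 11 union all select 12
) as p
```

One difference to keep in mind: this version returns a row for every period even when the amount is NULL, while UNPIVOT leaves NULL values out of its result.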

Example #4: Multiple unpivots

Finally, let’s consider a more complicated example.  Here, we have a table that stores games played between two teams, where one team is the HomeTeam and the other is the AwayTeam:

create table Teams

(

    TeamCode char(3) primary key not null,

    TeamName varchar(100) not null

)



create table Games

(

    GameDate datetime,

    HomeTeam char(3) references Teams(TeamCode),

    AwayTeam char(3) references Teams(TeamCode),

    HomeScore int,

    AwayScore int,

    primary key (GameDate, HomeTeam),

    constraint pk2 unique (GameDate, AwayTeam),

    check (HomeTeam <> AwayTeam)

)



insert into Teams

select 'BOS','Boston Red Sox' union all

select 'NYY','New York Yankees'



insert into Games

select '2008-04-01','BOS','NYY',3,1 union all

select '2008-04-02','BOS','NYY',6,4 union all

select '2008-04-03','BOS','NYY',2,3 union all

select '2008-04-08','NYY','BOS',6,0 union all

select '2008-04-09','NYY','BOS',2,6 union all

select '2008-04-10','NYY','BOS',1,10



GameDate                HomeTeam AwayTeam HomeScore   AwayScore

———————– ——– ——– ———– ———–

2008-04-01 00:00:00.000 BOS      NYY      3           1

2008-04-02 00:00:00.000 BOS      NYY      6           4

2008-04-03 00:00:00.000 BOS      NYY      2           3

2008-04-08 00:00:00.000 NYY      BOS      6           0

2008-04-09 00:00:00.000 NYY      BOS      2           6

2008-04-10 00:00:00.000 NYY      BOS      1           10



(6 row(s) affected)

That may or may not be the best possible design for the Games table, but it is a common way to do it.  With data in that form, if we want to get the total runs scored per team across all games, or get each team's won-loss record, we need to make two passes through the table.  This can be done fairly easily and efficiently with a union:

select GameDate,

    HomeTeam as TeamCode,

    'Home' as HomeOrAway,

    HomeScore as Score,

    case when HomeScore > AwayScore then 1 else 0 end as Win,

    case when HomeScore < AwayScore then 1 else 0 end as Loss,

    case when HomeScore = AwayScore then 1 else 0 end as Tie

from

    Games

union all

select GameDate,

    AwayTeam as TeamCode,

    'Away' as HomeOrAway,

    AwayScore as Score,

    case when HomeScore < AwayScore then 1 else 0 end as Win,

    case when HomeScore > AwayScore then 1 else 0 end as Loss,

    case when HomeScore = AwayScore then 1 else 0 end as Tie

from

    Games

   

GameDate                TeamCode HomeOrAway Score       Win         Loss        Tie

———————– ——– ———- ———– ———– ———– —–

2008-04-01 00:00:00.000 BOS      Home       3           1           0           0

2008-04-02 00:00:00.000 BOS      Home       6           1           0           0

2008-04-03 00:00:00.000 BOS      Home       2           0           1           0

2008-04-08 00:00:00.000 NYY      Home       6           1           0           0

2008-04-09 00:00:00.000 NYY      Home       2           0           1           0

2008-04-10 00:00:00.000 NYY      Home       1           0           1           0

2008-04-01 00:00:00.000 NYY      Away       1           0           1           0

2008-04-02 00:00:00.000 NYY      Away       4           0           1           0

2008-04-03 00:00:00.000 NYY      Away       3           1           0           0

2008-04-08 00:00:00.000 BOS      Away       0           0           1           0

2008-04-09 00:00:00.000 BOS      Away       6           1           0           0

2008-04-10 00:00:00.000 BOS      Away       10          1           0           0



(12 row(s) affected)
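Once the data is in that one-row-per-team-per-game shape, the totals mentioned above are a simple GROUP BY away.  Here is a minimal sketch; the derived table alias (TeamGames) and the output column names are assumptions for illustration.

```sql
-- Sketch: total runs and won-loss record per team, built on the UNION ALL.
-- TeamGames, TotalRuns, Wins, Losses and Ties are illustrative names.
select TeamCode,
       sum(Score) as TotalRuns,
       sum(Win)   as Wins,
       sum(Loss)  as Losses,
       sum(Tie)   as Ties
from
(
    -- each game from the home team's point of view
    select HomeTeam as TeamCode,
           HomeScore as Score,
           case when HomeScore > AwayScore then 1 else 0 end as Win,
           case when HomeScore < AwayScore then 1 else 0 end as Loss,
           case when HomeScore = AwayScore then 1 else 0 end as Tie
    from Games
    union all
    -- the same game from the away team's point of view
    select AwayTeam as TeamCode,
           AwayScore as Score,
           case when HomeScore < AwayScore then 1 else 0 end as Win,
           case when HomeScore > AwayScore then 1 else 0 end as Loss,
           case when HomeScore = AwayScore then 1 else 0 end as Tie
    from Games
) as TeamGames
group by TeamCode
```

With the sample data above, that should give BOS 27 runs with a 4-2 record and NYY 17 runs with a 2-4 record.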

Now, how can we do this with UNPIVOT?  This example is a bit more complicated, because we are “unpivoting” not only the score, but also the TeamCode, and we are also calculating a few extra columns (Win, Loss and Tie).

So, can we UNPIVOT more than 1 column?  Let’s start simply and do things one at a time.  First, let’s UNPIVOT just the TeamCode:

select GameDate, HomeOrAway, Team

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team



GameDate                HomeOrAway     Team

———————– ————– —-

2008-04-01 00:00:00.000 HomeTeam       BOS

2008-04-01 00:00:00.000 AwayTeam       NYY

2008-04-02 00:00:00.000 HomeTeam       BOS

2008-04-02 00:00:00.000 AwayTeam       NYY

2008-04-03 00:00:00.000 HomeTeam       BOS

2008-04-03 00:00:00.000 AwayTeam       NYY

2008-04-08 00:00:00.000 HomeTeam       NYY

2008-04-08 00:00:00.000 AwayTeam       BOS

2008-04-09 00:00:00.000 HomeTeam       NYY

2008-04-09 00:00:00.000 AwayTeam       BOS

2008-04-10 00:00:00.000 HomeTeam       NYY

2008-04-10 00:00:00.000 AwayTeam       BOS



(12 row(s) affected)

OK, so far so good.  Now, how do we get the score for the team as well?  Let’s add another UNPIVOT clause to the SELECT, this time for Score:

select GameDate, HomeOrAway, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway in (HomeScore, AwayScore)) as Score



Msg 265, Level 16, State 1, Line 1

The column name “HomeOrAway” specified in the UNPIVOT operator conflicts with the existing column name in the UNPIVOT argument.

Msg 8156, Level 16, State 1, Line 1

The column ‘HomeOrAway’ was specified multiple times for ‘Score’.

Hmmm. OK, we cannot use the same column name in both unpivots.  That name is just an alias, so no big deal; let’s call the second one “HomeOrAway2”:

select GameDate, HomeOrAway, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway2 in (HomeScore, AwayScore)) as Score



GameDate                HomeOrAway Team Score

———————– ———- —- ———–

2008-04-01 00:00:00.000 HomeTeam   BOS  3

2008-04-01 00:00:00.000 HomeTeam   BOS  1

2008-04-01 00:00:00.000 AwayTeam   NYY  3

2008-04-01 00:00:00.000 AwayTeam   NYY  1

2008-04-02 00:00:00.000 HomeTeam   BOS  6

2008-04-02 00:00:00.000 HomeTeam   BOS  4

2008-04-02 00:00:00.000 AwayTeam   NYY  6

2008-04-02 00:00:00.000 AwayTeam   NYY  4

2008-04-03 00:00:00.000 HomeTeam   BOS  2

2008-04-03 00:00:00.000 HomeTeam   BOS  3

2008-04-03 00:00:00.000 AwayTeam   NYY  2

2008-04-03 00:00:00.000 AwayTeam   NYY  3

2008-04-08 00:00:00.000 HomeTeam   NYY  6

2008-04-08 00:00:00.000 HomeTeam   NYY  0

2008-04-08 00:00:00.000 AwayTeam   BOS  6

2008-04-08 00:00:00.000 AwayTeam   BOS  0

2008-04-09 00:00:00.000 HomeTeam   NYY  2

2008-04-09 00:00:00.000 HomeTeam   NYY  6

2008-04-09 00:00:00.000 AwayTeam   BOS  2

2008-04-09 00:00:00.000 AwayTeam   BOS  6

2008-04-10 00:00:00.000 HomeTeam   NYY  1

2008-04-10 00:00:00.000 HomeTeam   NYY  10

2008-04-10 00:00:00.000 AwayTeam   BOS  1

2008-04-10 00:00:00.000 AwayTeam   BOS  10



(24 row(s) affected)

Holy schnikies, it works!  We can specify more than one UNPIVOT clause for the same SQL statement!  Who would have thunk it?

Uh oh, wait a second.  We have 24 results returned.  We should have only 12.  Something is not right here.  Remember when we aliased our second unpivot column as “HomeOrAway2”?  We did not return that anywhere in our results.  Let’s add that in and take a look:

select GameDate, HomeOrAway, HomeOrAway2, Team, Score

from Games

unpivot (Team for HomeOrAway in (HomeTeam, AwayTeam)) as Team

unpivot (Score for HomeOrAway2 in (HomeScore, AwayScore)) as Score



GameDate                HomeOrAway HomeOrAway2 Team Score

———————– ———- ———– —- ———–

2008-04-01 00:00:00.000 HomeTeam   HomeScore   BOS  3

2008-04-01 00:00:00.000 HomeTeam   AwayScore   BOS  1

2008-04-01 00:00:00.000 AwayTeam   HomeScore   NYY  3

2008-04-01 00:00:00.000 AwayTeam   AwayScore   NYY  1

2008-04-02 00:00:00.000 HomeTeam   HomeScore   BOS  6

2008-04-02 00:00:00.000 HomeTeam   AwayScore   BOS  4

2008-04-02 00:00:00.000 AwayTeam   HomeScore   NYY  6

2008-04-02 00:00:00.000 AwayTeam   AwayScore   NYY  4

2008-04-03 00:00:00.000 HomeTeam   HomeScore   BOS  2

Minimize a DropDownList’s ViewState

Let’s say you have a very large DropDownList with lots of values and text.  We need to maintain ViewState in this DropDownList so that we can retrieve the selected value on a post back.   Of course, this means that now the ViewState contains the data for every single value in the list, both values and text included.   Even though the page itself may be fairly simple and lightweight, the result of having this simple DropDownList on the page is that the page size is quite large and the amount of data passed back and forth on a postback is very large as well.

If you have an efficient database, and/or if you are caching the data anyway, you might not mind re-loading the list items each time the page posts back to eliminate the need for the page itself to hold all of this ViewState data. 

However, if you turn off ViewState on the DropDownList, you will notice that it now does not remember the selected value on post backs.  The solution is very simple — just manually set the DropDownList after re-loading it to the value from the HTML form post.

To do this, first we disable ViewState on the DropDownList (EnableViewState="False").  Then, instead of a typical Page_Load() method like this:

 

        protected void Page_Load(object sender, EventArgs e)

        {

            if (!IsPostBack)

            {

                loadList();              

                // other stuff here

            }

        }

      
we would load the list every time the page posts back:

        protected void Page_Load(object sender, EventArgs e)

        {

            loadList();     

 

            if (!IsPostBack)

            {

                 // other stuff here

            }

        }

       
And, in that loadList() method, instead of simply doing this:

        void loadList()

        {

            dlTest.DataSource = GetListData(); // GetListData() is a hypothetical helper: get data from DB

            dlTest.DataBind();

        }

       
you also would check to see if it has a posted value:

        void loadList()

        {

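            // GetListData() is a hypothetical helper that gets the data from the DB
            dlTest.DataSource = GetListData();

            dlTest.DataBind();

 

            // With ViewState off, restore the selection from the posted form value
            string r = Request[dlTest.UniqueID];

            // Only select the value if it is still in the list;
            // setting SelectedValue to a value that is not in the list raises an error
            if (r != null && dlTest.Items.FindByValue(r) != null)

                dlTest.SelectedValue = r;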

        }

That loads the DropDownList each time and ensures that the value set is what was posted back.  Thus, the control will now “remember” the selected value on each postback, but without requiring any data stored in the ViewState at all. 

It would also be easy to implement this logic in a control that inherits from DropDownList; say, a LightDropDownList.  Techniques like this will of course work on other controls as well, such as a ListBox or a CheckBoxList.

With fast database connections and performance, as well as data caching, sometimes persisting data on the page is not the optimal way to go.  As always, there are different ways to skin a cat, and sometimes simple little tweaks like this can have huge benefits on performance.

Update:  See the comments for other options that are even easier/better.  Also in the comments, Richard has provided a link to this excellent post on ViewState that I highly recommend checking out.  It’s long, but very informative and quite entertaining as well.   Thanks for the great feedback.