Archive for June, 2008

The Golden Rule of Data Manipulation

Introduction

There is a very simple rule when it comes to storing (and returning) data, which I see violated all the time, making life so much more complicated for everyone involved.  In case you haven’t noticed, that’s a common theme I discuss here on this blog — different ways programmers make life more difficult for themselves, instead of simply following good practices and doing things the easy way.  This is yet another example of that situation.

The “Golden Rule of Data Manipulation” is a simple, but important rule that you should always follow when designing a database,  writing database code, or really writing any application code at all for that matter:

“It is always easier and more flexible to combine data elements rather than to break them apart”

In other words: Concatenation is easy. Parsing is hard.  Often, very hard — or even impossible depending on the data.

Problems with Parsing

It is amazing how often I see people struggling with “difficult SQL problems” such as:

  • Working with CSV lists of values in a single column, such as “1,3,56,2″
  • Breaking out a FirstName/MiddleName/LastName/Suffix from a single “Name” column
  • Parsing address strings into City/State/ZIP, or Number/Street/Unit
  • Parsing Phone Numbers to get just an area code, or to take different phone formats and present them all uniformly
  • Figuring out how to calculate the Day,Month, and/or Year from different string values such as “23-Jan-08″, “2008-02″, “20070303″, “03032007″

And on and on it goes….

Now, sometimes you inherit or import data that needs to be parsed — that’s a fact of life.  You’ve got to figure out how to do it, and the key in those cases is to accept that because the data itself is essentially random, nothing you can write will perfectly work 100% of the time on all of it.  Often, the best you can do is handle most of the data, and then do some manual clean up. 

Parsing strings can be a very difficult task for any programmer, and the challenge isn’t writing the code, it’s coming up with the algorithm (another common theme on this blog).  Consider my new favorite example of why parsing a single Name column into a First/Middle/Last is not as easy as it seems:

    Oscar De La Hoya

How would your algorithm parse that one?  Never mind prefixes such as “Dr.” and suffixes such as “Jr.”! 

Please don’t interpret what I am saying as a programming challenge — I understand that it is possible to write long code with a list of exceptions or rules and have that algorithm work pretty well in most cases.  The point is that writing that algorithm is a lot of work, running it will be inefficient, and it will never be exact because the data itself that is being processed is essentially random.  It’s just like the old saying: “garbage in, garbage out”.  Still one of my favorites, after all these years, and it still applies!

A Data Model that requires Parsing = A Poor Data Model

So, we need to accept that sometimes you’ve got to parse data like this.  And that’s OK; it happens, it can be done, even if some manual work is often involved.

However, there’s no excuse when you design your database, your SQL code, or applications so that free-form data must be parsed, when you can simply design it correctly in the first place and store your data already broken out into the smallest possible units with the correct data types.

If breaking out a contact’s name into First, Last, Middle, etc is important to your application, then you should force the point of data entry to accept input broken out into those columns.  The same goes for phone numbers, addresses, and so on.  Any time you have the option of accepting  input as clean, short, raw, segments of data you should always do it.   Once you have data at that smaller resolution, it is trivial to combine it any way that you want for presentation, formatting, filtering, and so on.

It may seem like overkill to break out a phone number into 4 columns:

AreaCode
Exchange
Number
Extension

   
And, in fact, it might be more complicated than that if you need to deal with international phone numbers.  You may look at your tables, and your code, and even the UI that accepts these fields and think “that is way too precise and unnecessary, breaking out phone numbers like this sure makes things complicated!”

But by doing this, and only accepting user input that follows precise rules of what is allowed in these fields, and storing each of them in their own column, you can now easily and efficiently:

  1. Sort these numbers any way you want, without worrying about extra characters like parenthesis or dashes, or leading 1s, messing things up
  2. Filter quickly on an area code without the need to use LIKE, and again worrying about extra characters getting in the way
  3. Present the phone number quickly and easily any way you want without any parsing, be it as 123.123.1345 x123 or “(123) 123-1345 extension 123″, or anything you want.
  4. Validate your phone numbers, ensuring you have all the necessary parts and they are the proper length, without worrying about parsing strings

Considering doing any of those things if your data is stored in random strings like:

1-123-124-1234
(123) 124-1234
123-124-1234, ex. 123
123.124.1234 x123
(123)124.1234 ext. 123
1 123 124 1234 123

   
and so on …  Not so easy in that case, just as parsing simple “Name” columns into First/Last, or addresses into  Number/Street/Unit is not so easy as well. 

Again, this is not a programming challenge — I am sure it can done. (In fact, phone numbers are generally the easiest because you can usually just ignore anything other than digits.)  Most of us have done it before.  But designing something in such a way that parsing is required to do simple filtering, sorting, or formatting, is a bad design

It’s Not About the UI

As I wrote here, you should never think “I want to display phone numbers like 123.123.1234, so I should store them and return them that way.”  You should always think “How can I break this down into small, concrete parts that are easily validated and easy to combine any way I want at any time?”

So, what if you need fine detail when storing addresses, but you don’t want your UI to present Street Number, Street Name, Unit Type, Unit Number as different data entry fields for usability or aesthetic reasons?  That’s fine, but that doesn’t mean you should not set up your database properly.  Your UI can certainly still present that one single “Address” text box for the user to fill out, parse that text at data entry, show the user the parsed result in multiple fields, and ask “Please verify for your address” or something along those lines.   Then, if not, the user can tweak the results and save it.  If you do things along those lines, and focus on getting the data parsed and stored correctly at the earliest point possible, every other part of your code will be that much more efficient.

All of this applies not only to data storage, but to how data is returned and passed between tiers as well.  Again, if you just return separate columns to your client application, instead of focusing on making them “look nice” in your database code by returning nothing but long, “pre-formatted” strings, your client can simply concatenate and format those columns any way it needs.  And, different clients can format that same database output in different ways — all without ever altering any database code! 

Conclusion

In short, remember that writing concatenation is easy, efficient, and exact.  Writing a parsing routine, on the other hand, is often none of those things.   You may not always be able to control the design of the data you are working with, but be sure that when you can, you do it right.  If you find yourself using lots of LIKE expressions, or string parsing for simple data retrieval operations, something is wrong.   Time to fix up your database and your code, store the parsed and validated data permanently, and make things easier and cleaner for everyone.

Whether you are designing a schema, writing a SELECT, or writing code in any other programming language, remember that the Golden Rule of Data Manipulation always applies.  Accept this rule, learn from it, and practice it, and you might be surprised to find that programming isn’t quite as hard as you thought it was.

The Joy of Blog Feedback

Introduction

I have been writing my little blog here for some time now, and my favorite part of doing this is of course the feedback.  It’s always great to hear from the readers, to have mistakes corrected, to debate various topics and techniques, and to learn a lot about SQL and the various topics I discuss here. 

At this point, I have received over 1,700 comments over the years, and while all of them are truly appreciated, I have noticed that unfortunately many of the, uh, less helpful comments do seem to consistently fall neatly into various categories. 

Let’s take a look at an example of a simple, typical blog post and some of the responses that often come back.  If you write a blog of your own, or often read the feedback from other blogs, many of these may seem familiar to you. 

A Typical Blog Post


Today, I have a simple tip for beginner SQL programmers.  When writing a SELECT, you can add a WHERE clause to filter the results that are returned.

For example, to only return rows for CustomerID 345, you can write:

SELECT …
FROM YourTable
WHERE CustomerID = 345

As you can see, it is very simple. You can use any boolean expression to filter the results as needed. Try it out!  If you have any questions, let me know.

Some Typical Responses

The subtle blog spammer (that you initially mistake for a nice complement):

Very helpful site! Good advice!  From Joe at www.some-random-site-that-has-nothing-to-do-with-sql.com

The person who doesn’t seem to get it:

Ummm … what good does this do when I want to sort? You should fix the code.

The person that really doesn’t get it:

The problem with that is it will only return results for one customer.

The person that somehow takes away the exact opposite of what you wrote:

I disagree, this will not return all customers other than 345 and this is definitely not something for advanced SQL Programmers, it’s probably better for beginners.

The script kiddie (who just wants to cut and paste your code, not read or learn anything):

LOL, that doesnt even run 4 me!  I get errorz that sez “YourTable” does not exist!  Plz help!!  thnx!

The very clichéd, mindless “anti-Microsoft ranter”:

You only have to use WHERE clauses because Bill Gates wants more $$, you are a shill!! Micro$oft sucks, you should use an iPhone for this!  MySQL automagically filters results for you!

The “skimmer” (who just skims the post missing most of it):

Nice, but is there any way to filter for just one customer?

The “repeater” (who just repeats what you’ve already written):

A better solution is to write WHERE CustomerID = 345, it works better.  It is also fast because less rows are returned.  Using WHERE is a good way to filter a SELECT.

The “know-it-all complainer”:

That is the stupidest advice I ever read, why would you want to ever do this? Just use a parameter, or an ORM tool– this will not scale!  I sure hope CustomerID isn’t a VARCHAR — then you have an implicit conversion happening, your indexes are shot, your server will overheat, and your wife will leave you for your mechanic.  Also, 345 is too large if CustomerID is a tinyint.

The random, unrelated question asker:

Good advice. Thnx.  How to insert into the table?

The “misunderstander”:

If I add this to all of my scripts, only data for one customer will ever be returned.  I am not sure this is a good idea. Also, this code will not work in Java and doesn’t follow the HTML 4.0 specification. 

The very rare polite and helpful typo alerter:

Hey there, you have a typo in the first sentence — should be “filter”, not “fitler”! Just letting you know, thanks for a great post!

The much more common typo alerter:

You wrote FITLER not FILTER, your an idiot!! if you cannot write English how can you write SQL ???   Learn to spell!

Summary

Please, don’t misunderstand, I mean this all in good fun.  I love feedback, and please, keep it coming.  It’s what makes this and every other blog a fun place to visit.

In fact, I realize that I left out the most annoying feedback of all!  That’s right, the Thin-Skinned, Overly-Defensive Blog Author Who Feels the Need to Respond to Everything:

Did you even read what I wrote? I did not say that.  And, yes, I did spell “monkey” wrong, so sue me!  Remind me to fire my editor…. or maybe I should refund your subscription fee?  Oh, wait, this blog is free!  So what the heck are you complaining about?  Why don’t you go bother some MySQL blogger?  I hear they usually write at a 5th grade level which is probably more appropriate for your intellect. Jerk!

Yeah, comments like those are definitely the worst of all!  Thank you for putting up with my feedback, now that I think of it!

The Truth about “Cursor Busting” in SQL

Let’s say you are called in to troubleshoot a stored procedure that is performing poorly.

You dive in to investigate and this is what you find:

create procedure ProcessProducts

as

    declare @Products cursor, @ProductID int

    set @Products = cursor for select ProductID from Products order by ProductID

    open @Products



    fetch next from @Products into @ProductID



    while (@@FETCH_STATUS=0)

        begin

        exec DoSomething @ProductID
        fetch next from @Products into @ProductID

        end



    deallocate @Products

Ah ha! A cursor!  It seems we have identified the bottleneck: Clearly, the performance problems are because the code is not doing things in a set-based manner, but rather by processing rows one at a time using a dreaded cursor.  This cursor is opening up the Products table, looping through the rows one at a time, and calling the “DoSomething” stored procedure for each ProductID.  As we all know, cursors are not the way to go when writing SQL code; this cursor should eliminated and replaced with a cleaner, more efficient (and more socially acceptable!) solution.

So, how we do optimize this?  Well, a commonly suggested approach is to eliminate the CURSOR by replacing it with a WHILE loop:

    declare @ProductID int

    set @ProductID = -99999



    while (@ProductID is not null)

        begin

        set @ProductID = (select top 1 ProductID
                          from Products
                          where ProductID > @ProductID
                          order by ProductID asc)


        exec DoSomething @ProductID


        end

Instead of declaring a CURSOR to loop through the table, we now are using “set-based” code and our problems seem to be solved.  The cursor is gone, our code looks much cleaner, we’ve tested it and it works properly, so off to production it goes.  Another cursor has been busted!

Right?

Actually … no.

You see, eliminating cursors is not about syntax.  It is not about searching for the word “cursor” in your code and just replacing it with a WHILE loop that does the same thing.  Optimizing and replacing cursors involves much more.  We can never optimize any cursor code until we look deeper into what exactly is happening when we “process” each of those rows.  In this case, we need to find out what that “DoSomething” procedure is actually doing. 

Suppose the DoSomething procedure is generating a report and sending an email to the “Product Manager” for each product that contains status information, and then logging this email message into a table somewhere.

If that is the case, what have we just gained by replacing our CURSOR?  

Honestly — not much,  if anything at all.  Because of the task at hand, we may very well need to process rows in the Product table one-by-one to send our emails and generate the report, and the bottleneck here is not the cursor code at all, but rather the report generation and maybe sending the email.   Eliminating the cursor code probably gains us nothing here.  If you need to process rows one at a time, go ahead and use a cursor — that’s what they are there for!   Replacing a perfectly fine, simple cursor with a WHILE loop might even make your code longer, or more confusing, or even less efficient depending on circumstances. 

For example, what if we need to process the Products ordered by Region, then Product Name, for whatever reason.  Our cursor code is simple:

set @Products = cursor for

    select ProductID

    from Products

    order by Region, ProductName

All that we needed  to change was our ORDER BY clause.  Now, how would we write this as a WHILE loop?  Is it possible?  Sure.  Will it be as simple and clean as using a cursor?  No, it won’t. (Though ROW_COUNT() makes this much easier than it used to be)

Now, I am not here to say that cursors are “good”, but if you really need to process rows one by one, go ahead and proudly use a cursor.   Replacing cursors isn’t about processing rows one-by-one in a different way (i.e., using a WHILE loop instead), it is about not processing rows one-by-one at all!   

Let’s consider another scenario: What if the DoSomething stored procedure is checking to see if the Product’s ExpireDate is greater than today’s date, and if so, it is updates the Status column for that Product to ‘X’.

In that situation, what have we gained by rewriting ProcessProducts without a cursor, and using a WHILE loop instead?   The answer is, once again: nothing!  In fact, we potentially have once again made our code more confusing or even less efficient than a cursor might be!  Remember, the bottleneck isn’t the cursor syntax — it is the fact that we are processing rows one at a time.  Replacing the cursor with the WHILE loop didn’t solve this problem, did it?  

So, looking now at both of the scenarios I presented for the DoSomething stored procedure, it should be clear that we did not fix anything by replacing the cursor in either case simply by writing a WHILE loop.  If that’s all you are doing, don’t bother replacing the cursor at all.  You haven’t optimized anything.

As I said before, the art of replacing a cursor is not a find-and-replace syntax change operation — it is a fundamental change in how you process your data.  As in the Product report generation and email example, it may be that we simply need to process rows one by one, and thus no further optimization is possible from a SQL point of view.  In situations like updating the Product table, however, we do not need to process the rows individually — we can do everything in one single UPDATE statement.  Thus, in order to determine how to optimize the ProcessProducts stored procedure, we needed to dig deeper into entire process as a whole, which included examining the DoSomething stored procedure and determining the full scope of exactly what this “ProcessProducts” stored procedure is doing. 

So, if “DoSomething” is updating the Products table as specified, we now know that a good replacement for our cursor code doesn’t result in a WHILE loop and calling a separate stored procedure over and over at all — it results a true, set-based solution:

create procedure ProcessProducts

as

    Update Products set Status=’X’ where ExpireDate > getdate()

   

And THAT is how you optimize a cursor! No loops, no calling of another stored procedure for each row in a table, no “find-and-replace” cursor code removal.  We examined the entire process, and rewrote the entire process, to get it done quicker and shorter and faster without cursors or loops. 

Always remember: Replacing a cursor isn’t about rewriting your syntax, it is about redesigning your algorithm.