Archive for May, 2008

Log Buffer #98

Hello and welcome to the 98th edition of Log Buffer. My name is Jeff Smith and I will hosting this week’s exciting episode. If, for some reason, you are not completely satisfied with this edition, simply write in and complain to Dave over at The Pythian Group and you will receive Log Buffer #99 absolutely free! Now that is a guarantee you can feel good about. OK, let’s get to work.

I have only limited exposure to both PostgreSQL and MySQL, but I have often wondered why MySQL is so popular while it seems that PostgreSQL has the superior features.  Over at Xaprb, they attempt to answer that very question.  Be sure to read the comments from that post, and check out the big discussion from that article over at reddit as well.  The theory I like the best?  MySQL is easier to pronounce!  (How do you pronounce “PostgreSQL” anyway?)

Speaking of MySQL, Sheeri Cabral points out that MySQL’s website certainly doesn’t do the product any favors, and there’s also a good discussion at Xaprb on why MySQL is Free Software but not Open Source.  If you ever wanted to add a new Unicode collation to MySQL, Alexander Barkov and Peter Gulutzan provide all the information you’ll need.  Peter at the MySql Performance Blog tells us that MySQL lacks a good memory profiling tool, and based on his feedback, others seem to agree.  (No, not those Others!)  Speaking of MySQL feature requests, Justin Swanhart asks “Why does INFORMATION_SCHEMA fail to show information about TEMPORARY tables?”  and also lets us know that his materialized view stored procedures for MySQL have been OKed for releaseSunny Walia (what a great name — is it possible to not be a fun person with a name like that?) tells us how to install innotop to monitor innodb information in real-time and wonders “Oh dear MySQL slave, where did you put those rows?”  Going back to the MySQL Performance Blog, Vadim warns us of a dangerous MySQL command; be sure to keep that one locked safely away from the kids.

Regarding a product I actually know a little about, Kalen Delany has a nice list of Free SQL Server Troubleshooting Tools to check out.  If you haven’t seen it yet, my co-blogger here at SQLTeam Mladen has an amazingly popular list of Free SQL Server Tools that might make your life a little easier that was published a while back but it is always worth mentioning.  While you are visiting Mladen’s blog, don’t miss his latest post on getting immediate deadlock notifications for SQL Server 2005.  Also, be sure to leave him lots of comments telling him that his blog is great but that he is your second favorite SQL Server blogger — after me, of course!

Still on the topic of SQL Server, Jamie Thomson provides us with a tip for ensuring that your root folder is valid when using SSIS.  Denis Gobo asks: What did you do to master SQL?   (Interestingly enough, for me it was by learning MS Access first!)  Tony Rogerson warns us of the performance implications of using Row_Number() in non-recursive CTE’s.  And Paul S. Randal describes a CHECKDB bug that people are hitting; thankfully, he says that “you can only hit this bug if you ALREADY have corruption, that it’s quite rare, and that there is a workaround.”

Everyone enjoys a good analogy, right?  After all, a good analogy is like an ice cream cone: they both are … hmmm … OK, well, that’s not a good analogy at all.  Never mind. Speaking of bad analogies, I bet that unlike Peter Gulutzan you never really thought about the expression “half baked” before and how it relates to MySQL features.  Well, now there’s your chance!  (Of course, a pessimist would prefer “half un-baked”, but that’s a discussion for another time.)

A big topic lately has been SQL Injection attacks.  I always find this funny because this is the easiest problem to avoid in the history of programming; as CodeAssembly tells us, “Never concatenate user input to your queries, without exceptions.”  That’s really all there is to it — do that, and you are good to go.  As I’ve written before, using parameters is not only safer, but your code is much shorter and simpler than if you concatenate strings all day long. 

Federico Cargnelutti gives us an introduction on managing and applying database changes with LiquiBase, an “open source, DBMS-independent library for tracking, managing and applying database changes.”  I have never used LiquiBase, but sounds like something worth looking into.

While reading Magnus Hagander’s PostgreSQL Blog, I found out that Yahoo claims it has the largest SQL database in a production environment — and they use PostgreSQL.  Impressive!  Peter Eisentraut checks in from PGCon Day One, which included a presentation of his on porting Oracle Applications to PostgreSQL.   For those of you out there using Max OS X, Perldiver has summarized instructions on building PostgreSQL on Mac OS X.  Going off on a tangent, they just opened a new Apple store here in Boston on Boylston street.  I visited it this weekend after getting my usual bad haircut next door.  My verdict on the store?  It sure looks nice, but I had no luck finding a new 5 1/4″ floppy drive for my Apple II.  Try to do better next time, Apple!

James McGovern offers some praise for Mark Wilcox of Oracle.  Why? Because Mark has been doing some must-read blogging over at the Oracle.com blogs.  Getting back to my favorite topic, which is coding SQL, Michael Armstrong-Smith instructs the Oracle crowd on using CASE to solver Outer Join issuesShay Shmeltzer provides some tips on creating a master with two details on the same page when using ADF. 

Sticking with Oracle links, Pete Finnigan ponders read only tables or read only users, and notes that in Oracle a read-only user “has approximately 27,000 other privileges because of grants to PUBLIC. This is the killer issue as because of this it is in fact not possible to create a read-only user.” Hmm … only 27,000?  Come on, that doesn’t seem that bad to me!  Eddie Awad tells us about the Lazy Developer’s way to populate a Surrogate Key and over at the Oracle Scratchpad, Jonathan Lewis provides some helpful links on Index Efficiency.  Finally, if you are looking to install Oracle Database 11g Release 1 on Fedora 9 (and who isn’t?), everything you need to know is covered over at Oracle-Base

Now, if you’re like me, you hate DBAs.  Ah, just kidding, of course we all love our Database Administration Overlords (and I’m not just saying that because most of the people reading this probably are DBAs.)  However, even the best DBAs out there occasionally make mistakes.  If you have some horror stories of your own to share, or if you simply want to take pleasure in the misfortune of others, be sure to check out Kalen Delany’s call for DBA Blunders.  (Of course, to be fair, even us developers occasionally make mistakes.) To help avoid future blunders, consider this advice: Do not use Windows System Restore as a backout plan for SQL Server Service Packs, Cumulative Updates, or HotFixes.  Also, Tara here at SQLTeam reminds us to optimize your tempdb and even provides a helpful script.  I’d like to add my own helpful tip for DBAs:  Schedule regular database backups!   Remember, you read it here first.

Previewing the upcoming SQL Server 2008, SQLTeam’s Derek discusses the Data Profiling Utility with SQL Server 2008.  It sure seems like a nice tool, but that still may not make people any less nervous about SQL 2008.  (Heck, my team is still nervous about SQL Server 2005!)   Aaron Betrand urges people to vote if you want IntelliSense in SSMS 2008 to also support SQL Server 2005, which seems like a great idea to me, and Linchi Shea has a quick analysis of SQL Server 2008 Page Compression and its performance impact on table scans.  Finally, Jamie Thomson dissects the fuzzyness of SQL Server 2008.   To me, “fuzziness” is what happens to my vision after drinking too many mojitos, but Jamie is discussing a new feature in SSIS 2008 so give it a read.

I’ve always believed that you don’t truly know all there is to know about databases until you understand the raw data structures of tables and indexes and so on.  Over at MSDN Channel 9, there’s a new series of videos on Data Structures and Algorithms, so be sure to watch if you want to know how database engines really work “under the hood”.  My enjoyment of the video was unfortunately interrupted by horrible flashbacks from my CS310 days.

Moving away from relational databases, Jim Wilson helps us to understand HBase and BigTable.  Apparently, HBase is the open source implementation of Google’s BigTable database, which is described as a “sparse, distributed, persistent multidimensional sorted map.”  In layman’s terms, that means “a database with wicked huge tables.”

For those looking for a laugh, be sure to read Andrew Calvett’s MS SQL Server Book of Wisdom.  It reminds me quite a bit of my infamous and widely misinterpreted Top 10 Things I Hate About SQL Server post from way back in the olden days.  Be careful, Andrew: sometimes folks don’t get it if your jokes are too subtle!  (Of course, in my case, it could be that my jokes just weren’t that funny.)

Well, that’s all for this week.  Thanks, Dave, for giving me an opportunity to write this week’s Log Buffer.  It was lot of fun and a welcome opportunity for me to spend more time than I usually do reading lots of great blog posts from around the internet.  Have a great weekend everyone!

Implementing “Interfaces” in SQL

My latest article has just been published over at SQLTeam:

    Implementing Table Interfaces

When I wrote a Table Inheritance article a few months back, the technique shown was pretty standard and straight-forward.  As I was writing it, I thought it would be an interesting challenge to figure out a way to implement table interfaces as well, where different tables don’t inherit from the same base class, but they still “implement” the same relations.  That definitely was not as easy, and the end result isn’t as clean and direct, but I hope this at least provides some ideas and at the very least it should provoke interesting comments and alternative approaches.

Need an Answer? Actually, No … You Need a Question

Welcome!

The reason you were directed here is because you need assistance, and I am here to help.  I am not, however, here to provide you with any answers!  You see, it looks like the assistance you need is not finding an answer; it is rather that you need assistance finding a question.

As you know, there are all kinds of questions.  Questions that test memory recall.  Questions that test logic skills.  Brain-teasers and mathematical questions and so on.  But there is one requirement that all good questions must have in common before they can be answered:

A proper question MUST provide ALL of the information necessary in order for an answer to be given. 

In other words, if you omit important information from a question, it doesn’t matter how simple or easy that question is:  It suddenly becomes very difficult, or even impossible, to answer. 

For example, consider the following question:

“Am I wearing a hat?”

Seems pretty easy, right? No logic, no memorization, no trivia, no knowledge of any specific topic is required. 

So … what’s the answer?  Take a few minutes, think about it, write it down on a piece of scrap paper.  I can wait, take your time …

What’s that, you say? You can’t answer that simple question!?  Why not?  I stated it very clearly, it requires a simple YES or NO response, there’s nothing tricky there.  So, why would anyone have any trouble giving an answer to something so basic and simple?

The reason, of course, is because you can’t see me.  You have no way of knowing what I am wearing because I did not provide you enough information!  As simple as it is, it cannot be answered; therefore, it is not a proper question!

Suppose, instead, I provided a picture of myself and asked “In this picture, am I wearing a hat?”  And, in the picture, my head is clearly visible and the fact that I am wearing a Boston Red Sox cap is very clear.  Would you be able to answer the question in that scenario?  Of course!  Suddenly, what was an impossible question to answer became very simple! 

How did that happen?

It happened, of course, because I provided you enough information to answer the question!  And that is the often problem with many of the questions we see day to day in forums asking for help.  You cannot expect an answer unless you provide a proper question with all the necessary information.  The majority of the effort by those helping others in these forums is not spent answering questions, it is spent trying to figure out what the heck the question actually is!  And that is the problem; people don’t seem to realize that they can’t just randomly cut and paste code or ask vague questions without any context and expect to receive help! 

Yet, requesting more information and details doesn’t always go over so well … Those looking for help seem to often have trouble understanding why the helpers need more info, why are they are so “anal” and “demanding” about minor things like database schemas, or sample data, or code samples.  “How is that important?” they wonder. “Just answer the question and provide me with some help, please!  An expert would know the answer!”

My goal today is to hopefully help you to understand why information and context is so important for even simple, basic, questions, and how providing that information suddenly transforms a poor question into a very good one that can be quickly and accurately answered. 

Let’s try another example:

“How do I get from work to Fenway Park?”

So, what do you think?  Is that a good question that provides all of the information?  It seems simple, right? It is just asking for basic directions. Yet, where is “work”?  Is it asking for driving directions, or walking directions, or maybe which subway lines to take?  Who knows!!?

As stated, this question simply cannot be answered!   Yet, if I just  thought about it a little and made sure to provide all of the necessary information, this “impossible” question with no answer suddenly becomes a very simple one:

“How I do get from 125 High Street, Boston, MA to Fenway Park via the subway?”

See the difference? Instead of just assuming that everyone knows where I work and what mode of transportation I am looking for, if I make sure to simply tell them, there is no uncertainty, no confusion, no guesswork, and the question can be answered.  This isn’t rocket science, right?  Yet, these common-sense basics seem to elude many, many, people!

Would you drop your car off at a mechanic with a note on it that says:

“Car doesn’t work.  Plz fix.  It is urgent! Thx!!”

I sure hope not.  You’d explain what’s wrong, right?  It has trouble starting, it has a flat tire, there’s smoke coming from beneath the hood, the steering wheel fell off, and so on. It’s basic common sense that you would do everything you can to be sure that the mechanic has the information he needs to fix your car correctly and promptly, right?  Shouldn’t that same logic also apply when asking for help in forums?

Finally, let’s try a SQL Server question:

“What is wrong with teh codez? it does not work!  Plz Help!



select SaleID, Customer, Qty, Price

from SalesNumbesr

Thnx!”

Take a look at that question.  It is a very simple SQL statement, right?  There’s nothing there that a beginner could not understand.   Can you “spot” the problem and fix it? 

Hmmm … maybe you can, maybe you can’t.  You can’t really be sure, can you?  At this point, we can all try to guess what the problem is.  What does “it does not work” even mean? Is “SalesNumbesr” a typo?  Should it be “SalesNumbers” ?  Is it returning too much data?  not enough data?  Incorrect data?  Is it generating an error?  And so on. 

We could spend all day trying to guess what the question is and provide answers to those guesses, but if the guesses are wrong, the answers won’t be so helpful, will they?  On the previous question, what if you guessed that I work in Cleveland and provided directions for me to Fenway Park from there?  Would that be helpful to me? Probably not, right?  Most likely, it just wasted everyone’s time.

So, getting back to the code …. what is wrong with it?  Well, in SQL terms, the answer is: NULL!  It does not exist.  Until we are provided with more information, the question cannot be answered. Thus, it is not a question at all, just an incomplete fragment.  As simple as the question looks, as basic as the T-SQL is, this “question” will stump even the greatest “experts” out there because an answer to this question simply does not exist.

If more information is provided, like this:

“Hi — I currently have the code below:



select SaleID, Customer, Qty, Price

from SalesNumbesr

I would also like to return the total Amount for each Sale, which is the Qty multiplied by the Price.  However, I am not sure how to add this to my current code.  Can anyone please help?”

Suddenly, the question now is very clear and the answer is very simple!  They just want to know how to add an expression to the result set. Just by providing a little more information, and not assuming that everyone knows what is happening outside of the context of what was written, something that was impossible to answer has become very easy. 

I sure hope this is making sense.

In fact, taking this whole article to its logical conclusion, I think we can safely say:

“The more accurate and detailed information a question provides, the more accurate and detailed the answers will be.”

In other words, a vague, incomplete question can only get, at best, vague, incomplete answers.  But a question that spells out the entire situation very clearly will get, at best, a very clear and specific answer that works in that situation. 

So, please, think of this when you ask questions in a forum.  Consider the fact that no one knows your specific environment, or code, or application, or database except for YOU.  And, no one can help you unless you are providing enough information for them to do so. 

Just like the mechanic. Or someone giving driving directions.  Or a doctor when you are sick.   You provide them with the necessary information so they can help you, right?  Consider doing the same to those providing you with (free!) programming advice.

. . .

(Feel free to provide this link to those who seen to have trouble understanding that you cannot read their mind when assisting them with programming help on forums.)

GROUP BY ALL

Here’s an obscure piece of SQL you may not be aware of:  The “ALL” option when using a GROUP BY.

Consider the following table:

Create table Sales

(

    SaleID int identity not null primary key,

    CustomerID int,

    ProductID int,

    SaleDate datetime,

    Qty int,

    Amount money

)



insert into Sales (CustomerID, ProductID, SaleDate, Qty, Amount)

select 1,1,’2008-01-01′,12,400 union all

select 1,2,’2008-02-25′,6,2300 union all

select 1,1,’2008-03-02′,23,610 union all

select 2,4,’2008-01-04′,1,75 union all

select 2,2,’2008-02-18′,52,5200 union all

select 3,2,’2008-03-09′,99,2300 union all

select 3,1,’2008-04-19′,3,4890 union all

select 3,1,’2008-04-21′,74,2840



SaleID      CustomerID  ProductID   SaleDate                Qty         Amount

———– ———– ———– ———————– ———– ———————

9           1           1           2008-01-01 00:00:00.000 12          400.00

10          1           2           2008-02-25 00:00:00.000 6           2300.00

11          1           1           2008-03-02 00:00:00.000 23          610.00

12          2           4           2008-01-04 00:00:00.000 1           75.00

13          2           2           2008-02-18 00:00:00.000 52          5200.00

14          3           2           2008-03-09 00:00:00.000 99          2300.00

15          3           1           2008-04-19 00:00:00.000 3           4890.00

16          3           1           2008-04-21 00:00:00.000 74          2840.00



(8 row(s) affected)

Suppose we’d like to see the customers that were sold Product #1 along with the total amount that they spent.

We would basically write a simple SELECT with a GROUP BY like this:

select CustomerID, sum(Amount) as TotalAmount

from Sales

where ProductID = 1

group by CustomerID

And sure enough, we’d get our answer:

CustomerID  TotalAmount

———– ———————

1           1010.00

3           7730.00



(2 row(s) affected)

Now, let’s say that we’d like to see all customers that have been sold any products, but we still just want to see the “TotalAmount” for ProductID #1.  For customers that have never ordered ProductID #1, it should output a “TotalAmount” value of $0.   One way to do this is with a CASE expression; instead of filtering so that only ProductID #1 is returned, we can conditionally SUM() the Amount only for orders for ProductID #1.  Like this:

select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as TotalAmount

from Sales

group by CustomerID



CustomerID  TotalAmount

———– ———————

1           1010.00

2           0.00

3           7730.00



(3 row(s) affected)

That gives us the results we want.   Because we are not using a WHERE clause to filter the data, we see an entry for CustomerID #2 in the output. 

However, in situations where you have written the above SQL, you could actually replace the SUM(CASE…) expression by using GROUP BY ALL, instead of just a standard GROUP BY, like this:

select CustomerID, sum(Amount) as TotalAmount

from Sales

where ProductID = 1

group by all CustomerID



CustomerID  TotalAmount

———– ———————

1           1010.00

2           NULL

3           7730.00

Warning: Null value is eliminated by an aggregate or other SET operation.



(3 row(s) affected)

Notice that now all Customers are now returned, and a NULL is shown as the TotalAmount for Customer #2, who has no orders for ProductID #1 …  Even though though the WHERE clause seems to indicate that we should not be seeing customer #2 in the results!

The ALL option basically says “ignore the WHERE clause when doing the GROUPING, but still apply it for any aggregate functions”.   So, in this case, the WHERE clause is not considered when generating the population of CustomerID values, but it is applied when calculating the SUM.  This is very much like our first solution, where we removed the WHERE clause completely, and used a SUM(CASE…) expression to conditionally calculate the aggregate. 

Values that are excluded from the aggregation according to the WHERE clause have NULL values returned, as you can see in the result.  A simple ISNULL() or COALESCE() will allow us to return 0 instead of NULL:

select CustomerID, isnull(sum(Amount),0) as TotalAmount

from Sales

where ProductID = 1

group by all CustomerID



CustomerID  TotalAmount

———– ———————

1           1010.00

2           0.00

3           7730.00

Warning: Null value is eliminated by an aggregate or other SET operation.



(3 row(s) affected)

Notice that the warning about NULL being aggregated still displays, since that is the standard behavior in SQL Server when you calculate an aggregate on a NULL value.  You can turn these warnings off if you like for the during of the batch by issuing a set ANSI_WARNINGS off command before your SELECT.

GROUP BY ALL is kind of obscure and neat to know, but not really useful in most situations since there are usually easier or better ways to get this result.  For one thing, this won’t work if we want all Customers to be displayed, since a customer must have at least one order to show up in the result.  If we want to see all customers, even those that have never ordered, we would need to do a LEFT OUTER JOIN from the Customers table to our Orders aggregate SELECT:

create table Customers (CustomerID int primary key)

insert into Customers

select 1 union all

select 2 union all

select 3 union all

select 4



– Notice that we have 4 customers, but our Sales data has sales for only 3.



select c.customerID, isnull(s.TotalAmount,0) as TotalAmount

from Customers c

left outer join

    (select customerID, sum(Amount) as TotalAmount

     from Sales

     where ProductID = 1

    group by customerID) s on c.customerID = s.customerID

   

customerID  TotalAmount

———– ———————

1           1010.00

2           0.00

3           7730.00

4           0.00



(4 row(s) affected)

That is typically the standard way to return data for an entire population, regardless of existing transactions.  GROUP BY ALL gets us close, but if a new customer has never made an Order, they will never show up in the results.   Of course, depending on your needs, that may be what you want.

Another limitation is we can not use GROUP BY ALL if we want to return a grand total for all orders, along with the total just for ProductID #1.  For example, using the SUM(CASE…) expression along with a regular SUM(), we can do this:

select CustomerID, sum(case when ProductID=1 then Amount else 0 end) as Product1Amount,

    sum(Amount) as TotalAmount

from Sales

group by CustomerID



CustomerID  Product1Amount        TotalAmount

———– ——————— ———————

1           1010.00               3310.00

2           0.00                  5275.00

3           7730.00               10030.00



(3 row(s) affected)

That lets us calculate two different totals all in one pass through the table.  However, we cannot translate that using GROUP BY ALL, because while we will be able to return the Product1Amount, there would be no easy way to also get the TotalAmount for all products without an additional join or sub-query.

. . .

So, that’s the story with GROUP BY ALL. It is interesting, and not widely well-known, and may even make for a good interview question if you really want to see how much SQL a candidate knows.  But for practical purposes, it is pretty rarely used and there are generally better ways to get the same results more easily or more efficiently.

Anyone have a good situation or an example of where GROUP BY ALL really worked well for you?  Be sure to share your experiences in the comments.