innodb_file_per_table, shrinking table spaces and the data dictionary

INNODB has some irritating gotchas that makes disk space management hard. In 2002ish INNODB, added innodb_file_per_table to get around allot of these issues, but it does not fix everything.

If you are running innodb_file_per_table, you will notice in your database directory

  • db.opt - database characteristics file.

  • tablename.frm - the table structure.

  • tablename.ibd - the actual innodb table space file



Imagine that you have a table with 10 million rows and you delete say 5 million rows in multiple chunks around 400K chunks, because deletes are slow. Next, you notice that the table space file did not shrink. So what do you do? OPTIMIZE tablename, tada all the wasted space is reclaimed, but here is the PROBLEM the ibdata file grew!

ibdata stores all of the UNDO LOGS thus GROWS due to the deletes and space is never reclaimed.


Ok, lets try coping the tablespace file to another directory and re-import the tablespace file after wiping the data dictionary (ibdata).

For instance
/etc/init.d/mysql stop
cd /var/lib/mysql/DB
cp * /tmp/hold
rm /var/lib/mysql/ib*
cp /tmp/hold/* /var/lib/mysql/DB
/etc/init.d/mysql start // create the ibdata file

ALTER TABLE tablename IMPORT TABLESPACE
ERROR 1146 (42S02): Table 'DB.tablename' does not exist

Really, there is nothing that you can do to force innodb to purge the undo logs in the ibdata file without a full dump of the data to text and then reimport the data.

So here are the steps to shrink all table spaces and the ibdata file

mysqldump --all-databases (or use mk-parallel-dump)
stop mysql
rm -f /var/lib/mysql/ib* /var/lib/mysql/DB/*
start mysql
mysqlimport (or use mk-parallel-restore)

Takes a bunch of time but there is no other recourse. If you know of another way please share :)

Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP

I'll be giving a web presentation Jan. 19th 2010. If you would like to check it out please do!


Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP


I thought it went well. I will post the link to the slides here :)

An interesting take on “Stored Procedures” (link)

Another great post SQL-related post over at TheDailyWTF regarding a, umm, "creative" use of "Stored Procedures":

http://thedailywtf.com/Articles/For-the-Ease-of-Maintenance.aspx



Asynchronous Queries verses Synchronous Queries

In a procedural language without the use of threads (or Inter Process Communication via forks), to execute I/O requests they are done one after another. Synchronous Queries produce at best a Big-O of N such that N is an element of I/O communication (queries) and N equals the number of queries needed to achieve the requested dataset.
With IPC or threads we can speed up common O(N) problems to reduce the N with parallelism, its still functionally a O(N) yet from a single instance point of view N is much less because threads (IPC) takes that Serial computing component and executes the code in parallel. To better explain what I am talking about lets look at some PHP code:


foreach($friends as $friend){
$data[] = getMySQLData(“SELECT * FROM AccountData WHERE userid = $friend);
}


The Primary key for the AccountData table is userid. Assuming that you have 5000 friends, the query has to be executed 5000 times.
We can reduce the O(N) and change it to a O(nlogn) (Binary Tree - doesn't take into account other factors) by switching the query to


$data = getMYSQLData(“SELECT * FROM AccountData WHERE userid IN (….)”);


We just sped up the retrieval of the data significantly, yet we just introduced a bottle neck on the datalayer. Our architecture requires that the data is located in a single location.

What if AccountData’s data is spread across many servers federated by userid? This means that userid belongs to a server, so the server contains a shard of the AccountData’s Data.

Now we are back to a O(N) where each query needs to be executed on the corresponding shard. The logical next step is to group queries per shard and run across them all. For instance


$multiShardIDs = $genericShard->getMultipleShardIDs($objIds);
foreach ($multiShardIDs as $shardID => $shardUserIDs) {
if (stripos($orgQuery, " WHERE ") !== false){
$query = $orgQuery." AND {$column} IN (".implode(',', $shardUserIDs).") ";
}
else{
$query = $orgQuery." WHERE {$column} IN (".implode(',', $shardUserIDs).") ";
}

$shard_to_sql[$shardID] = $query;

.... more stuff ....



Yet this is still a O(N) its just that N is smaller. Each query is still executed serially.
Let’s look at some stats of synchronous queries of SELECT 1; This query is executed across 35 shards and the timings are from PHP point of view.

FieldEnd ValueStart ValueDelta
ru_oublock00 0
ru_inblock00 0
ru_msgsnd00 0
ru_msgrcv00 0
ru_maxrss00 0
ru_ixrss00 0
ru_idrss00 0
ru_minflt98729865 7
ru_majflt00 0
ru_nsignals00 0
ru_nvcsw1134411114 230
ru_nivcsw977968 9
ru_nswap00 0
ru_utime.tv_usec865054849053 16001
ru_utime.tv_sec1616 0
ru_stime.tv_usec556097552097 4000
ru_stime.tv_sec11 0
Total Execution Time0.18323707580566




As you can see, to execute this from PHP it took 100 ms, 100s pages reclaimed and 200s voluntary context switches to query 35 servers.

Now let’s look at Asynchronous execution of SELECT 1; // the query generation is from PHP yet the execution is performed on a server that executes the query in parallel
FieldEnd ValueStart ValueDelta
ru_oublock00 0
ru_inblock00 0
ru_msgsnd00 0
ru_msgrcv00 0
ru_maxrss00 0
ru_ixrss00 0
ru_idrss00 0
ru_minflt91319121 10
ru_majflt00 0
ru_nsignals00 0
ru_nvcsw38913889 2
ru_nivcsw290290 0
ru_nswap00 0
ru_utime.tv_usec596287596287 0
ru_utime.tv_sec44 0
ru_stime.tv_usec460028460028 0
ru_stime.tv_sec00 0
Total Execution Time0.019363880157471




As you can see from the table above executing the query asynchronously produced results with less context switching, less pages reclaimed and almost 10 times execution improvement over the synchronous query counterpart.
How is the asynchronous query executed? Lets take a look at the figure below.

Async


So A user comes through the firewall / load balancer with a HTTP Request to the www pool that runs PHP. PHP now makes a CURL request to the Async Shard Servers (through a LB same LB different PORT). The HTTP Request to the Async Shard Server contains the SQL we wish to execute. The Async Shard Servers has a thread per shard and executes the request in parallel. The results are merged and sent to the calling CURL process via JSON. The returned JSON is then converted into a PHP object. This is a typical three-tier environment.

When having to query multiple servers using an Asynchronous Tier is dramatically faster; in fact its as fast as the slowest server. This is the main sticking point of why asynchronous queries are faster then synchronous queries (in this context) since the total execution time for serial queries is the SUM of all the query execution.

The current version of the server is used for Friend Query execution across the datalayer. Its been solid for a few months now, and I'm currently getting permission to release it as an Open Source Product. The features this server contains:

  • Lightweight

  • CPU bounded

  • Scales Linearly

  • A Timer Thread to keep the database config up to date in memory and fetching the config from PHP so if PHP changes connections to the shards so does Java

  • Uses Java-6 Executor Service

  • Merges the result set prior to sending it to the calling process

  • Communicates via JSON

  • Uses MySQL Connector/J

  • Supports a high concurrency

  • Optimized thread usage

Programming is like dreaming?

This is from March, so it's a little old, but I just stumbled upon it and found it a bit interesting. I never thought of programming this way before, but it does make sense:

Programming is like a dream


Is it just me …

... or is about time I got back to some blogging?


Steps I take before upgrading mysql

I am not a fan of upgrading mySQL unless I need to. I am of the mind if it is not broke don't fix it, but when I do upgrade I follow these general steps.

If I have run into a mySQL bug, I look to see if that bug is fixed by searching the mySQL bug database.

If I've notice a performance bottleneck, I look to see if the performance bottleneck has been fixed by searching the same database.

I will NOT upgrade to the latest and greatest version of mySQL (5.4) I stay within my branch (5.0).

These are my three general motivations that drive my upgrade decisions. Anytime I upgrade I also make a list of things that might affect my environment for the stuff I use.

    Here are my steps:
  1. Check the change log
  2. Ignore all the NDB changes... I don't use it and that's the majority of fixes. This is also, why I do not use it.
  3. List the changes that will affect the production environment
  4. Deploy the version that I picked on a few servers running my original config
  5. Do data corruption tests (make sure my checksum scripts return the same data)
  6. Verify that the problem I'm trying to fix is fixed
  7. Deploy to more boxes
  8. Let the new server bake for a period of no less than a week
  9. Deploy everyplace



So now, I'm upgrading from 5.0.56 to 5.0.86. What I'm trying to fix is mysql memory overhead at high levels of ram.

For instance, I have a slew of 48GB boxes. I set the bufferpool to 40GB; the OS uses 1 GB of memory (roughly) leaving an overhead of 7GB for the system cache and various spikes of sort buffers. Over a period, I see that mySQL will consume and hold onto 47GB of memory for an unknown reason even with some tight my.cnf settings. (I'm certain they are tight since I know what each buffer does). Therefore, testing some later versions of mySQL we found that these later versions do not grow past the settings defined yet performs the same.

Next, since I decided that upgrading is a good solution, now it’s time to list all the changes that fixes things.


  • 5.0.58 - INNODB performance fix
  • 5.0.60 - various problems that I should be affected by but havn't noticed so it’s fair to assume that said problems were introduced after my build.
  • 5.0.62 - nothing major noticed the sp releases that's why I wait.
  • 5.0.64 - nothing major
  • 5.0.66 - security fixes and fixes to fix the bugs introduced from this build.
  • 5.0.67 - two INNODB performance fixes and crash bug fixes.
  • 5.0.68 - changes show status and fixes an innodb crash bug.
  • 5.0.70 - fix another INNODB crash bug and security fixes
  • 5.0.72 - more general bug fixes
  • 5.0.74 - more stuff I don't care about
  • 5.0.75 - stuff given to Enterprise users now in community
  • 5.8.76 - more bug fixes that I do not need
  • 5.0.78 - more bug fixes I do not care about (run MS Access on windows not mySQL)
  • 5.0.80 - problem with error messages for concurrency limits that caused an assert failure
  • 5.0.82 - Fixes to fix fixes for this build.
  • 5.0.83 more minor fixes that I don't seem to have a problem from
  • 5.0.84- more bug fixes for INNODB and latches
  • 5.0.85 - looks like windows fixes
  • 5.0.86 - fixes that I'm not having problems with


Therefore, overall, upgrading should give me a boost in performance. My own internal testing sees some tighter memory usage, even though this is not fixed explicitly, the product has matured overall so I can account for the reduction in memory to that.

Nagios Event Handlers - Love them

What is Nagios? Nagios IMHO is the best Open Source monitoring system out there. It supports hosts checks, a level to determine on a host level if a box is considered "up". It supports service check, a level to determine if a particular service such as mySQL is up. It has features to log all events to a flat file or to a DB. It can notify you when a service is in a warning state, error state or unknown state.

For the purpose of this article, I am going to talk about handling events such as a clearing up swap.

First, let us look at some configuration of Nagios. We are going to define a command, then service acting on that command. Let us assume that the nagios install is in /usr/local/nagios.

Therefore, in /usr/local/nagios/ a few configuration files are key:
- /usr/local/nagios/etc/objects/commands.cfg - the command file where the checks are defined
- /usr/local/nagios/etc/hosts/*/hosts.cfg - the services file where the checks are defined for execution based on other directives in this file.


A command:

# 'check_local_swap' command definition
define command{
command_name check_local_swap
command_line $USER1$/check_swap -w $ARG1$ -c $ARG2$
}


This says that check_local_swap executes check_swap with a warning threshold of $ARG1 and a critical threshold or $ARG2


Next when defining a service for a host

define service{
use generic-service; Name of service template to use
host_name dbfacebook34b ; hostname
service_description SYS:Swap ; what shows up in alerts
is_volatile 0
check_period 24x7 ; threshold when to check (all the time)
max_check_attempts 4 ; threshold to check before marking state
event_handler handle-swap ; handle an event (another command)
normal_check_interval 5 ; in seconds
retry_check_interval 1 ; only try once before reporting the state
contact_groups itops ; contact group to send notifications to
notification_options w,u,c,r ; need to look this up for all defs
notification_interval 600 ; retry sending notifs every 8 mins
notification_period 24x7 ; keep sending them
check_command check_nrpe!check_local_swap!80%!55% ; execute the event handler and warn like hell
}



Lots of goodies as you can see. Let us look at the event handler

define command{
command_name handle-swap
command_line /home/scripts/handle_swap.pl
}


This means execute this script whenever any event for swap occurs (I decided to make this simple and not put a threshold on this).


What does handle_swap.pl do - well it’s a perl script that looks at free memory and if only a few 100K of swap is in use, swapoff -a; swapon -a;

In this case, it is a bit safe to do this. Why do this? Why not just turn of swap. I have talked in depth about this subject-but for a minor recap. Linux needs swap else, kswapd will freak out. Swap in DB's is bad so I clean it up automatically since O_DIRECT on my SAN is not an option.

Why not just run a cron job? Nagios keeps a log, I like to review what is happening from a central location, and nagios is freaking COOL.

Back From Vacation and Man do I feel great.

For the first time ever I took a two-week vacation. How can a person who has 100s of database servers, 100s of web servers, and a system that supports 100K tps across 20 TB of data is able to take a vacation? Easy, I have a great team that is very competent in managing the platform by following our cookbook routines and guidelines for new application interaction with the databases.

Where did I go? I went to NYC-stayed in Tribeca and only ate Pizza, Hotdogs, White Castle and Hala Food from vendors that are nearly on every street corner. I also lost 6 pounds! After three months of half-ass dieting to prepare for my Pizza diet, who would have known that just going for it was the key to success! How did I do it? Well, first I walked to Little Italy, bought 2-4 slices, ate them, and then walked around the city. That is it. As a Database Dude, I do not do much walking except for when I am too tired to run on the treadmill for 20 mins. I was averaging around 10 miles a day for a 12-hour period walking around NYC taking pictures, enjoying the sites, eating, drinking, and then repeat. I even walked from Canal Street all the way to the Staten Island Ferry just because. Man it was great.
After NYC, I went to Puerto Rico for the second time. My family and I had a freaking awesome time. We went to the only US National Park that is a Rain Forest. Went to Vieques to the Bio Luminous Bay, where the single cell organisms light up when kinetic energy is applied to them. Went to Culebra, to Flamenco Beach – which is off the hook: much better than any other beach I have been to. Drove around the entire Main Puerto Rican Island stopping off in Rincon and eating every 3-4 hours.
I am fully rested and feel like a 20 year old again. So, expect a lot more posts. I think the order will be:

Migrate 20 TB of data into a new Format without downtime.
How to make Friend Queries work with a database.
Throughput increase from Asynchronous Queries.

mysql UC 2009 Talk

Scribe is a bit buggy with displaying this presentation:

Scaling a Widget Company