18 months of dealing with a MySQL ‘too many connections’ error

Database Crash

On a few separate occasions the database has become overwhelmed with ‘too many connections’ and then starts rejecting people from trying to view the website.

We are running:

  • Fedora
  • MySQL 5.0.27
  • WordPress 3.0.1 (the problems all occurred with WordPress 2.7-2.9)
  • phpBB 2.0.23
  • PHP 5.1.6

Current solution

The database has now been up for 220+ consecutive days. The main solutions were:

  1. tune the database according to MySQLTuner
  2. defragment the tables weekly based on this post
  3. put in place WP-Cache for WordPress
  4. Bad Behaviour for WordPress
  5. PhBad Behave based on Bad Behavior for phpBB
  6. finally create a Spam Pot for phpmyadmin sniffing based off the idea of the project honey pot site (see below).

I also put in place a cron job that tries to connect to the database every 15 mins, if it gets a ‘too many connections’ error, it tries to restart the webserver. It calls a shell script connection-test.sh that runs a php file in the same directory. It was almost completely ineffective – by the point it kicks in the webserver has ground to a halt.

If this doesn’t solve the problem read on…

[ACTION] Log what happened

Logs:

  • /var/tmp/php.log – will show php error messages of failing to connect to the database.
  • /var/log/httpd/access_log – will show who was accessing the site when the database errors occurred.
  • /var/log/httpd/error_log

First log what is happening, the extended-status command has a few useful variables you should check:

$ su -
$ ps -ef > /tmp/all_processes_[DDMMYY].txt
$ mysqladmin -u root -p processlist > /tmp/db_processes_[DDMMYY].txt

You should also log this, but once the server is back up and running:

$ mysqladmin -u root -p extended-status > /tmp/db_status_[DDMMYY].txt

[ACTION] Ways to get the server running again

Access to the server will be slow, but try these fixes in order, they go from the least to the most dramatic.

First stage

To fix the problem you can restart the webserver. Its better to call an explicit stop and let the main server begin responding quickly again before starting the webserver.

$ su -
$ service httpd stop
$ service httpd start

Wait 5 minutes to see if these changes take effect.

Next stage

If this fails try restarting the services, note the problems of restarting mysqld :

$ service mysqld restart
$ service httpd restart

You might need to do this more than once if there is no response. Restarting the webserver (httpd) is less drastic than the database (mysqld).

Wait 5 minutes to see if these changes take effect.

Final stage

$ reboot

You’re on your own if you get to this stage – it shouldn’t require a reboot.

Other solutions

Set the interactive_timeout to be one hour, which should kill off dead connections (the 19th Mar attack only lasted on minute and the connections then lay dormant). However by default the connections should have died after 8 hours but didn’t – so maybe a timeout won’t work. According to the first list item below the MYSQL_CLIENT_INTERACTIVE option must be set when making mysql_connect statements.

Here are a few useful pages on this:

A reboot of the server should not be required.

I’ve noticed that the query log file was massive at 6GB. I’ve backed up the log and started a new one.

Analysis of activity when the crash occurred

Sun 23-Nov-2008 11:28:23

PHP Error messages: 1040 (too many connections), 2002 (can’t connect through socket)

Thu 11-Dec-2008 08:16:05

PHP Error messages: 2002 (can’t connect through socket)

Thu 19-Feb-2009 17:45:25

PHP Error messages: 1040 (too many connections)

Thu 19 Mar 2009 19:54

PHP Error messages: 1040 (too many connections), 2013 (Lost connection to MySQL server during query)

There was a large number of connections from http://www.iguanahosting.com (IP Address 200.6.152.205). They started at 19:54:03 and finished at 19:54:51 i.e. all connections were sent within a minute to the root of our website. This overloaded the WordPress database and consequently the errors show up in the php error log from the forum (I wrongly assumed that it was the forum causing the problem). The database then thrashed around until 9am the next morning when I restarted the mysql server and the apache webserver.

Fri 24 Apr 2009 04:54

PHP Error messages: 2002 (can’t connect through socket)

Similar incident (to 19 Mar 2009) with many mysql queries active. Rebooted server, had to manually restart mysqld. Reconfigured so mysqld would automatically start in future.

The PHP logs when the database went down are the same as last time:

[24-Apr-2009 04:54:01] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Lost connection to MySQL server during query in /var/www/html/forum/db/mysql4.php on line 48
[24-Apr-2009 04:54:02] PHP Warning:  mysql_error(): supplied argument is not a valid MySQL-Link resource in /var/www/html/forum/db/mysql4.php on line 330
[24-Apr-2009 04:54:02] PHP Warning:  mysql_errno(): supplied argument is not a valid MySQL-Link resource in /var/www/html/forum/db/mysql4.php on line 331

These three lines are repeated, with further errors coming in later:

[24-Apr-2009 06:36:23] PHP Warning:  mysql_connect(): Too many connections in /var/www/html/forum/db/mysql4.php on line 48
[24-Apr-2009 06:36:23] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Too many connections in /var/www/html/forum/db/mysql4.php on line 48

[24-Apr-2009 08:56:40] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) in /var/www/html/common/include/database.inc.php on line 19
[24-Apr-2009 08:56:40] PHP Warning:  Missing argument 2 for show_db_error(), called in /var/www/html/common/include/database.inc.php on line 21 and defined in /var/www/html/common/include/dberr.inc.php on line 2
[24-Apr-2009 08:56:40] PHP Notice:  Undefined variable: email_string in /var/www/html/common/include/dberr.inc.php on line 4
[24-Apr-2009 08:56:40] Error connection with localhost for , 2002: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

Thu 30-Apr-2009 00:41:17

PHP error messages: 2002

The crash seemed to be connected to a rogue spider application from the firm Kintiskton LLC – IP Address range 65.208.151.112 – 65.208.151.119. A Google search on them implies they have affected a few sites. Essentially they flood your server with traffic over a short period of time.

This time the cron job I set up did perform a httpd restart (at 04:06 and 07:50), which seemed to get the server working again on the second attempt. The cron job calls a php page that attempts to connect to the database and if it fails it calls the shell program ‘/sbin/service httpd restart’. It had previously been calling ‘service httpd restart’ which seemed to result in the follow

There is a possible fix to the 2002 / 2013 error messages which is to no longer use the socket connection and use a connection setting of 127.0.0.1 rather than localhos, which would avoid using the socket connection. This is treating the symptoms rather than figuring out the cause though.

It appears that there was some access to the database during the errors (00:41 – ~08:00).

A further attempt to mitigate this has been to allow a higher max_connections in the /etc/my.cnf at 300 in the hope that the total is too large for 100 but it might max out below 300. This may not be the best solution, but is worth a try.

Wed 20-Sep-2009 21:13:32

A similar set of problems occurred again:

[20-Sep-2009 21:13:32] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Lost connection to MySQL server during query in /var/www/html/forum/db/mysql4.php on line 48
[20-Sep-2009 21:13:32] PHP Warning:  mysql_error(): supplied argument is not a valid MySQL-Link resource in /var/www/html/forum/db/mysql4.php on line 330

This took three attempts to solve. Firstly by using the First Stage of actions listed above – i.e. flushing the hosts and restarting the webserver. This got the website to come back up, but it got overwhelmed again within minutes. A second restart of the webserver, managed to stop the webserver but failed to restart it. When the web server was started again it came up ok and stayed up.

The other interesting thing is that the connections that have problems are all to do with the forum. Specifically it is the mail2forum add-in log. The log was in debug mode (set in the file /var/www/html/forum/m2f/m2f_config.php mentioned on the mail2forum website) which largely increases the database access. This means that if there is an issue, the connections fill up very quickly overloading the server.

I have put the logging down to the standard level now.

It is worth noting the the first set of database processes that cause the website to crash were due to the forum. After restarting the webserver the database became overwhelmed again, but this time with wordpress processes.

Sun 04-Oct-2009 21:00:58

Similar problems to previous (21-Sep) but more severe.

[04-Oct-2009 21:00:58] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (11) in /var/www/html/forum/db/mysql4.php on line 48
[04-Oct-2009 21:00:58] PHP Warning:  mysql_error(): supplied argument is not a valid MySQL-Link resource in /var/www/html/forum/db/mysql4.php on line 330
[04-Oct-2009 21:00:58] PHP Warning:  mysql_errno(): supplied argument is not a valid MySQL-Link resource in /var/www/html/forum/db/mysql4.php on line 331
[04-Oct-2009 21:01:58] PHP Warning:  mysql_connect() [<a href='function.mysql-connect'>function.mysql-connect</a>]: Lost connection to MySQL server during query in /var/www/html/forum/db/mysql4.php on line 48

Logged all the problems whlist the server was slow. Took approx 40 mins. Flushed the mysql hosts and then tried restarting the httpd service. The PuTTY? connection then stopped responding before the httpd service was stopped. Further attempts to connect via PuTTY failed. Then the server was manually rebooted by the offsite hosts.

Almost all of the connections to the server were wordpress connections, none especially seemed to be causing a problem. At the time of the crash an IP address in Canada was trying to guess the URL of phpMyAdmin. The first two attempts resulted in a 404 not found error. The next 8 requests gave 500 internal server errors. All the attempts should have returned simple 404 not found requests.

After following an idea on improving MySQL performance I installed MySQLTuner onto the web server which pointed out that the max number of threads (300) would use up 108% of the RAM. This was probably the reason for having to do a hard reboot. I have set the max threads to be 200 (in /etc/my.cnf) which should max out at 74% of the RAM. Hopefully then the cron job listed above should restart the webserver killing off all the httpd connections allowing the website to come back up.

Fri 23-Oct-2009 16:47:57

This crash was on Friday afternoon. There was an abnormally large number of connections from this IP: 147.91.22.101 around the time when the site went down. Doing a google search for this finds that it appears to be some random bot. This article appears to have some good advice on dealing with various spikes in traffic using wordpress databases: http://blog.circlesixdesign.com/2007/01/22/diggproof/

Tue 27-Oct-2009 22:18:32

This occurred late on a Tuesday night. Still investigating the cause. The IP: 147.91.22.101 was again the cause it seems.

Wed 04-Nov-2009 11:29:37

There were php errrors for about 11 minutes until 04-Nov-2009 11:40:36. Cause unknown.

Fri 06-Nov-2009 22:26:57

This occurred on a Saturday morning. The website actually managed to right itself over the weekend – but it was down for approx 24 hours. As a result I tried following some of the MySQLTuner? (see ’04-Oct-2009 21:00:58′ above) advice. The following was added to the /etc/my.cnf file:

skip-bdb
log-slow-queries = /var/lib/mysql/ns-slow.log
long_query_time = 5
table_cache = 400 #modified from a value of 64

The table cache was modified based on a conservative value given in the mysql manual. It is releated to the max connections * max joins in a query. We have a max connections of 200 and I gave a conservative estimate of 2 for the max number of joins.

I came across some posts that indicate that the Simple Tags plug-in we use is causing problems with slow queries. I have disabled this plug-in for now.

I went through all the tables optimising the ones which were fragmented. There is an excellent post with a script for optimising all the fragmented tables. It uses at its core this SQL:

use information_schema;
SELECT TABLE_SCHEMA,TABLE_NAME
FROM TABLES
WHERE TABLE_SCHEMA NOT IN ('information_schema','mysql')
AND Data_free > 0

We’ll see what happens now…

15-Nov-09 10:24:11

So the database crashed again. This time though the slow query log (/var/lib/mysql/ns-slow.log) caught what was happening 3 minutes before errors showed up in the php log (/var/tmp/php.log).

Queries start normally taking 0.02 seconds start taking 6 seconds – there was probably a build up before this as slow logs only kick in at 5 seconds.

The crash it seems was caused by a large number of queries from 87.115.109.190 from the UK.

I’m going to try implementing the mysql query_cache and thread_cache as recommended by the MySQLTuner script. The following was added to the /etc/my.cnf file:

long_query_time = 2
query_cache_type = 1
query_cache_size = 16M
query_cache_limit = 1M
thread_cache_size = 4

Also the slow query log was set to 1 second to try and capture when the queries start slowing down.

Finally I’m going to add in WP-Cache to reduce the impact of WordPress. The site doesn’t change very much, so it should be worth it.

02-Jan-2010 11:12:17

A host tracker message picked up that the site had gone down.

It was a ‘too many connections’ problem that seemed to start at 11:01 from some slow wordpress queries:

use wordpress;
UPDATE wp_options SET option_value = '1262430076' WHERE option_name = '_transient_doing_cron';
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes';

I found people having similar problems in the wordpress forums.

There appeared to be a possible sendmail attack going on with these processes appearing in “ps -ef”:

sendmail: rejecting connections on daemon MTA: load average: 123
sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue

Don’t know if that’s the case but have shut down the sendmail daemon for now.

At the time when the queries began to slow down there was another hacker looking for phpMyAdmin on our site. The IP Address was 187.146.179.97 which is is Mexico. It logged about 45 requests in 10 seconds. This has happened before (see above) but does not make much sense as to why it should slow the whole server down.

03-Jan-10 09:59:23

The website went down with too many connections. It briefly got itself up again but crashed again soon after.

Attempting to stop the apache daemon (service httpd stop) resulted in the service not shutting down properly. As a result the command “service httpd start” would fail. There was the following error in the /var/log/httpd/error_log file:

[crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock

A google search for the error found that it was leftover semaphores owned by apache that needed to be deleted using the following command:

for i in `ipcs -s | grep apache | awk '{print $2}'` ; do ipcrm -s $i; done

The command finds all the semaphore ids and does a ‘ipcrm -s’ on each id (which should delete the semaphore).

I defragmented the tables as suggested in the wordpress forums using the script found on 6-Nov-2009. Also I decreased the max connections and increased the query_cache_size in /etc/my.cnf.

This appeared to be related to approx 100 hits in 10 seconds from a Saudi Arabian forum spammer over multiple IP Addresses, e.g. 212.138.113.13, 212.138.69.17. Most were listed on the http://www.stopforumspam.com website.

27-Jan-2010 04:59:17

This was the first few error messages:

[27-Jan-2010 04:59:17] WordPress database error MySQL server has gone away for query UPDATE `wp_options` SET `option_value` = '1264568325' WHERE `option_name` = '_transient_doing_cron' made by require, require_once, require_once, require_once, do_action, call_user_func_array, wp_cron, spawn_cron, set_transient, update_option
[27-Jan-2010 04:59:17] WordPress database error MySQL server has gone away for query UPDATE `wp_options` SET `option_value` = '1264568318' WHERE `option_name` = '_transient_doing_cron' made by require, require_once, require_once, require_once, do_action, call_user_func_array, wp_cron, spawn_cron, set_transient, update_option
[27-Jan-2010 04:59:22] WordPress database error MySQL server has gone away for query SELECT t.*, tt.* FROM wp_terms AS t INNER JOIN wp_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy = 'category' AND t.slug = 'php-my-admin' LIMIT 1 made by require, require_once, include, get_header, locate_template, load_template, require_once, wp_title, get_term_by

Interestingly again there were numerous references to php-my-admin (n.b. t.slug = ‘php-my-admin’ above) as somebody else tried to guess the URL of it.

This was the access log at the time:

187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /dbadmin/ HTTP/1.1" 404 9634 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /mysql-admin/ HTTP/1.1" 404 9635 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /mysqlmanager/ HTTP/1.1" 404 9634 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /phpMyAdmin-2.5.4/ HTTP/1.1" 404 26994 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /phpMyAdmin-2.5.5-pl1/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /phpMyAdmin/ HTTP/1.1" 404 9634 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /phpMyAdmin2/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /phpmanager/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:30 +0000] "GET /sqlweb/ HTTP/1.1" 404 9634 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:33 +0000] "GET /phpMyAdmin-2.5.7/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:33 +0000] "GET /phpMyAdmin-2.6.0-beta1/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:33 +0000] "GET /phpMyAdmin-2.6.0-rc3/ HTTP/1.1" 404 9635 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:34 +0000] "GET /phpMyAdmin-2.6.0-pl3/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:35 +0000] "GET /phpMyAdmin-2.6.1-pl1/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:35 +0000] "GET /phpMyAdmin-2.6.2-beta1/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:35 +0000] "GET /phpMyAdmin-2.6.3/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:35 +0000] "GET /phpMyAdmin-2.6.4-rc1/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:36 +0000] "GET /phpMyAdmin-2.6.4-pl4/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:36 +0000] "GET /phpMyAdmin-2.7.0-pl1/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:36 +0000] "GET /phpMyAdmin-2.8.0-rc1/ HTTP/1.1" 404 9636 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:36 +0000] "GET /phpMyAdmin-2.8.0.2/ HTTP/1.1" 500 358 "-" "-"
187.146.181.169 - - [27/Jan/2010:04:56:36 +0000] "GET /phpMyAdmin-2.8.1/ HTTP/1.1" 500 358 "-" "-"

Note that the first 9 receive 404 not found errors then the database crashes and most of the rest give 500 internal server errors. Further to this they were actually using two IP addresses at the same time, resulting in 60+ attempts in 7 seconds. Clearly 10 per second is too much for the database.

As a result of this I have put in place a /spam-pot/ directory that has a blank php page that simply logs a hacking attempt. I have then built a list of apache Aliases that point to this directory based on the 700-odd hacking attempts logged between 18th Dec 2009 – 17 Feb 2010.

This should prevent these hacking attempts from hitting the database which is the main problem.

This is based off the idea of the project honey pot site.

04-Feb-2010 11:00:34

This was a very temporary error that occurred.

[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT *    FROM wp_links  WHERE 1=1 AND link_visible = 'Y'     ORDER BY link_name ASC made by require, require_once, include, get_header, locate_template, load_template, require_once, suckerfish, wp_list_bookmarks, get_bookmarks
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT * FROM wp_users WHERE ID = 5 LIMIT 1 made by require, require_once, include, the_post, WP_Query->the_post, setup_postdata, get_userdata
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT * FROM wp_users WHERE ID = 5 LIMIT 1 made by require, require_once, include, edit_post_link, get_edit_post_link, current_user_can, call_user_func_array, WP_User->has_cap, call_user_func_array, map_meta_cap, get_userdata
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT term_ID FROM wp_terms WHERE name='Asides' made by require, require_once, include, get_sidebar, locate_template, load_template, require_once
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT term_ID FROM wp_terms WHERE name='Featured' made by require, require_once, include, get_sidebar, locate_template, load_template, require_once
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT t.term_id, tt.parent, tt.count FROM wp_terms AS t INNER JOIN wp_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('category')  ORDER BY t.name ASC  made by require, require_once, include, get_sidebar, locate_template, load_template, require_once, wp_list_categories, get_categories, get_terms, get_terms
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT t.*, tt.* FROM wp_terms AS t INNER JOIN wp_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('category')  AND ( t.term_id <> 0  AND t.term_id <> 0 ) ORDER BY t.name ASC  made by require, require_once, include, get_sidebar, locate_template, load_template, require_once, wp_list_categories, get_categories, get_terms
[04-Feb-2010 11:00:34] WordPress database error MySQL server has gone away for query SELECT * FROM wp_users WHERE ID = 5 LIMIT 1 made by require, require_once, include, get_sidebar, locate_template, load_template, require_once, the_post, WP_Query->the_post, setup_postdata, get_userdata
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s