Keeping your server alive with Monit

NOTE: This is heavy web development/sysadmin stuff; casual, non-website people should probably check this instead.

Lately GV has been pretty out of control, and our server has been crashing way too regularly due to too many visitors or bots. I’ve been working to find all the little holes in the Apache/PHP/MySQL configurations that cause the crashes when load gets high, but that’s impossible while you’re constantly putting out fires and restarting the servers manually.

I’ve been having frustrating fun with a tool called Monit, which helps stop your server from crashing completely by watching its system stats and selectively restarting processes or executing whatever command you want. It installs pretty easily on Linux servers (I think it’s in both Yum for CentOS/RH and apt for Debian/Ubuntu), and it uses text configuration files, similar to Apache’s, to define status conditions and what to do when they are met. The configuration took me a while to get right, but once I had tuned the percentages by watching the server for a while, Monit kept it from going down for more than a minute even once, despite some record traffic related to our Mumbai coverage. If it weren’t for Monit I’d probably still be getting calls in the middle of the night saying the site was down.

You still need to find the bugs in your server configuration, or move to more powerful hardware (which is what we’re doing). But even if it’s annoying that Apache needs to be restarted every few minutes to keep it from crashing the server, that’s better than having it crash randomly when you’re not around. While you’re still tuning the system, you can have Monit email you when certain conditions are met, so you can see how often a given status is reached and decide whether a restart is necessary. The manual explains the functions pretty well and isn’t too long.
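As a rough sketch of that "email me while I'm still tuning" setup, the fragment below only sends alerts and restarts nothing. The mail server, the recipient address, the check name and both thresholds are placeholder assumptions, not values from the live configuration:

```monit
# Where Monit sends mail and who receives it (placeholder values)
set mailserver localhost
set alert admin@yourdomain.org

check system server.yourdomain.org
  # Illustrative thresholds: notify only, no restart
  if loadavg (1min) > 4 then alert
  if memory usage > 60% then alert
```

Watching how often these alerts arrive for a week or so should give a feel for where a restart threshold belongs.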

My advice if you’re setting it up

Setting the reset/exec levels

If you’re setting it up for the first time and you’re not having any problems at the moment, be careful not to set the percentages too high, or the server might crash before it ever reaches them. I was using a memory/RAM maximum of around 80%, but the remaining 20% didn’t seem to be enough to save the server; by then it was already too late. Here are my settings for our Apache webserver:

check system server1.globalvoicesonline.org
if loadavg (1min) > 5 for 2 cycles then exec "/etc/init.d/httpd restart"
if loadavg (1min) > 7 for 1 cycles then exec "/etc/init.d/httpd restart"
if memory usage > 65% for 3 cycles then exec "/etc/init.d/httpd restart"
if memory usage > 75% then exec "/etc/init.d/httpd restart"

The ‘exec’ action runs the Apache restart command directly, which clears out all Apache processes and restarts them, freeing up RAM temporarily. I’m also running two levels of load checking, which measure the strain on the CPU. Together these cover a lot of the situations that result in crashes. There are two versions of each: one for bad situations that have gone on for a while (e.g. “for 3 cycles”, i.e. 3 minutes) and one for terrible situations that are seen even once (“for 1 cycles”, which is actually unnecessary to write).
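Reading "3 cycles" as 3 minutes only holds if Monit polls once a minute. The poll interval is a global setting in the control file; a minimal sketch, assuming a 60-second interval (inferred from the cycles-to-minutes equivalence above, not copied from the actual config):

```monit
# One cycle = one poll. With a 60-second interval,
# "for 3 cycles" means the condition held for about 3 minutes.
set daemon 60
```

With a shorter interval the same rules would fire sooner, so the cycle counts and the interval need to be tuned together.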

You can also set up monitoring of specific processes like Apache or other servers, but for me it’s been a lot buggier than the raw server statistics (it thinks the program isn’t running when it is), so use it at your own risk.
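For reference, a process check typically looks something like the sketch below. The pidfile path and the port test are assumptions for a stock Apache install and will differ between distributions. A pidfile path that doesn't match the one Apache actually writes is one possible reason for Monit deciding a running process is missing, though that may not be the cause of the bugginess described above.

```monit
# Pidfile location is an assumption; check where your httpd writes it
check process httpd with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program = "/etc/init.d/httpd stop"
  # Restart if Apache stops answering HTTP on the local interface
  if failed host 127.0.0.1 port 80 protocol http then restart
  # Stop trying after repeated failed restarts instead of looping forever
  if 5 restarts within 5 cycles then timeout
```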

Alternate Email Formatting for Monit

The default email template that comes with Monit is pretty hard to read, to the point where it’s kind of maddening to receive messages from it. Luckily Monit offers a custom mail formatting API, so you can make one that makes sense for you. The pieces they give you are a bit limiting, but I worked out a format that is very short and clear and should even work okay as an SMS:

set mail-format {
from: monit@yourserver.org
subject: [$ACTION] $EVENT on $SERVICE
message: $DESCRIPTION
– – – – – – – – – – – – – – – – – –
Action: [$ACTION] at $DATE from $HOST

–monit
}

Which sends you emails like:

[exec] Resource limit matched for server2.globalvoicesonline.org
‘server.yourdomain.org’ mem usage of 71.9% matches resource limit [mem usage>65.0%]
– – – – – – – – – – – – – – – – – –
Action: [exec] at Fri, 28 Nov 2008 19:35:22 -0500 from server.yourdomain.org

–monit

I think that’s a lot better than the default. If any Monit users out there have a good format, I’d love to see what else you’ve come up with.
