Houston, is there a problem?
Recently, on a listserv I frequent, I heard a complaint from a sysadmin that "yet another software RAID mirror" had failed. The fellow complaining emphasized that on the OS X Server boxes he managed, disks would fail and he wouldn't know until he checked in on them. He wondered whether any hardware RAID cards offered notification services, so that he wouldn't have to rely on the built-in RAID capabilities of OS X.
While an Xserve RAID and other external hardware RAID subsystems can notify you when a hard disk fails, Disk Utility and OS X have no provision for notifying you when a mirror set has a bad disk. A mirror is two hard drives acting as one, so that if one fails, you lose no data. You can set up a mirror via the diskutil command or in the Disk Utility application. PCI IDE RAID controllers don't offer notification the way external subsystems do, as the PCI bus provides no means of communicating the status of the drives or RAID sets. After all, a phone line or Ethernet cable is required for the little computer that lives in the RAID chassis, so it can get your attention when it's hurting.
Fortunately, I had a solution at my fingertips.
I had run across this very same problem myself a couple of years ago, so I hammered out my own little software solution: a shell script that runs as a cron job every hour. It simply checks the status of the RAID to make sure that both disks report an "OK" status. Take a look at the output of the following command:
host2:~ mostadmin$ diskutil list
/dev/disk0
#: type name size identifier
0: Apple_partition_scheme *34.2 GB disk0
1: Apple_partition_map 31.5 KB disk0s1
2: Apple_Driver43 27.0 KB disk0s2
3: Apple_Driver43 37.0 KB disk0s3
4: Apple_Driver_IOKit 256.0 KB disk0s4
5: Apple_Patches 256.0 KB disk0s5
6: Apple_HFS images 34.2 GB disk0s6
/dev/disk1
#: type name size identifier
0: Apple_partition_scheme *17.0 GB disk1
1: Apple_partition_map 31.5 KB disk1s1
2: Apple_Driver_OpenFirmware 512.0 KB disk1s2
3: Apple_Boot_RAID 17.0 GB disk1s3
/dev/disk2
#: type name size identifier
0: Apple_partition_scheme *17.0 GB disk2
1: Apple_partition_map 31.5 KB disk2s1
2: Apple_Driver_OpenFirmware 512.0 KB disk2s2
3: Apple_Boot_RAID 17.0 GB disk2s3
/dev/disk3
#: type name size identifier
0: boot *17.0 GB disk3
host2:~ mostadmin$
The diskutil list command fetches the identifiers of all the hard disks. In this case, the mirror set we're concerned with is named "boot" (identifier disk3), which is the boot drive of this OS X Server.
Once you know the identifier of your RAID set, you can use the diskutil command again to check its status. What we're looking for here is "OK." If there are two OKs, everything's hunky-dory. Fewer than two, and you've got a problem. If you've got zero, your server's a paperweight and it won't be spamming you.
host2:~ mostadmin$ diskutil checkRAID disk3
RAID SETS
---------
Name: boot
Unique ID: bootf5b49d82471e11d98b06003065be09be
Type: Mirror
Status: Running
Device Node: disk3
-------------------------------------------------------------
# Device Node Status
-------------------------------------------------------------
0 disk1 OK
1 disk2 OK
-------------------------------------------------------------
host2:~ mostadmin$
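To see how the counting works without touching a live disk, here is a minimal sketch that runs a grep count against a few lines copied from the output above. The helper name count_ok and the sample text are illustrative only, and the -w flag is an addition of mine so that a set name such as "BOOKS" would not be counted as an OK.

```shell
#!/bin/sh
# count_ok is a hypothetical helper: it reads checkRAID-style output on stdin
# and prints how many lines contain the whole word "OK".
count_ok() {
    grep -c -w OK
}

# Sample lines copied from the checkRAID output above (an assumption: your
# version of diskutil may format its output differently).
sample='Status: Running
0 disk1 OK
1 disk2 OK'

healthy=$(printf '%s\n' "$sample" | count_ok)
echo "$healthy"
```

On a real server you would feed it live output instead, for example: diskutil checkRAID disk3 | count_ok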
Now, for a nice little script I wrote to run as a cron job:
#!/bin/sh
## This script is designed to get the status of a mirror set and send an email notice on fail
## to test, change the "good=2" below to "good=1" and you should receive a warning email
## Dean Shavit, MOST Training & Consulting dean@macworkshops.com http://www.macworkshops.com
## This is for servers using software RAID mirror sets only
##
## Step 1: Define a variable for a functional raid by counting the number of good disks
status=`diskutil checkraid disk3|grep -c OK`
## Step 2: Define a number for comparison against a failed raid
good=2
## Step 3: Define the warning message for the body of the email
warning="houston we have a problem!"
## Step 4: Define a variable for the computer name
box=`/usr/sbin/scutil --get ComputerName`
## Step 5: Define a variable - email address of person to notify
admin="dean@macworkshops.com"
## Step 6: compare current status with good status, if a match, echo, if not, notify
if [ "$status" -eq "$good" ]; then
    echo "$status"
else
    ## pipe the message straight into mail; no temp file needed
    echo "From: $box- $warning!!!" | mail -s "RAID Alert Report" "$admin"
fi
This script will fire off an email if the number of OKs returned by diskutil checkRAID is any number other than 2.
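Since the script is meant to run from cron every hour, the schedule could look something like the sketch below. The path /usr/local/bin/raidcheck.sh is a hypothetical install location, and the entry assumes a per-user crontab (the kind edited with crontab -e), which has no user field.

```shell
#!/bin/sh
# hourly_entry is a hypothetical helper that prints a crontab line running
# the given script at minute 0 of every hour.
hourly_entry() {
    printf '0 * * * * %s\n' "$1"
}

# /usr/local/bin/raidcheck.sh is an assumed location for the script above.
entry=$(hourly_entry /usr/local/bin/raidcheck.sh)
echo "$entry"

# To install it, append the line to root's existing crontab (run as root):
#   (crontab -l 2>/dev/null; echo "$entry") | crontab -
```

The install command is left as a comment so that running the sketch only prints the entry and never changes your crontab.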
Other Concerns
If you're running Jaguar Server, you'll need to make sure that the mail service is started. Panther and Panther Server use the postfix-watch process to fire off the email, regardless of whether the email services are running.
Also, make sure that the hostname and/or DNS name of the server is set correctly. The email's "from" address should be root@fqdn (fully qualified domain name), and the message should leave through a router or IP address whose PTR (reverse lookup) record matches the hostname of the server. Otherwise, some email servers (maybe yours) might not accept the message.
Here's the email the script generates (this is sent from my PowerBook, which has neither a RAID nor a proper FQDN):
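One way to sanity-check the reverse lookup is to compare the name in the PTR record with the server's own name. The sketch below only does the comparison. The helper name same_host and the example.com values are placeholders, and you would get the real PTR name yourself, for instance by running the host command against the server's public address.

```shell
#!/bin/sh
# same_host is a hypothetical helper: it compares two DNS names, ignoring
# case and a single trailing dot, and succeeds when they match.
same_host() {
    a=$(printf '%s' "$1" | tr 'A-Z' 'a-z' | sed 's/\.$//')
    b=$(printf '%s' "$2" | tr 'A-Z' 'a-z' | sed 's/\.$//')
    [ "$a" = "$b" ]
}

# Placeholder values: substitute your server's name and the name returned
# by a reverse lookup of its public address (e.g. "host 192.0.2.10").
if same_host "server.example.com" "Server.Example.COM."; then
    result="match"
else
    result="mismatch"
fi
echo "$result"
```

Reverse lookups usually print the name with a trailing dot, which is why the helper strips it before comparing.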
Received: from ASSP-nospam ([192.168.0.84]) BY 192.168.0.84 ([192.168.0.84])
WITH ESMTP (4D WebSTAR V Mail (5.3.4)); Tue, 01 Feb 2005 01:05:38 -0600
Received: from 68.165.43.42 ([68.165.43.42] helo=minime.local) by ASSP-nospam ; 1 Feb 05 07:05:38 -0000
Received: by minime.local (Postfix, from userid 0)
id D56DB225D6F; Tue, 1 Feb 2005 01:09:47 -0600 (CST)
To: dean@macworkshops.com
Subject: RAID Alert Report
Message-Id: <20050201070947.D56DB225D6F@minime.local>
Date: Tue, 1 Feb 2005 01:09:47 -0600 (CST)
From: root@minime.local (System Administrator)
X-Assp-Spam-Prob: 0.00000
X-UID-FLAGS: r00001665-00000000000000000000000000000000
From: minime- houston we have a problem!!!!
I'm in the process of writing my March column for MacTech Magazine which will feature this solution and many other free solutions for monitoring and repairing hard disks from the command line. If you don't have a subscription yet, get one! Better yet, consider attending one of our workshops where you can learn to roll your own solutions like this script and much, much more. Until then, I hope I don't see you in Houston!
Dean Shavit is an ACSA (Apple Certified System Administrator) who leads training sessions for MOST (Mac OS Training & Consulting). If you have questions or feedback you can contact him at dean@macworkshops.com