RAID
The HPC has two RAID setups:
- Linux kernel software RAID
- 3ware hardware RAID
Drive numbering
If you're looking at the front of the HPC you'll see four rows of drives. From the bottom:
- Rows 0 - 2 are SATA, connected to the hardware 3ware RAID card
- Row 3 is IDE
Software RAID
The Linux kernel has the md (multiple devices) driver for software RAID. Two 80 GB IDE hard drives are currently connected to the server, /dev/hda and /dev/hdd. They were set up as five RAID devices (/dev/md0 to /dev/md4) during the Rocks/CentOS install.
Here is information on their configuration:
# mount | grep md
/dev/md0 on / type ext3 (rw)
/dev/md3 on /boot type ext3 (rw)
/dev/md2 on /scratch type ext3 (rw)
/dev/md1 on /export type ext3 (rw)
# df -h | grep md
/dev/md0   29G   11G   17G  39% /
/dev/md3  190M   60M  121M  34% /boot
/dev/md2   35G  177M   33G   1% /scratch
/dev/md1   25G  5.5G   18G  24% /export
Note that /dev/md4 is used as swap:
# swapon -s
Filename   Type       Size     Used  Priority
/dev/md4   partition  2168632  0     -1
A snapshot of the software RAID's health:
# cat /proc/mdstat
Personalities : [raid1] [raid0]
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]
md1 : active raid1 hdd3[1] hda3[0]
      26627648 blocks [2/2] [UU]
md2 : active raid0 hdd5[1] hda5[0]
      36868608 blocks 256k chunks
md4 : active raid1 hdd6[1] hda6[0]
      2168640 blocks [2/2] [UU]
md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]
unused devices: <none>
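In the mdstat status field, U marks a working member and _ marks a missing or failed one, so [UU] means both halves of a mirror are up. A minimal sketch of a check for degraded arrays follows; the function name degraded_md_arrays is made up for illustration, and it reads mdstat-formatted text on stdin so it can be tried on saved output first.

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [UU]) contains "_",
# which marks a missing or failed member.
# degraded_md_arrays is a hypothetical helper name, not a standard tool.
degraded_md_arrays() {
    awk '
        /^md[0-9]+ :/ { array = $1 }         # remember the current array name
        /blocks/ && $NF ~ /^\[[U_]+\]$/ {    # status line ends in [UU], [U_], ...
            if ($NF ~ /_/) print array
        }
    '
}

# Usage on a live system:
#   degraded_md_arrays < /proc/mdstat
```

The raid0 array (md2) has no [UU] field at all, so this check never reports it; a failed raid0 member has to be spotted some other way.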
Repair RAID
The device names below are examples; on this server the array members are partitions of /dev/hda and /dev/hdd.
Setting a disk faulty/failed:
# mdadm --fail /dev/md0 /dev/hdc1
DO NOT ever run this on a raid0 or linear device, or your data is toast! (On this server /dev/md2 is raid0.)
Removing a faulty disk from an array:
# mdadm --remove /dev/md0 /dev/hdc1
Clearing any previous RAID info on a disk (e.g. when reusing a disk from a decommissioned RAID array):
# mdadm --zero-superblock /dev/hdc1
Adding a disk to an array:
# mdadm --add /dev/md0 /dev/hdc1
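Strung together, the steps above could look like the following dry-run sketch. It only prints the commands, and it refuses any array that is not a redundant level, which guards against the raid0 warning above. The helper name md_replace_plan and its third argument (an alternative mdstat file, for testing) are assumptions made for illustration. It uses the array-first form (mdadm /dev/mdX --fail ...) documented in the mdadm man page.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands for swapping one member of an array.
# md_replace_plan is a hypothetical helper, not part of mdadm.
md_replace_plan() {
    array=$1                     # e.g. md0
    part=$2                      # e.g. hdc1
    mdstat=${3:-/proc/mdstat}    # overridable so the sketch can be tested

    # In "md0 : active raid1 ..." the fourth field is the RAID level.
    level=$(awk -v a="$array" '$1 == a && $2 == ":" { print $4 }' "$mdstat")
    case $level in
        raid1|raid4|raid5|raid6|raid10) ;;   # redundant levels only
        "") echo "unknown array: $array" >&2; return 1 ;;
        *)  echo "refusing: $array is '$level', not a redundant level" >&2; return 1 ;;
    esac

    echo "mdadm /dev/$array --fail /dev/$part"
    echo "mdadm /dev/$array --remove /dev/$part"
    echo "# swap the physical disk and partition it to match, then"
    echo "# (only if the disk came from another array):"
    echo "mdadm --zero-superblock /dev/$part"
    echo "mdadm /dev/$array --add /dev/$part"
}

# Usage:
#   md_replace_plan md0 hdc1
```

Printing the plan first leaves a chance to check the device names before anything is marked as failed.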
To Do list:
- Prepare written instructions on how to repair disk arrays.
- What disks do we have?
- Add extra spare disks?
- How do you know which physical disk is broken, so you can replace it?
Hardware RAID
The hardware RAID is a 3ware 9500S SATA RAID card, driven by the 3w-9xxx kernel module. It has 12 channels. All arrays on the hardware RAID are configured as RAID5.
Physical Disk Layout
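We have one RAID controller, 'c1'. Disks are plugged into ports 'p0' - 'p11' and grouped into units 'u0' - 'u2', one unit per row of four ports. Unit numbers do not necessarily follow row numbers: in the tw_cli output further down, ports p4 - p7 belong to u2 and p8 - p11 to u1.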
Port 8 | Port 9 | Port 10 | Port 11 |
Port 4 | Port 5 | Port 6 | Port 7 |
Port 0 | Port 1 | Port 2 | Port 3 |
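With four ports per row and rows counted from the bottom, the bay for a given port is simple arithmetic: row = port / 4, column = port % 4. The sketch below assumes the table above is drawn as seen from the front, with column 0 on the left; port_to_bay is a made-up helper name.

```shell
#!/bin/sh
# Map a 3ware port (0-11, or p0-p11 as tw_cli prints it) to its drive bay.
# Assumes 4 bays per row, rows counted from the bottom, columns from the left.
# port_to_bay is a hypothetical helper name.
port_to_bay() {
    port=${1#p}                       # accept "p5" as well as "5"
    case $port in
        ''|*[!0-9]*) echo "not a port number: $1" >&2; return 1 ;;
    esac
    if [ "$port" -gt 11 ]; then
        echo "port out of range: $1" >&2
        return 1
    fi
    echo "row $((port / 4)), column $((port % 4))"
}

# Usage:
#   port_to_bay p5
```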
Repairing 'degraded' arrays
There is a utility, tw_cli, which can be used to control/monitor the hardware raid controller.
Study the output of show to find out which controller to manage, then use /c1 show to display the status of that particular controller. Things to look for:
- Which controller is active? (c0, c1, etc)
- Which unit is degraded? (u0, u1, u2, etc)
- Which port is inactive or missing? (p1, p5, etc)
<note warning>The controller supports hot swapping, but you must remove a faulty drive through the tw_cli tool before you can swap drives.</note>
Remove the faulty port:
maint remove c1 p5
Insert a new drive and rescan:
maint rescan
Rebuild the degraded array:
maint rebuild c1 u2 p5
You can check the status of the rebuild by monitoring /c1 show, though I have a feeling this might disturb the rebuild process. In any case, you can follow the rebuild in the output of dmesg:
3w-9xxx: scsi1: AEN: INFO (0x04:0x000B): Rebuild started:unit=2.
3w-9xxx: scsi1: AEN: INFO (0x04:0x0005): Background rebuild done:unit=2.
This is bad news (a background verify turned up a degraded unit):
3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Background verify started:unit=0.
3w-9xxx: scsi1: AEN: INFO (0x04:0x002B): Background verify done:unit=0.
3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=3
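Since routine events are logged as INFO, one rough way to spot trouble is to keep only the AEN lines from the 3w-9xxx driver that are not INFO. This is a sketch based on the message format shown here; aen_problems is a made-up helper name.

```shell
#!/bin/sh
# Keep only 3w-9xxx AEN lines that are not plain INFO messages.
# Reads kernel log text on stdin. aen_problems is a hypothetical helper name.
aen_problems() {
    grep '3w-9xxx:.*AEN:' | grep -v 'AEN: INFO'
}

# Usage on a live system:
#   dmesg | aen_problems
```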
$ sudo tw_cli
Password:
//hpc-ilri> /c1 show

Unit  UnitType  Status    %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED  -       -       64K     698.461   ON     OFF
u1    RAID-5    OK        -       -       64K     698.461   ON     OFF
u2    RAID-5    OK        -       -       64K     698.461   ON     OFF

Port  Status    Unit  Size       Blocks     Serial
---------------------------------------------------------------
p0    OK        u0    232.88 GB  488397168  WD-WMAEP2714804
p1    OK        u0    232.88 GB  488397168  WD-WMAEP1570106
p2    OK        u0    232.88 GB  488397168  WD-WMAEP2712887
p3    DEGRADED  u0    232.88 GB  488397168  WD-WMAEP2714418
p4    OK        u2    232.88 GB  488397168  WD-WCAT1C715001
p5    OK        u2    232.88 GB  488397168  WD-WMAEP2713449
p6    OK        u2    232.88 GB  488397168  WD-WMAEP2715070
p7    OK        u2    232.88 GB  488397168  WD-WMAEP2712590
p8    OK        u1    232.88 GB  488397168  WD-WMAEP2712574
p9    OK        u1    232.88 GB  488397168  WD-WMAEP2734142
p10   OK        u1    232.88 GB  488397168  WD-WMAEP2702155
p11   OK        u1    232.88 GB  488397168  WD-WMAEP2712472
Looks like another drive failed: port p3 in unit u0 is DEGRADED.
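Picking the failed port out of the port table and turning it into the remove / rescan / rebuild sequence can be sketched as below. Both helper names (bad_ports, rebuild_plan) are made up for illustration, and the sketch assumes the output of /c1 show has been saved to a file (here a hypothetical c1-show.txt). It only prints the tw_cli commands; it does not run them.

```shell
#!/bin/sh
# From "/cX show" output on stdin, print every port that is not OK,
# as "port unit status", e.g. "p3 u0 DEGRADED".
# bad_ports and rebuild_plan are hypothetical helper names.
bad_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $3, $2 }'
}

# Turn "port unit status" lines into the tw_cli commands used on this page.
# The controller name (e.g. c1) is the first argument.
rebuild_plan() {
    controller=$1
    while read -r port unit status; do
        echo "# $port in $unit is $status"
        echo "maint remove $controller $port"
        echo "# swap the drive, then:"
        echo "maint rescan"
        echo "maint rebuild $controller $unit $port"
    done
}

# Usage:
#   bad_ports < c1-show.txt | rebuild_plan c1
```

Compare the serial number in the port table with the label on the drive before pulling anything.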