We have two RAIDs on the HPC:
  * Linux kernel software RAID
  * 3ware hardware RAID

==== Drive numbering ====
  
==== Repair RAID ====
When a disk is failing you need to replace the drive.  First, look at the RAID configuration to see which partitions are in use by which arrays.  For example:
<code># cat /proc/mdstat
Personalities : [raid1] [raid0]
unused devices: <none></code>
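Each array can also be queried directly with ''mdadm'' to see which member has failed (a generic check, not part of the original page):
<code># mdadm --detail /dev/md0</code>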
  
If ''/dev/hda'' is having problems, set all its RAID1 partitions as failed and remove them:
<code># mdadm /dev/md0 --fail /dev/hda2 --remove /dev/hda2
# mdadm /dev/md1 --fail /dev/hda3 --remove /dev/hda3
# mdadm /dev/md3 --fail /dev/hda1 --remove /dev/hda1
# mdadm /dev/md4 --fail /dev/hda6 --remove /dev/hda6</code>
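Afterwards ''/proc/mdstat'' should show each of those arrays running with only one member (a quick sanity check, not part of the original procedure):
<code># cat /proc/mdstat</code>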
''/dev/md2'' is a RAID0 stripe mounted as ''/scratch'', so we have to unmount it and then stop it (you can't remove volumes from a stripe):
<code># umount /dev/md2
# mdadm --stop /dev/md2</code>
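If the umount fails because ''/scratch'' is still busy, ''fuser'' can show which processes are holding it open (an extra troubleshooting step, not from the original page):
<code># fuser -vm /scratch</code>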
<note warning> You must shut down the server before you physically remove the drive! </note>
Shut the server down and replace the faulty drive with a new one.  After booting, the drive letters may have shifted around, so be sure to verify which disk is which before proceeding.
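If the replacement disk is blank it also needs the same partition layout as the surviving disk before it can rejoin the arrays. One common way to do this (an assumption here; the original page does not show the exact commands) is to copy the partition table from the surviving disk, assumed below to be ''/dev/hda'' after the reboot, onto the new disk ''/dev/hdd'', then confirm the kernel sees the new partitions:
<code># sfdisk -d /dev/hda | sfdisk /dev/hdd
# cat /proc/partitions</code>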
Re-create the scratch partition (RAID0):
<code># mdadm --create --verbose /dev/md2 --level=0 --raid-devices=2 /dev/hda5 /dev/hdd5
# mkfs.ext3 /dev/md2
# mount /dev/md2 /scratch</code>
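If array definitions are tracked in ''/etc/mdadm.conf'' (an assumption; the page does not show this file), the re-created ''/dev/md2'' should be recorded there as well so it assembles on boot. ''mdadm --detail --scan'' prints the matching ARRAY lines:
<code># mdadm --detail --scan</code>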
You can now add the new partitions back to the RAID1 arrays:
<code># mdadm /dev/md0 --add /dev/hdd2
# mdadm /dev/md1 --add /dev/hdd3
# mdadm /dev/md3 --add /dev/hdd1
# mdadm /dev/md4 --add /dev/hdd6</code>
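The mirrors will now resynchronize onto the new disk; you can follow the rebuild progress in ''/proc/mdstat'' (a generic mdadm tip, not specific to this page):
<code># watch cat /proc/mdstat</code>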
===== Hardware RAID =====
  
The hardware RAID is a 3ware 9500S-12 SATA RAID card using the ''3w-9xxx'' kernel module.  It has 12 channels, and the HPC is configured to use RAID5 for all of the arrays on this controller.
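You can confirm the driver is loaded by checking the kernel module list (a generic check, not from the original page):
<code>$ lsmod | grep 3w</code>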
  
==== Physical Disk Layout ====
  
We have one RAID controller, ''c1''.  Disks are plugged into ports ''p0'' - ''p11'', and the disks are then grouped into units (basically the rows), ''u0'' - ''u2''.
  
| Port 8 | Port 9 | Port 10 | Port 11 |
==== Repairing 'degraded' arrays ====
  
There is a utility, ''tw_cli'', which can be used to control/monitor the hardware RAID controller.
  
Study the output of ''show'' to know which controller to manage.  Then ''/c1 show'' displays the status of that particular controller.  Things to look for:
<file>3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=3</file>
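These AEN errors come from the kernel log, so a quick way to spot them (a generic check, not from the original page) is:
<code># dmesg | grep 3w-9xxx</code>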
  
<code>$ sudo tw_cli
Password:
//hpc-ilri> /c1 show