====== RAID ======
==== Drive numbering ====
If you're looking at the front of the HPC you'll see four rows of drives.
  * Rows 0 - 2 are SATA, connected to the hardware 3ware RAID card (Linux sees these as ''/dev/sda'' - ''/dev/sdc'', one device per RAID unit)
  * Row 3 is IDE, used by the Linux software RAID (''/dev/hda'' and ''/dev/hdd'')
<code>
/dev/md3 on /boot type ext3 (rw)
/dev/md2 on /scratch type ext3 (rw)
/dev/md1 on /export type ext3 (rw)
# df -h | grep md
...
</code>
It should be noted that ''/scratch'' lives on a RAID0 array, so it has no redundancy at all.
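To see exactly which partitions back a given array, and their state, you can query ''mdadm'' directly (''/dev/md1'' below is just an example device):
<code>
# list all arrays, then show detail for one of them
mdadm --detail --scan
mdadm --detail /dev/md1
</code>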
A snapshot of the software RAID's health:
<code>
Personalities : [raid1] [raid0]
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]

md1 : active raid1 hdd3[1] hda3[0]
      26627648 blocks [2/2] [UU]

md2 : active raid0 hdd5[1] hda5[0]
      36868608 blocks 256k chunks

md4 : active raid1 hdd6[1] hda6[0]
      2168640 blocks [2/2] [UU]

md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]

unused devices: <none>
</code>
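For a scripted version of this check, ''mdadm --detail --test'' sets its exit status according to the array's health (0 means OK), so a small loop can flag trouble (a sketch; adjust the device list to match the machine):
<code>
#!/bin/sh
# complain about any md array that is not healthy
for md in /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4; do
    mdadm --detail --test "$md" > /dev/null || echo "$md needs attention"
done
</code>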
==== Repair RAID ====
When a disk is failing you need to replace it. A failing member shows up in ''/proc/mdstat'' with an ''(F)'' flag after its device name, and the array status drops from ''[UU]'' to ''[U_]''. Check ''/proc/mdstat'' to identify the failing drive:
<code>
Personalities : [raid1] [raid0]
...
unused devices: <none>
</code>
If ''/dev/hda'' is the failing drive, mark its partitions as failed and remove them from the RAID1 arrays:
<code>
# mdadm /dev/md1 --fail /dev/hda3 --remove /dev/hda3
# mdadm /dev/md3 --fail /dev/hda1 --remove /dev/hda1
# mdadm /dev/md4 --fail /dev/hda6 --remove /dev/hda6
</code>
+ | ''/ | ||
+ | < | ||
+ | # mdadm --stop / | ||
<note warning>You must shut down the server before you physically remove the drive!</note>
Shut the server down and replace the faulty drive with a new one. After booting, the drive letters may have shifted around, so verify which disk is which before proceeding, as in the check below.
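One way to tell the disks apart (a sketch, assuming smartmontools is installed; the device names match this machine's IDE drives):
<code>
# compare the serial number printed on each drive's label with what the kernel sees
smartctl -i /dev/hda | grep -i serial
smartctl -i /dev/hdd | grep -i serial
</code>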
Clone the partition table from the good drive to the new one:
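On these msdos-labelled disks ''sfdisk'' does the job (a sketch, assuming ''/dev/hda'' is the surviving drive and ''/dev/hdd'' the replacement, matching the ''--add'' commands below):
<code>
# dump hda's partition table and write it onto hdd
sfdisk -d /dev/hda | sfdisk /dev/hdd
</code>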
Verify the new partitions can be seen:
<code>
# partprobe -s
/dev/hda: msdos partitions 1 2 3 4 <5 6>
/dev/hdd: msdos partitions 1 2 3 4 <5 6>
/dev/sda: msdos partitions 1
/dev/sdb: msdos partitions 1
/dev/sdc: msdos partitions 1
</code>
Re-create the scratch partition (RAID0):
<code>
# mdadm --create ...
# mkfs.ext3 /dev/md2
# mount /dev/md2 /scratch
</code>
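Given the layout in ''/proc/mdstat'' above (a two-disk RAID0 with 256k chunks on ''hda5'' and ''hdd5''), the full ''--create'' command would look roughly like this (a sketch, not a verbatim record):
<code>
# recreate md2 as a stripe across both drives
mdadm --create /dev/md2 --level=0 --chunk=256 --raid-devices=2 /dev/hda5 /dev/hdd5
</code>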
You can now add the new partitions back to the RAID1 arrays:
<code>
# mdadm /dev/md1 --add /dev/hdd3
# mdadm /dev/md3 --add /dev/hdd1
# mdadm /dev/md4 --add /dev/hdd6
</code>
After adding them you can monitor the progress of the RAID rebuilds by looking in ''/proc/mdstat'':
<code>
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]

md1 : active raid1 hdd3[2] hda3[0]
      26627648 blocks [2/1] [U_]
      [===================> ...

md2 : inactive hda5[0]
      18434304 blocks

md4 : active raid1 hdd6[2] hda6[0]
      2168640 blocks [2/1] [U_]
        resync=DELAYED

md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]

unused devices: <none>
</code>
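To follow the rebuild without re-running the command by hand (assuming ''watch'' is available, as on most Linux distributions):
<code>
# refresh the RAID status every 5 seconds
watch -n 5 cat /proc/mdstat
</code>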
===== Hardware RAID =====
A 3ware 9500S SATA RAID card using the 3w-9xxx kernel module.

==== Physical Disk Layout ====

We have one RAID controller, ''c0''. It hosts three arrays ("units" in 3ware-speak), all RAID-5. Looking at the front of the machine, the drive bays map to controller ports like this:

| Port 8 | Port 9 | Port 10 | Port 11 |
| Port 4 | Port 5 | Port 6 | Port 7 |
| Port 0 | Port 1 | Port 2 | Port 3 |

==== Repairing a degraded unit ====

There is a utility, ''tw_cli'', which can be used to control/query the hardware RAID.
Study the output of ''tw_cli'' (see the example below) and determine:
  * Which controller is active? (c0, c1, etc)
  * Which unit is degraded? (u0, u1, u2, etc)
  * Which port is inactive or missing? (p1, p5, etc)
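With 3ware's CLI the overview looks like this (''info'' with no arguments lists the controllers; ''info c0'' shows units and ports on controller 0):
<code>
# list controllers, then show unit/port status on controller 0
tw_cli info
tw_cli info c0
</code>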
+ | |||
+ | <note warning> | ||
Remove the faulty port:
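A sketch using the ''maint'' syntax of 3ware's CLI; ''c0'' and ''p5'' are placeholders for the controller and port you identified above:
<code>
# take the failed port offline so the drive can be pulled
tw_cli maint remove c0 p5
</code>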
Rebuild the degraded array:
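Again with placeholder names (''u1'' being the degraded unit); rescan first so the controller notices the new drive:
<code>
# detect the replacement drive, then rebuild the degraded unit onto it
tw_cli maint rescan c0
tw_cli maint rebuild c0 u1 p5
</code>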
Check the status of the rebuild by monitoring ''/var/log/messages'':
<code>
3w-9xxx: scsi1: AEN: INFO (0x04: ...
</code>
This sucks:
<code>
3w-9xxx: scsi1: AEN: INFO (0x04: ...
3w-9xxx: scsi1: AEN: ERROR (0x04: ...
</code>
<code>
Password:
...

Unit  UnitType  Status  ...
------------------------------------------------------------------------------
u0    RAID-5    ...
u1    RAID-5    ...
u2    RAID-5    ...

Port  Status  ...
---------------------------------------------------------------
p0    ...
p1    ...
p2    ...
p3    ...
p4    ...
p5    ...
p6    ...
p7    ...
p8    ...
p9    ...
p10   OK      ...
p11   OK      ...
</code>
Looks like another drive failed.