Differences

This shows you the differences between two versions of the page.

--- raid [2009/08/27 09:14] – 172.26.0.166
+++ raid [2009/09/30 05:09] – 172.26.0.166
@@ Line 1: / Line 1: @@
-=== HPC RAID array ===
+===== RAID =====
-The storage on the HPC is using a RAID array. (Level?)
+We have two RAIDs on the HPC
+  * Linux kernel software RAID
+  * 3ware hardware RAID
+==== Drive numbering ====
-It is currently reporting a degraded array:
+If you're looking at the front of the HPC you'll see four rows of drives. From the bottom:
+  * Rows 0 - 2 are SATA, connected to the hardware 3ware RAID card
+  * Row 3 is IDE
-<code>#cat /proc/mdstat
+===== Software RAID =====
-Personalities : [raid0] [raid1]
+The Linux kernel has the ''md'' (multiple devices) driver for software RAID devices. There are currently two 80 GB IDE hard drives connected to the server, ''/dev/hda'' and ''/dev/hdd''. These were set up as five RAID devices during the install of Rocks/CentOS.
-md1 : active raid1 hda1[0]
- blocks [2/1] [U_]
+Here is information on their configuration:
+<code># mount | grep md
+/dev/md0 on / type ext3 (rw)
+/dev/md3 on /boot type ext3 (rw)
+/dev/md2 on /scratch type ext3 (rw)
+/dev/md1 on /export type ext3 (rw)
+# df -h | grep md
+/dev/md0               29G   11G   17G  39% /
+/dev/md3              190M   60M  121M  34% /boot
+/dev/md2               35G  177M   33G   1% /scratch
+/dev/md1               25G  5.5G   18G  24% /export</code>
+Note that ''/dev/md4'' is used as swap:
+<code># swapon -s
+Filename				Type		Size	Used	Priority
+/dev/md4                                partition	2168632	0	-1</code>
+A snapshot of the software RAID's health (note that ''md2'' is RAID 0, so it has no redundancy):
+<code># cat /proc/mdstat
+Personalities : [raid1] [raid0]
+md3 : active raid1 hdd1[1] hda1[0]
+ blocks [2/2] [UU]
-md3 : active raid1 hdc3[1] hda3[0]
+md1 : active raid1 hdd3[1] hda3[0]
-      2097024 blocks [2/2] [UU]
+      26627648 blocks [2/2] [UU]
-md2 : active raid1 hdc5[1]
+md2 : active raid0 hdd5[1] hda5[0]
-      65437696 blocks [2/1] [_U]
+      36868608 blocks 256k chunks
-md0 : active raid0 hdc2[1] hda2[0]
+md4 : active raid1 hdd6[1] hda6[0]
-      20971008 blocks 256k chunks
+      2168640 blocks [2/2] [UU]
-unused devices: <none><code>#
+md0 : active raid1 hdd2[1] hda2[0]
+      30716160 blocks [2/2] [UU]
+unused devices: <none></code>
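A degraded mirror shows up in ''/proc/mdstat'' as an underscore in the member-status field, for example ''[U_]'' instead of ''[UU]''. The following is a minimal sketch of a check for that, not a tested monitoring script; the function name ''degraded_md_arrays'' is made up for this page.

```shell
# List md arrays whose member-status field (e.g. [U_]) shows a missing disk.
# Reads /proc/mdstat-style text on stdin and prints one array name per line.
# RAID 0 arrays have no such field, so they are never reported here.
degraded_md_arrays() {
  awk '
    # Remember the array named on an "mdN : active ..." line.
    /^md[0-9]+ :/ { name = $1 }
    # A status group containing "_" means at least one member is missing.
    /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
  '
}
```

Usage would be ''degraded_md_arrays < /proc/mdstat''. No output means every mirror has all of its members.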
+=== To Do list: ===
+Prepare written instructions on how to repair disk arrays.
+What disks do we have?
+Add extra spare disks?
+How do you know which physical disk is broken to replace it?
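As a starting point for the repair instructions, here is a rough sketch that only prints the commands for swapping one member of a RAID 1 array. It assumes ''mdadm'' and ''sfdisk'' are installed (neither is mentioned elsewhere on this page), that the replacement disk is at least as large as the old one, and that the IDE disks cannot be hot-swapped. The function name ''md_replace_plan'' is hypothetical. It does not apply to ''md2'', which is RAID 0 and cannot be rebuilt.

```shell
# Print (do not run) the commands for swapping one RAID 1 member.
# $1 = array (e.g. md1), $2 = failed partition (e.g. hdd3),
# $3 = the healthy disk whose partition table is copied (e.g. hda)
md_replace_plan() {
  array=$1
  part=$2
  good=$3
  # Strip the trailing partition number: hdd3 -> hdd
  disk=$(printf '%s' "$part" | sed 's/[0-9]*$//')
  echo "mdadm /dev/$array --fail /dev/$part"
  echo "mdadm /dev/$array --remove /dev/$part"
  echo "# power down, swap the physical disk, boot again"
  echo "sfdisk -d /dev/$good | sfdisk /dev/$disk"
  echo "mdadm /dev/$array --add /dev/$part"
  echo "cat /proc/mdstat"
}
```

For example, ''md_replace_plan md1 hdd3 hda'' prints the plan for the ''/export'' mirror. Because one physical disk holds a member of every array, the fail, remove and add steps would need repeating for each partition on that disk.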
+===== Hardware RAID =====
+There is a utility, ''tw_cli'', which can be used to control the hardware RAID. The hardware RAID has three arrays, all RAID 5. Each "unit" (row) is one array. The ports are laid out as follows when viewed from the front:
+| 8 | 9 | 10 | 11 |
+| 4 | 5 | 6 | 7 |
+| 0 | 1 | 2 | 3 |
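The table above implies simple arithmetic for finding a drive bay from a port number: with four bays per row, the row is the port divided by four and the column is the remainder. A small sketch, assuming rows are counted from the bottom (as in the drive numbering section) and columns from the left starting at 0 (an assumption read off the table); ''port_to_bay'' is a made-up name.

```shell
# Map a 3ware port number to its drive bay.
# Assumes 4 bays per row, rows counted from the bottom,
# columns counted from the left starting at 0.
port_to_bay() {
  port=${1#p}    # accept "p5" or "5"
  echo "row $((port / 4)), column $((port % 4))"
}
```

For example, ''port_to_bay p5'' reports row 1, column 1, which is the second bay in the middle SATA row.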
+Study the output of ''show'' to know which controller to manage.  Then you can use ''/c1 show'' to show the status of that particular controller.  Things to look for:
+  * Which controller is active? (c0, c1, etc)
+  * Which unit is degraded? (u0, u1, u2, etc)
+  * Which port is degraded? (p0, p1, p2, etc)
+Remove the faulty port:
+<code>maint remove c1 p5</code>
+Insert a new drive and rescan:
+<code>maint rescan</code>
+Rebuild the degraded array:
+<code>maint rebuild c1 u2 p5</code>
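The three steps above use ''c1'', ''u2'' and ''p5'' as the example controller, unit and port. A small sketch that prints the same ''maint'' commands for other values, so they can be checked before being typed into ''tw_cli''. It only prints text and runs nothing; the name ''tw_repair_plan'' is hypothetical.

```shell
# Print the tw_cli maintenance commands for replacing one drive.
# $1 = controller (e.g. c1), $2 = unit (e.g. u2), $3 = port (e.g. p5)
tw_repair_plan() {
  # Reject arguments that do not look like cN, uN and pN.
  case "$1" in c[0-9]*) ;; *) echo "bad controller: $1" >&2; return 1 ;; esac
  case "$2" in u[0-9]*) ;; *) echo "bad unit: $2" >&2; return 1 ;; esac
  case "$3" in p[0-9]*) ;; *) echo "bad port: $3" >&2; return 1 ;; esac
  echo "maint remove $1 $3"
  echo "maint rescan"
  echo "maint rebuild $1 $2 $3"
}
```

Running ''tw_repair_plan c1 u2 p5'' reproduces the commands shown above.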
+You can check the status of the rebuild with ''/c1 show'', though I suspect this might disturb the rebuild process. In any case, you can also follow the progress in the output of ''dmesg'':
+<file>3w-9xxx: scsi1: AEN: INFO (0x04:0x000B): Rebuild started:unit=2.
+3w-9xxx: scsi1: AEN: INFO (0x04:0x0005): Background rebuild done:unit=2.</file>
+This is what a newly failed drive looks like in ''dmesg'':
+<file>3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Background verify started:unit=0.
+3w-9xxx: scsi1: AEN: INFO (0x04:0x002B): Background verify done:unit=0.
+3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=3</file>
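The degraded-unit message carries the unit and port numbers needed for the repair commands. A minimal sketch that pulls them out of kernel log text, assuming the message format is exactly as in the excerpt above; ''tw_degraded_from_dmesg'' is a made-up name. The controller is not part of the message, so it still has to be read from ''tw_cli''.

```shell
# Print "uN pM" for each 3w-9xxx AEN ERROR line that reports a degraded unit.
# Reads kernel log text on stdin; all other lines are ignored.
tw_degraded_from_dmesg() {
  sed -n 's/.*AEN: ERROR .*Degraded unit detected:unit=\([0-9]*\), port=\([0-9]*\).*/u\1 p\2/p'
}
```

Usage would be ''dmesg | tw_degraded_from_dmesg''.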
+<code>$ sudo tw_cli
+Password:
+//hpc-ilri> /c1 show
+Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
+------------------------------------------------------------------------------
+u0    RAID-5    DEGRADED       -       -       64K     698.461   ON     OFF
+u1    RAID-5    OK             -       -       64K     698.461   ON     OFF
+u2    RAID-5    OK             -       -       64K     698.461   ON     OFF
+Port   Status           Unit   Size        Blocks        Serial
+---------------------------------------------------------------
+p0     OK               u0     232.88 GB   488397168     WD-WMAEP2714804
+p1     OK               u0     232.88 GB   488397168     WD-WMAEP1570106
+p2     OK               u0     232.88 GB   488397168     WD-WMAEP2712887
+p3     DEGRADED         u0     232.88 GB   488397168     WD-WMAEP2714418
+p4     OK               u2     232.88 GB   488397168     WD-WCAT1C715001
+p5     OK               u2     232.88 GB   488397168     WD-WMAEP2713449
+p6     OK               u2     232.88 GB   488397168     WD-WMAEP2715070
+p7     OK               u2     232.88 GB   488397168     WD-WMAEP2712590
+p8     OK               u1     232.88 GB   488397168     WD-WMAEP2712574
+p9     OK               u1     232.88 GB   488397168     WD-WMAEP2734142
+p10    OK               u1     232.88 GB   488397168     WD-WMAEP2702155
+p11    OK               u1     232.88 GB   488397168     WD-WMAEP2712472  </code>
+Looks like another drive failed: port ''p3'' in unit ''u0'' is degraded.
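Picking the degraded rows out of that table by eye is error-prone, so here is a minimal sketch that filters them. It assumes the column layout shown above (status in the third column for units, second column for ports) and that ''tw_cli'' accepts a command on its command line, which has not been verified on this machine; ''tw_not_ok'' is a hypothetical name.

```shell
# Print every unit (uN) or port (pN) whose Status column is not OK.
# Reads "/cX show" output on stdin.
tw_not_ok() {
  awk '$1 ~ /^[up][0-9]+$/ {
    # Units carry the status in column 3, ports in column 2.
    s = ($1 ~ /^u/) ? $3 : $2
    if (s != "OK") print $1, s
  }'
}
```

Usage would be something like ''sudo tw_cli /c1 show | tw_not_ok''. For the output above it would report ''u0 DEGRADED'' and ''p3 DEGRADED''.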