  * Linux kernel software RAID
  * 3ware hardware RAID
==== Drive numbering ====

If you're looking at the front of the HPC you'll see four rows of drives.  From the bottom:
  * Three rows of SATA drives, connected to the hardware 3ware RAID card
  * One row of IDE drives
  
===== Software RAID =====
The Linux kernel has the ''md'' (multiple devices) driver for software RAID devices.  There are currently two 80 GB IDE hard drives connected to the server, ''/dev/hda'' and ''/dev/hdd''.  These were set up as five RAID devices during the install of Rocks/CentOS.
  
Here is information on their configuration:
  
<code># mount | grep md
/dev/md0 on / type ext3 (rw)
/dev/md3 on /boot type ext3 (rw)
/dev/md2 on /scratch type ext3 (rw)
/dev/md1 on /export type ext3 (rw)
# df -h | grep md
/dev/md0               29G   11G   17G  39% /
/dev/md3              190M   60M  121M  34% /boot
/dev/md2               35G  177M   33G   1% /scratch
/dev/md1               25G  5.5G   18G  24% /export</code>

It should be noted that ''/dev/md4'' is being used as swap:
<code># swapon -s
Filename                                Type            Size    Used    Priority
/dev/md4                                partition       2168632 0       -1</code>

A snapshot of the software RAID's health:

<code># cat /proc/mdstat 
Personalities : [raid1] [raid0] 
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]
      
md1 : active raid1 hdd3[1] hda3[0]
      26627648 blocks [2/2] [UU]
      
md2 : active raid0 hdd5[1] hda5[0]
      36868608 blocks 256k chunks
      
md4 : active raid1 hdd6[1] hda6[0]
      2168640 blocks [2/2] [UU]
      
md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]
      
unused devices: <none></code>
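
For more detail on any one of the arrays (state, UUID, member disks and their status), ''mdadm'' can query the md device directly.  A minimal example, using ''/dev/md0'' purely for illustration:

<code># mdadm --detail /dev/md0</code>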

==== Repair RAID ====

Setting a disk faulty/failed:

<code># mdadm --fail /dev/md0 /dev/hdc1</code>

DO NOT ever run this on a raid0 or linear device or your data is toast!

Removing a faulty disk from an array:

<code># mdadm --remove /dev/md0 /dev/hdc1</code>

Clearing any previous RAID info on a disk (eg. reusing a disk from another decommissioned RAID array):

<code># mdadm --zero-superblock /dev/hdc1</code>

Adding a disk to an array:

<code># mdadm --add /dev/md0 /dev/hdc1</code>
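
After adding a disk back in, the resync progress shows up in ''/proc/mdstat''.  A simple way to keep an eye on it (just a sketch, any refresh interval works):

<code># watch -n 5 cat /proc/mdstat</code>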
  
=== To Do list: ===
===== Hardware RAID =====
  
The hardware RAID is a 3ware 9500S SATA RAID card using the ''3w-9xxx'' kernel module.  It has 12 channels.  The HPC is configured to use RAID5 for all of its arrays on the hardware RAID.
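
One way to check that the module is actually loaded:

<code># lsmod | grep 3w</code>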

==== Physical Disk Layout ====

We have one RAID controller, ''c1''.  Disks are plugged into ports, ''p0'' - ''p11''.  The disks are then grouped into units (basically the rows), ''u0'' - ''u2''.

| Port 8 | Port 9 | Port 10 | Port 11 |
| Port 4 | Port 5 | Port 6 | Port 7 |
| Port 0 | Port 1 | Port 2 | Port 3 |

==== Repairing 'degraded' arrays ====

There is a utility, ''tw_cli'', which can be used to control/monitor the hardware RAID controller.

Study the output of ''show'' to know which controller to manage.  Then you can use ''/c1 show'' to show the status of that particular controller.  Things to look for:
  * Which controller is active? (c0, c1, etc)
  * Which unit is degraded? (u0, u1, u2, etc)
  * Which port is inactive or missing? (p1, p5, etc)

<note warning>The controller supports hot swapping but you **must** remove a faulty drive through the ''tw_cli'' tool before you can swap drives.</note>
  
Remove the faulty port:
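
Assuming the same legacy ''maint'' syntax as the rebuild command below (''c1''/''p5'' are just example controller and port numbers), this would look something like:

<code>maint remove c1 p5</code>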

Rebuild the degraded array:
<code>maint rebuild c1 u2 p5</code>

Check the status of the rebuild by monitoring ''/c1 show'', but I have a feeling this might disturb the rebuild process.  In any case, you can check the status by following the output of ''dmesg'':

<file>3w-9xxx: scsi1: AEN: INFO (0x04:0x000B): Rebuild started:unit=2.
3w-9xxx: scsi1: AEN: INFO (0x04:0x0005): Background rebuild done:unit=2.</file>
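
To pull just these messages out of the kernel log while a rebuild is running, something like this is enough:

<code># dmesg | grep 3w-9xxx</code>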

This sucks:

<file>3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Background verify started:unit=0.
3w-9xxx: scsi1: AEN: INFO (0x04:0x002B): Background verify done:unit=0.
3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=3</file>

<code>$ sudo tw_cli 
Password: 
//hpc-ilri> /c1 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    DEGRADED       -       -       64K     698.461   ON     OFF    
u1    RAID-5    OK             -       -       64K     698.461   ON     OFF    
u2    RAID-5    OK             -       -       64K     698.461   ON     OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     232.88 GB   488397168     WD-WMAEP2714804     
p1     OK               u0     232.88 GB   488397168     WD-WMAEP1570106     
p2     OK               u0     232.88 GB   488397168     WD-WMAEP2712887     
p3     DEGRADED         u0     232.88 GB   488397168     WD-WMAEP2714418     
p4     OK               u2     232.88 GB   488397168     WD-WCAT1C715001     
p5     OK               u2     232.88 GB   488397168     WD-WMAEP2713449     
p6     OK               u2     232.88 GB   488397168     WD-WMAEP2715070     
p7     OK               u2     232.88 GB   488397168     WD-WMAEP2712590     
p8     OK               u1     232.88 GB   488397168     WD-WMAEP2712574     
p9     OK               u1     232.88 GB   488397168     WD-WMAEP2734142     
p10    OK               u1     232.88 GB   488397168     WD-WMAEP2702155     
p11    OK               u1     232.88 GB   488397168     WD-WMAEP2712472  </code>

Looks like another drive failed.
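
When the replacement drive goes in, the same procedure as above should apply to this failure (a sketch, assuming the legacy ''maint'' syntax and that the new disk goes back into port 3): remove the degraded port, swap the drive, then rebuild unit 0 with it:

<code>maint remove c1 p3
maint rebuild c1 u0 p3</code>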