We have two RAIDs on the HPC:
  * Linux kernel software RAID
  * 3ware hardware RAID

==== Drive numbering ====
  
==== Repair RAID ====
When a disk is failing you need to replace the drive.  First, look at the RAID configuration to see which partitions are in use by which arrays.  For example:
<code># cat /proc/mdstat
Personalities : [raid1] [raid0]
unused devices: <none></code>
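Each array can also be queried directly with ''mdadm'' to see which member has failed (a generic check, not part of the original page):
<code># mdadm --detail /dev/md0</code>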
  
If ''/dev/hda'' is having problems, set all its RAID1 partitions as failed and remove them:
<code># mdadm /dev/md0 --fail /dev/hda2 --remove /dev/hda2
# mdadm /dev/md1 --fail /dev/hda3 --remove /dev/hda3
# mdadm /dev/md3 --fail /dev/hda1 --remove /dev/hda1
# mdadm /dev/md4 --fail /dev/hda6 --remove /dev/hda6</code>
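Afterwards ''/proc/mdstat'' should show each of those arrays running with only one member (a quick sanity check, not part of the original procedure):
<code># cat /proc/mdstat</code>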
''/dev/md2'' is a RAID0 stripe mounted as ''/scratch'', so we have to unmount it and then stop it (you can't remove volumes from a stripe):
<code># umount /dev/md2
# mdadm --stop /dev/md2</code>
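If the umount fails because ''/scratch'' is still busy, ''fuser'' can show which processes are holding it open (an extra troubleshooting step, not from the original page):
<code># fuser -vm /scratch</code>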
<note warning> You must shut down the server before you physically remove the drive! </note>
Shut the server down and replace the faulty drive with a new one.  After booting, the drive letters may have shifted around, so be sure to verify which disk is which before proceeding.
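If the replacement disk is blank it also needs the same partition layout as the surviving disk before it can rejoin the arrays. One common way to do this (an assumption here; the original page does not show the exact commands) is to copy the partition table from the surviving disk, assumed below to be ''/dev/hda'' after the reboot, onto the new disk ''/dev/hdd'', then confirm the kernel sees the new partitions:
<code># sfdisk -d /dev/hda | sfdisk /dev/hdd
# cat /proc/partitions</code>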
Re-create the scratch partition (RAID0):
<code># mdadm --create --verbose /dev/md2 --level=0 --raid-devices=2 /dev/hda5 /dev/hdd5
# mkfs.ext3 /dev/md2
# mount /dev/md2 /scratch</code>
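If array definitions are tracked in ''/etc/mdadm.conf'' (an assumption; the page does not show this file), the re-created ''/dev/md2'' should be recorded there as well so it assembles on boot. ''mdadm --detail --scan'' prints the matching ARRAY lines:
<code># mdadm --detail --scan</code>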
You can now add the new partitions back to the RAID1 arrays:
<code># mdadm /dev/md0 --add /dev/hdd2
# mdadm /dev/md1 --add /dev/hdd3
# mdadm /dev/md3 --add /dev/hdd1
# mdadm /dev/md4 --add /dev/hdd6</code>
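The mirrors will now resynchronize onto the new disk; you can follow the rebuild progress in ''/proc/mdstat'' (a generic mdadm tip, not specific to this page):
<code># watch cat /proc/mdstat</code>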
===== Hardware RAID =====
  
The hardware RAID is a 3ware 9500S-12 SATA RAID card using the ''3w-9xxx'' kernel module.  It has 12 channels, and the HPC is configured to use RAID5 for all of the arrays on this controller.
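You can confirm the driver is loaded by checking the kernel module list (a generic check, not from the original page):
<code>$ lsmod | grep 3w</code>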
  
==== Physical Disk Layout ====
  
We have one RAID controller, ''c1''.  Disks are plugged into ports ''p0'' - ''p11'', and the disks are then grouped into units (basically the rows), ''u0'' - ''u2''.
  
| Port 8 | Port 9 | Port 10 | Port 11 |
==== Repairing 'degraded' arrays ====
  
There is a utility, ''tw_cli'', which can be used to control/monitor the hardware RAID controller.
  
Study the output of ''show'' to know which controller to manage.  Then ''/c1 show'' displays the status of that particular controller.  Things to look for:
<file>3w-9xxx: scsi1: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=3</file>
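These AEN errors come from the kernel log, so a quick way to spot them (a generic check, not from the original page) is:
<code># dmesg | grep 3w-9xxx</code>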
  
<code>$ sudo tw_cli
Password:
//hpc-ilri> /c1 show