===== RAID =====

We have two RAIDs on the HPC:
  * Linux kernel software RAID
  * 3ware hardware RAID

==== Drive numbering ====

If you're looking at the front of the HPC you'll see four rows of drives.
  * Rows 0 - 2 are SATA, connected to the hardware 3ware RAID card
  * Row 3 is IDE, used by the Linux software RAID

===== Software RAID =====

The Linux kernel has the ''md'' (multiple devices) driver for software RAID; the arrays are managed with ''mdadm''.

Here is information on their configuration:
<code>
# mount | grep md
/dev/md0 on / type ext3 (rw)
/dev/md3 on /boot type ext3 (rw)
/dev/md2 on /scratch type ext3 (rw)
/dev/md1 on /export type ext3 (rw)
# df -h | grep md
/dev/md0              ...
/dev/md3              ...
/dev/md2              ...
/dev/md1              ...
</code>
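The block counts shown in ''/proc/mdstat'' are 1 KiB units, so an array's size can be sanity-checked against ''df'' with simple shell arithmetic. A minimal sketch, using the 30716160-block figure that ''md0'' reports:

```shell
# /proc/mdstat reports sizes in 1 KiB blocks; convert md0's count to GiB
blocks=30716160
echo "$(( blocks / 1024 / 1024 )) GiB"   # prints "29 GiB"
```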
It should be noted that one of the ''md'' arrays is used as swap:

<code>
# cat /proc/swaps
Filename                        Type            Size    Used    Priority
/dev/...                        partition       ...
</code>
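If you just want the device name, the first column of ''/proc/swaps'' (after the header) is all you need. A minimal sketch against hard-coded sample contents — the device name and sizes here are assumptions for illustration, not taken from the real machine:

```shell
# Sample /proc/swaps contents; /dev/md4 and the sizes are hypothetical
swaps='Filename                        Type            Size    Used    Priority
/dev/md4                        partition       2168632 0       -1'

# Skip the header line and print the device column
printf '%s\n' "$swaps" | awk 'NR > 1 { print $1 }'
```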
A snapshot of the software RAID's health:

<code>
# cat /proc/mdstat
Personalities : [raid1] [raid0]
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]

md1 : active raid1 hdd3[1] hda3[0]
      26627648 blocks [2/2] [UU]

md2 : active raid0 hdd5[1] hda5[0]
      36868608 blocks 256k chunks

md4 : active raid1 hdd6[1] hda6[0]
      2168640 blocks [2/2] [UU]

md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]

unused devices: <none>
</code>
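A healthy RAID1 shows ''[UU]'', while a degraded one shows an underscore for each missing member (e.g. ''[U_]''). A small sketch that flags degraded arrays, run against hard-coded sample ''mdstat'' lines (the failure shown is hypothetical):

```shell
# Sample mdstat stanzas: md1 has a failed member ([U_]), md0 is healthy
sample='md1 : active raid1 hdd3[1](F) hda3[0]
      26627648 blocks [2/1] [U_]
md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]'

printf '%s\n' "$sample" | awk '
  /^md/          { dev = $1 }                  # remember the current array
  /\[U*_+U*\]/   { print dev " is degraded" }  # underscore = missing member'
```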
==== Repair RAID ====

When a disk is failing you need to replace the drive. First, check the current state of the arrays:
<code>
# cat /proc/mdstat
Personalities : [raid1] [raid0]
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]

md1 : active raid1 hdd3[1] hda3[0]
      26627648 blocks [2/2] [UU]

md2 : active raid0 hdd5[1] hda5[0]
      36868608 blocks 256k chunks

md4 : active raid1 hdd6[1] hda6[0]
      2168640 blocks [2/2] [UU]

md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]

unused devices: <none>
</code>
If ''/dev/hda'' is the failing drive, mark its partitions as failed and remove them from the RAID1 arrays:

<code>
# mdadm /dev/md1 --fail /dev/hda3 --remove /dev/hda3
# mdadm /dev/md3 --fail /dev/hda1 --remove /dev/hda1
# mdadm /dev/md4 --fail /dev/hda6 --remove /dev/hda6
</code>
''/dev/md2'' is RAID0, so it has no redundancy and its members cannot be failed; stop the whole array instead:

<code>
# mdadm --stop /dev/md2
</code>
<note warning>You must shut down the server before you physically remove the drive!</note>

Shut the server down and replace the faulty drive with a new one. After booting, your drive letters may have shifted around, so be sure to verify which is which before proceeding.
Clone the partition table from the good drive to the new one (here ''/dev/hda'' is the surviving drive and ''/dev/hdd'' is the replacement):

<code>
# sfdisk -d /dev/hda | sfdisk /dev/hdd
</code>
Verify the new partitions can be seen:

<code>
# partprobe -s
/dev/hda: msdos partitions 1 2 3 4 <5 6>
/dev/hdd: msdos partitions 1 2 3 4 <5 6>
/dev/sda: msdos partitions 1
/dev/sdb: msdos partitions 1
/dev/sdc: msdos partitions 1
</code>
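Rather than eyeballing that output, you can compare the two drives' partition lists directly. A minimal sketch with the ''partprobe -s'' output hard-coded:

```shell
# partprobe -s style lines for the two IDE drives (hard-coded sample)
out='/dev/hda: msdos partitions 1 2 3 4 <5 6>
/dev/hdd: msdos partitions 1 2 3 4 <5 6>'

# Pull out each drive's partition list and compare them
a=$(printf '%s\n' "$out" | awk -F': ' '$1 == "/dev/hda" { print $2 }')
b=$(printf '%s\n' "$out" | awk -F': ' '$1 == "/dev/hdd" { print $2 }')
if [ "$a" = "$b" ]; then echo "partition tables match"; else echo "MISMATCH"; fi
```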
Re-create the scratch partition (RAID0):

<code>
# mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/hda5 /dev/hdd5
# mkfs.ext3 /dev/md2
# mount /dev/md2 /scratch
</code>
You can now add the new partitions back to the RAID1 arrays:

<code>
# mdadm /dev/md1 --add /dev/hdd3
# mdadm /dev/md3 --add /dev/hdd1
# mdadm /dev/md4 --add /dev/hdd6
</code>
After adding you can monitor the progress of the RAID rebuilds by looking in ''/proc/mdstat'':
<code>
# cat /proc/mdstat
Personalities : [raid1] [raid0]
md3 : active raid1 hdd1[1] hda1[0]
      200704 blocks [2/2] [UU]

md1 : active raid1 hdd3[2] hda3[0]
      26627648 blocks [2/1] [U_]
      [===================>.]  recovery = ...

md2 : inactive hda5[0]
      18434304 blocks

md4 : active raid1 hdd6[2] hda6[0]
      2168640 blocks [2/1] [U_]
        resync=DELAYED

md0 : active raid1 hdd2[1] hda2[0]
      30716160 blocks [2/2] [UU]

unused devices: <none>
</code>
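Only one array rebuilds at a time; the others sit at ''resync=DELAYED''. If you want a one-line-per-array summary, the recovery and delayed lines can be pulled out with a little awk. A sketch against hard-coded sample output (the percentages and times are hypothetical):

```shell
# Sample rebuild status; only one array rebuilds at a time, the rest wait
sample='md1 : active raid1 hdd3[2] hda3[0]
      [===>.................]  recovery = 18.7% (4981248/26627648) finish=39.4min
md4 : active raid1 hdd6[2] hda6[0]
      resync=DELAYED'

printf '%s\n' "$sample" | awk '
  /^md/            { dev = $1 }
  /recovery =/     { print dev ": " $4 " rebuilt" }
  /resync=DELAYED/ { print dev ": waiting" }'
```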
===== Hardware RAID =====

A 3ware 9500S-12 SATA RAID card using the 3w-9xxx kernel module.

==== Physical Disk Layout ====

We have one RAID controller, ''c0'', with twelve ports. Looking at the front of the machine, the ports are laid out as follows:

| Port 8 | Port 9 | Port 10 | Port 11 |
| Port 4 | Port 5 | Port 6 | Port 7 |
| Port 0 | Port 1 | Port 2 | Port 3 |
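Given that layout, a bay's port number can be computed from its position. A small sketch, assuming row 0 is the bottom row of bays and column 0 is the leftmost (that numbering is an assumption read off the table above):

```shell
# port = row * 4 + column, with row 0 at the bottom and column 0 at the left
port() { echo $(( $1 * 4 + $2 )); }

port 0 0    # bottom-left bay: prints 0
port 2 3    # top-right bay:   prints 11
```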
==== Repairing a degraded array ====

There is a utility, ''tw_cli'', which can be used to manage the 3ware RAID controller from the command line.

Study the output of ''tw_cli'' to determine:
  * Which controller is active? (c0, c1, etc)
  * Which unit is degraded? (u0, u1, u2, etc)
  * Which port is inactive or missing? (p1, p5, etc)
<note warning>Make absolutely sure you have identified the correct drive before pulling it; removing the wrong drive from a degraded RAID5 unit will destroy the array!</note>
Remove the faulty port (here assuming controller ''c0'' and port ''p5''):

<code>
# tw_cli maint remove c0 p5
</code>

Insert a new drive and rescan:

<code>
# tw_cli maint rescan c0
</code>

Rebuild the degraded array (here assuming unit ''u2''):

<code>
# tw_cli maint rebuild c0 u2 p5
</code>
Check the status of the rebuild by monitoring ''/var/log/messages'', where the 3w-9xxx driver logs its AEN events:
<code>
3w-9xxx: scsi1: AEN: INFO (0x04:...
</code>
This sucks:

<code>
3w-9xxx: scsi1: AEN: INFO (0x04:...
3w-9xxx: scsi1: AEN: ERROR (0x04:...
</code>
<code>
$ su -
Password:
# tw_cli /c0 show

Unit  UnitType  Status  ...
------------------------------------------------------------------------------
u0    RAID-5    ...
u1    RAID-5    ...
u2    RAID-5    ...

Port  Status  ...
---------------------------------------------------------------
p0    ...
p1    ...
p2    ...
p3    ...
p4    ...
p5    ...
p6    ...
p7    ...
p8    ...
p9    ...
p10   OK
p11   OK
</code>

Looks like another drive failed.
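Scanning that listing by hand is error-prone; anything whose Status column isn't ''OK'' deserves attention. A sketch that filters the port table, using hypothetical sample statuses (the real output's status values were not preserved here):

```shell
# Sample tw_cli port lines (statuses here are made up for illustration)
ports='p8     OK               u2     233.76 GB
p9     DEGRADED         u2     233.76 GB
p10    OK               u2     233.76 GB
p11    NOT-PRESENT      -      -'

# Print every port whose status is not OK
printf '%s\n' "$ports" | awk '$2 != "OK" { print $1 ": " $2 }'
```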
raid.txt · Last modified: 2010/09/19 23:58 by aorth