HDD trouble on the storage machine

The file server running FreeNAS 8.0.1-RC1 had been spitting out log messages like the following:

Device: /dev/ada1, 6 Offline uncorrectable sectors
Device: /dev/ada1, Currently unreadable (pending) sectors

Checking with smartctl shows the following:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   145   143   021    Pre-fail  Always       -       9708
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       2
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       622
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       59
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       41
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       40031
194 Temperature_Celsius     0x0022   119   107   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       6

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       607         1659864016
# 2  Extended offline    Completed: read failure       90%       607         1659864016
# 3  Extended offline    Completed: read failure       90%       606         1659864016
# 4  Extended offline    Completed: read failure       90%       605         1659864016
# 5  Conveyance offline  Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
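
For reference, a report like the one above is just smartctl's full output; a minimal sketch of the commands, assuming the device name from the smartd log:

# smartctl -a /dev/ada1
# smartctl -t long /dev/ada1

The first dumps the attribute table, error log and self-test log in one go; the second queues another extended offline self-test.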

So it looks like the drive has developed bad sectors.
Whether the data itself is actually okay is anyone's guess, but

# zpool status
  pool: ZFS-POOL
 state: ONLINE
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	ZFS-POOL                                        ONLINE       0     0     0
	  raidz2                                        ONLINE       0     0     0
	    gptid/ae4906c5-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/ae9684ec-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/af2b4bbb-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/afbdff8f-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b04cea05-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b0e1b146-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0

errors: No known data errors

that is the verdict, so I'll choose to believe there is no problem.
Still, it doesn't feel great, so I decided to shut the machine down, pull this disk, and swap in a new one.
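
Before pulling anything, it's worth double-checking which physical disk ada1 actually is. A sketch of how that could be done on FreeBSD/FreeNAS (none of this is from the original session):

# glabel status
# camcontrol devlist
# smartctl -i /dev/ada1

glabel status maps the gptid/... labels seen in zpool status to adaXpY partitions, camcontrol devlist lists the attached drives, and smartctl -i prints the model and serial number so the right unit can be found in the chassis.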

----

So, to start with, I yanked out just the drive that had been reporting the errors.

/# zpool status
  pool: ZFS-POOL
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	ZFS-POOL                                        DEGRADED     0     0     0
	  raidz2                                        DEGRADED     0     0     0
	    gptid/ae4906c5-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    6483663345112995382                         UNAVAIL      0     0     0  was /dev/gptid/ae9684ec-d63c-11e0-9b18-14dae93d2f3c
	    gptid/af2b4bbb-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/afbdff8f-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b04cea05-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b0e1b146-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0

errors: No known data errors

The state that was ONLINE earlier is now DEGRADED, and the drive I pulled is presumably the one showing up as UNAVAIL.
Even in this state, mounting the share from Lion as usual works without any problem at all. RAIDZ is awesome.
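
In hindsight, the gentler way would probably have been to take the member offline before pulling it; a sketch, using the gptid of the pulled drive from the status above (not what was actually done here):

# zpool offline ZFS-POOL gptid/ae9684ec-d63c-11e0-9b18-14dae93d2f3c

That tells ZFS the device is intentionally out of service instead of letting it discover the hole on its own.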

And so, with the replacement drive slotted in, the smartctl results once more:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   160   160   021    Pre-fail  Always       -       8991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       24
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
194 Temperature_Celsius     0x0022   125   119   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Fresh out of the box.
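
Before handing the new disk over to the pool, it probably wouldn't hurt to run a quick self-test on it as well; a sketch, assuming the replacement also shows up as /dev/ada1:

# smartctl -t conveyance /dev/ada1
# smartctl -l selftest /dev/ada1

The conveyance test is a short check for damage in transit, and the second command prints the self-test log once it has finished.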

/# zpool status
  pool: ZFS-POOL
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	ZFS-POOL                                        DEGRADED     0     0     0
	  raidz2                                        DEGRADED     0     0     0
	    gptid/ae4906c5-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    6483663345112995382                         UNAVAIL      0     0     0  was /dev/gptid/ae9684ec-d63c-11e0-9b18-14dae93d2f3c
	    gptid/af2b4bbb-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/afbdff8f-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b04cea05-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b0e1b146-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0

errors: No known data errors

The order of steps may well have been backwards, but at this point the old device is still listed on the maintenance web page, so I replace it there and then detach it. After that, using EDIT, I make the new disk a member of the pool I want, and it ends up like this (a rough CLI equivalent follows the listing below):

# zpool status
  pool: ZFS-POOL
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h4m, 1.49% done, 5h11m to go
config:

	NAME                                            STATE     READ WRITE CKSUM
	ZFS-POOL                                        ONLINE       0     0     0
	  raidz2                                        ONLINE       0     0     0
	    gptid/ae4906c5-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    ada1p2                                      ONLINE       0     0     0  14.2G resilvered
	    gptid/af2b4bbb-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/afbdff8f-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b04cea05-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0
	    gptid/b0e1b146-d63c-11e0-9b18-14dae93d2f3c  ONLINE       0     0     0

errors: No known data errors
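
For reference, the same replacement could presumably be done straight from the shell with zpool replace; a minimal sketch, using the old member's GUID from the degraded status above and the partition name the web UI ended up using (not the commands actually run here):

# zpool replace ZFS-POOL 6483663345112995382 ada1p2
# zpool status ZFS-POOL

zpool replace swaps the missing member for the new device and kicks off the resilver; zpool status then shows the progress, as in the listing above.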

It's a RAIDZ2 of six 3TB drives holding roughly 3TB of data, and the resilver is apparently going to take about five hours.
Case closed for now. Being able to do all of this from the web UI is convenient.