CEPH OSD_SCRUB_ERRORS Error Handling
1. Error description
# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 2.33c6 is active+clean+inconsistent, acting [355,138,29]
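Here the affected PG is already named in the health output. If several pools or PGs are involved, the inconsistent PGs of a pool can also be enumerated directly; a minimal sketch, where <pool> is a placeholder for the pool name (not shown in the original output) and the result line is illustrative:
# rados list-inconsistent-pg <pool>
["2.33c6"]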
2. Problem diagnosis
List the inconsistent objects in the affected PG:
# rados list-inconsistent-obj 2.33c6 --format=json-pretty
{
    "epoch": 560,
    "inconsistents": [
        {
            "object": {
                "name": "1159921",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 67
            },
            "errors": [],
            "union_shard_errors": [
                "read_error"
            ],
            "selected_object_info": {
                "oid": {
                    "oid": "1159921",
                    "key": "",
                    "snapid": -2,
                    "hash": 1986311110,
                    "max": 0,
                    "pool": 2,
                    "namespace": ""
                },
                "version": "584'67",
                "prior_version": "0'0",
                "last_reqid": "client.453617.0:72999",
                "user_version": 67,
                "size": 4194304,
                "mtime": "2020-12-10 20:58:11.523179",
                "local_mtime": "2020-12-10 20:58:11.528662",
                "lost": 0,
                "flags": [
                    "dirty",
                    "data_digest"
                ],
                "legacy_snaps": [],
                "truncate_seq": 0,
                "truncate_size": 0,
                "data_digest": "0xf45029cc",
                "omap_digest": "0xffffffff",
                "expected_object_size": 0,
                "expected_write_size": 0,
                "alloc_hint_flags": 0,
                "manifest": {
                    "type": 0,
                    "redirect_target": {
                        "oid": "",
                        "key": "",
                        "snapid": 0,
                        "hash": 0,
                        "max": 0,
                        "pool": -9223372036854775808,
                        "namespace": ""
                    }
                },
                "watchers": {}
            },
            "shards": [
                {
                    "osd": 29,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xf45029cc"
                },
                {
                    "osd": 138,
                    "primary": false,
                    "errors": [
                        "read_error"
                    ],
                    "size": 4194304
                },
                {
                    "osd": 355,
                    "primary": true,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xf45029cc"
                }
            ]
        }
    ]
}
One object has a read_error on its replica on osd.138. The corresponding scrub error can be seen in the log of the primary OSD, osd.355:
# grep '2.33c6' ceph-osd.355.log-20201125
2020-12-08 23:00:00.469371 7ff5b8c43700 0 log_channel(cluster) log [DBG] : 2.33c6 scrub starts
2020-12-08 23:00:00.472653 7ff5b8c43700 0 log_channel(cluster) log [DBG] : 2.33c6 scrub ok
2020-12-10 23:00:05.031243 7ff5b8c43700 0 log_channel(cluster) log [DBG] : 2.33c6 deep-scrub starts
2020-12-10 23:00:19.077055 7ff5b8c43700 -1 log_channel(cluster) log [ERR] : 2.33c6 shard 138 soid 2:63cd266e:::1159921:head : candidate had a read error
2020-12-10 23:00:19.938043 7ff5b8c43700 -1 log_channel(cluster) log [ERR] : 2.33c6 deep-scrub 0 missing, 1 inconsistent objects
2020-12-10 23:00:19.938052 7ff5b8c43700 -1 log_channel(cluster) log [ERR] : 2.33c6 deep-scrub 1 errors
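The error points at the shard on osd.138. Before checking its kernel log, the OSD can be located in the cluster; a minimal sketch:
# ceph osd find 138     ### prints the host and crush location of osd.138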
The kernel log on the host of osd.138 shows unrecoverable medium errors (bad sectors) on sdh:
[四 12月 10 23:03:02 2020] Process accounting resumed
[四 12月 10 23:03:15 2020] megaraid_sas 0000:02:00.0: 5305 (660927610s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 06(e0x20/s6) at 255d319
[四 12月 10 23:03:18 2020] sd 0:2:7:0: [sdh] tag#0 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:18 2020] sd 0:2:7:0: [sdh] tag#2 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#2 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#0 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] megaraid_sas 0000:02:00.0: 5307 (660927613s/0x0001/FATAL) - Uncorrectable medium error logged for VD 07/7 at 255d319 (on PD 06(e0x20/s6) at 255d319)
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#1 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#0 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] megaraid_sas 0000:02:00.0: 5308 (660927613s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 06(e0x20/s6) at 255d412
[四 12月 10 23:03:20 2020] megaraid_sas 0000:02:00.0: 5309 (660927613s/0x0001/FATAL) - Uncorrectable medium error logged for VD 07/7 at 255d412 (on PD 06(e0x20/s6) at 255d412)
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#4 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#4 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#5 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#5 Sense Key : Medium Error [current]
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#5 Add. Sense: No additional sense information
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#5 CDB: Read(16) 88 00 00 00 00 00 02 55 d3 80 00 00 02 00 00 00
[四 12月 10 23:03:20 2020] blk_update_request: I/O error, dev sdh, sector 39179136
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 BRCM Debug mfi stat 0x2d, data len requested/completed 0x40000/0x0
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 Sense Key : Medium Error [current]
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 Add. Sense: No additional sense information
[四 12月 10 23:03:20 2020] sd 0:2:7:0: [sdh] tag#3 CDB: Read(16) 88 00 00 00 00 00 02 55 d1 80 00 00 02 00 00 00
[四 12月 10 23:03:20 2020] blk_update_request: I/O error, dev sdh, sector 39178624
[四 12月 10 23:04:02 2020] Process accounting resumed
Confirm that sdh is the disk backing osd.138: the OSD's block symlink points to an LVM volume group, and pvs shows which physical disk holds that VG:
# pwd
/var/lib/ceph/osd/ceph-138
# ls -l
总用量 48
-rw-r--r-- 1 ceph ceph 768 11月 23 18:05 activate.monmap
lrwxrwxrwx 1 ceph ceph 93 11月 23 18:05 block -> /dev/ceph-799147d1-13d6-4229-b5db-4e31a61ad5b4/osd-block-389e5a16-060a-515e-bbc3-9a0d74beccf2
-rw-r--r-- 1 ceph ceph 2 11月 23 18:05 bluefs
-rw-r--r-- 1 ceph ceph 37 11月 23 18:05 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 11月 23 18:05 fsid
-rw------- 1 ceph ceph 57 11月 23 18:05 keyring
-rw-r--r-- 1 ceph ceph 8 11月 23 18:05 kv_backend
-rw-r--r-- 1 ceph ceph 21 11月 23 18:05 magic
-rw-r--r-- 1 ceph ceph 4 11月 23 18:05 mkfs_done
-rw-r--r-- 1 ceph ceph 41 11月 23 18:05 osd_key
-rw-r--r-- 1 ceph ceph 6 11月 23 18:05 ready
-rw-r--r-- 1 ceph ceph 10 11月 23 18:05 type
-rw-r--r-- 1 ceph ceph 4 11月 23 18:05 whoami
# pvs
PV VG Fmt Attr PSize PFree
/dev/sdb ceph-612fc205-5d11-4818-bd83-251959a71bf3 lvm2 a-- <3.64t 0
/dev/sdc ceph-f31c27c4-0c5a-4b82-af21-790647433f49 lvm2 a-- <3.64t 0
/dev/sdd ceph-87748016-6b8d-422f-9d25-75c50046331c lvm2 a-- <3.64t 0
/dev/sde ceph-94928fab-d6b5-42eb-b77d-4f507a7c3c95 lvm2 a-- <3.64t 0
/dev/sdf ceph-c488371e-b364-4e04-be75-d41abc561591 lvm2 a-- <3.64t 0
/dev/sdg ceph-efeaff04-c72a-48fe-9742-56030c4cf82d lvm2 a-- <3.64t 0
/dev/sdh ceph-799147d1-13d6-4229-b5db-4e31a61ad5b4 lvm2 a-- <3.64t 0 ### sdh is the disk backing osd-138
/dev/sdi ceph-90a95edd-539e-4ca9-806a-f994f29972af lvm2 a-- <3.64t 0
/dev/sdj ceph-006f3237-d6b3-4353-beb9-8d6307c0cf24 lvm2 a-- <3.64t 0
/dev/sdk ceph-361dae3d-c7e2-4ec7-bb14-ffac49acbb3a lvm2 a-- <3.64t 0
/dev/sdl ceph-6b4f8394-9b01-4c21-a8a5-6aa283ff3662 lvm2 a-- <3.64t 0
/dev/sdm ceph-1fc4f01f-bc05-489a-a15c-33834adc197b lvm2 a-- <3.64t 0
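As a cross-check, the OSD-to-device mapping and the disk's media-error state can be inspected on the OSD host. The commands below are a sketch; the megaraid slot number 6 is taken from the PD 06 entries in the controller log above and may need adjusting for other setups:
# ceph-volume lvm list /dev/sdh          ### prints the OSD that uses this device
# smartctl -a -d megaraid,6 /dev/sdh     ### SMART data for the physical disk behind the RAID controller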
3. Data repair
Since only the shard on osd.138 reports a read error while the other two replicas agree on size and digests, pg repair can safely rewrite the bad shard from an authoritative copy:
# ceph pg repair 2.33c6
instructing pg 2.33c6 on osd.355 to repair
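The repair runs asynchronously; while waiting, its progress can be followed, for example (a minimal sketch):
# ceph -w | grep 2.33c6              ### watch cluster log messages for this PG
# ceph pg 2.33c6 query | grep state  ### current PG state, including repair/deep-scrub flags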
# ...
# ceph health detail
HEALTH_OK
# rados list-inconsistent-obj 2.33c6 --format=json-pretty
{
"epoch": 560,
"inconsistents": []
}
#
Repair complete.
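Note that the repair only rewrites this single object; sdh still carries unrecovered media errors, so the disk and osd.138 should be scheduled for replacement. A minimal sketch of draining the OSD first (timing and rebalancing policy are site-specific assumptions):
# ceph osd out 138             ### start migrating data off the failing disk
# systemctl stop ceph-osd@138  ### stop the daemon once the cluster has rebalanced and is healthy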