Injecting memory UE/CE faults with EINJ
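Abbreviations
Abbreviation | Full name and description |
---|---|
SRAO | Software Recoverable Action Optional. Reported through MCE or CMCI. An SRAO error means that some data in the system is corrupted but has not been consumed. Software recovery is optional, and a recovery strategy can be chosen based on MCACOD. |
SRAR | Software Recoverable Action Required. Reported through MCE. An SRAR error means that some data in the system is corrupted and is being consumed. Software must take a recovery action before the task on this CPU is scheduled again (typically killing the process running on that CPU, though not limited to that). If recovery is impossible, for example because the address or task information cannot be obtained, the system should panic. |
EDAC | Error Detection And Correction |
CE | Corrected Error |
UCE | Uncorrected Error |
UCR | Uncorrected Recoverable Error (hardware cannot repair it by itself, but software can take action to recover from it) |
MCA | Machine Check Architecture |
MCE | Machine Check Exception (an exception) |
CMCI | Corrected Machine Check Interrupt (an interrupt). It signals that a correctable error has occurred, for example in memory, typically because of a hardware fault. Such errors do not crash the system but may degrade its performance. |
UCNA | Uncorrected No Action Required. Reported through CMCI. A UCNA error means that some data in the system is corrupted but has not been consumed (that is, not read), the processor state is still valid, and programs can continue to execute on the processor. |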
Tool download
https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools/
If the platform supports EINJ, an EINJ entry appears in the boot log:
ACPI: EINJ 0x0000000049D553E0 000150 (v01 ALASKA A M I 00000001 INTL 00000001)
Installing the einj.ko kernel driver
Before loading the einj.ko.xz driver, enable the corresponding BIOS option, WHEA Error Injection Support.
1. Driver path: /lib/modules/xx/kernel/drivers/acpi/apei/einj.ko.xz
2. Load the driver: insmod /lib/modules/xx/kernel/drivers/acpi/apei/einj.ko.xz
3. List the einj nodes: ll /sys/kernel/debug/apei/einj/
--w------- 1 root root 0 Sep 11 10:37 error_inject
-rw------- 1 root root 0 Sep 11 10:37 error_type
-rw------- 1 root root 0 Sep 11 10:37 flags
-rw------- 1 root root 0 Sep 11 10:37 notrigger
-rw------- 1 root root 0 Sep 11 10:37 param1
-rw------- 1 root root 0 Sep 11 10:37 param2
-rw------- 1 root root 0 Sep 11 10:37 param3
-rw------- 1 root root 0 Sep 11 10:37 param4
-r-------- 1 root root 0 Sep 11 10:37 vendor
-rw------- 1 root root 0 Sep 11 10:37 vendor_flags
-r-------- 1 root root 0 Sep 11 10:37 available_error_type
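$ insmod einj.ko.xz   # the driver is built as a kernel module once the relevant kernel options are enabled
After the driver loads successfully, the kernel log shows the following message: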
EINJ: Error INJection is initialized.
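A quick way to confirm that the interface is complete after loading the driver is to compare the debugfs directory against the node names from the listing above. The following is only a minimal sketch: the helper name `missing_einj_nodes` is hypothetical, and the default path assumes debugfs is mounted at /sys/kernel/debug.

```python
import os

# Node names taken from the directory listing shown above.
EINJ_NODES = [
    "available_error_type", "error_type", "error_inject", "flags",
    "notrigger", "param1", "param2", "param3", "param4",
    "vendor", "vendor_flags",
]


def missing_einj_nodes(einj_dir="/sys/kernel/debug/apei/einj"):
    """Return the expected node names that are absent from einj_dir.

    If the directory does not exist (or is not visible to this user),
    every node is reported as missing.
    """
    if not os.path.isdir(einj_dir):
        return sorted(EINJ_NODES)
    present = set(os.listdir(einj_dir))
    return sorted(name for name in EINJ_NODES if name not in present)


if __name__ == "__main__":
    missing = missing_einj_nodes()
    if missing:
        print("missing einj nodes:", ", ".join(missing))
    else:
        print("einj interface looks complete")
```

The directory is readable by root only, so an unprivileged run reports every node as missing even when the driver is loaded.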
Kernel documentation
Documentation/acpi/apei/einj.txt
APEI Error INJection
~~~~~~~~~~~~~~~~~~~~
EINJ provides a hardware error injection mechanism. It is very useful
for debugging and testing APEI and RAS features in general.
You need to check whether your BIOS supports EINJ first. For that, look
for early boot messages similar to this one:
ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
which shows that the BIOS is exposing an EINJ table - it is the
mechanism through which the injection is done.
Alternatively, look in /sys/firmware/acpi/tables for an "EINJ" file,
which is a different representation of the same thing.
It doesn't necessarily mean that EINJ is not supported if those above
don't exist: before you give up, go into BIOS setup to see if the BIOS
has an option to enable error injection. Look for something called WHEA
or similar. Often, you need to enable an ACPI5 support option prior, in
order to see the APEI,EINJ,... functionality supported and exposed by
the BIOS menu.
To use EINJ, make sure the following options are enabled in your kernel
configuration:
CONFIG_DEBUG_FS
CONFIG_ACPI_APEI
CONFIG_ACPI_APEI_EINJ
The EINJ user interface is in <debugfs mount point>/apei/einj.
The following files belong to it:
- available_error_type
This file shows which error types are supported:
Error Type Value Error Description
================ =================
0x00000001 Processor Correctable
0x00000002 Processor Uncorrectable non-fatal
0x00000004 Processor Uncorrectable fatal
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal
0x00000040 PCI Express Correctable
0x00000080 PCI Express Uncorrectable non-fatal
0x00000100 PCI Express Uncorrectable fatal
0x00000200 Platform Correctable
0x00000400 Platform Uncorrectable non-fatal
0x00000800 Platform Uncorrectable fatal
The format of the file contents is as above, except that only the
available error types are present.
- error_type
Set the value of the error type being injected. Possible error types
are defined in the file available_error_type above.
- error_inject
Write any integer to this file to trigger the error injection. Make
sure you have specified all necessary error parameters, i.e. this
write should be the last step when injecting errors.
- flags
Present for kernel versions 3.13 and above. Used to specify which
of param{1..4} are valid and should be used by the firmware during
injection. Value is a bitmask as specified in ACPI5.0 spec for the
SET_ERROR_TYPE_WITH_ADDRESS data structure:
Bit 0 - Processor APIC field valid (see param3 below).
Bit 1 - Memory address and mask valid (param1 and param2).
Bit 2 - PCIe (seg,bus,dev,fn) valid (see param4 below).
If set to zero, legacy behavior is mimicked where the type of
injection specifies just one bit set, and param1 is multiplexed.
- param1
This file is used to set the first error parameter value. Its effect
depends on the error type specified in error_type. For example, if
error type is memory related type, the param1 should be a valid
physical memory address. [Unless "flags" is set - see above]
- param2
Same use as param1 above. For example, if error type is of memory
related type, then param2 should be a physical memory address mask.
Linux requires page or narrower granularity, say, 0xfffffffffffff000.
- param3
Used when the 0x1 bit is set in "flags" to specify the APIC id
- param4
Used when the 0x4 bit is set in "flags" to specify target PCIe device
- notrigger
The error injection mechanism is a two-step process. First inject the
error, then perform some actions to trigger it. Setting "notrigger"
to 1 skips the trigger phase, which *may* allow the user to cause the
error in some other context by a simple access to the CPU, memory
location, or device that is the target of the error injection. Whether
this actually works depends on what operations the BIOS actually
includes in the trigger phase.
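The contents of available_error_type and the flags bitmask described above lend themselves to a small parser. The sketch below assumes each line of available_error_type is a hexadecimal value followed by whitespace and a description, as in the listings in this document. The function and constant names are made up for illustration and are not part of any kernel or ras-tools API.

```python
# Bit meanings of the "flags" file, as described in the text above.
FLAG_PROCESSOR_APIC = 0x1  # bit 0: param3 holds a valid APIC id
FLAG_MEMORY = 0x2          # bit 1: param1/param2 hold address and mask
FLAG_PCIE_SBDF = 0x4       # bit 2: param4 holds a valid PCIe seg/bus/dev/fn


def parse_available_error_types(text):
    """Map each supported error type value to its description."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower().startswith("0x"):
            types[int(parts[0], 16)] = parts[1].strip()
    return types


def describe_flags(flags):
    """List which param files the firmware is told to treat as valid."""
    fields = []
    if flags & FLAG_PROCESSOR_APIC:
        fields.append("param3 (APIC id)")
    if flags & FLAG_MEMORY:
        fields.append("param1/param2 (memory address and mask)")
    if flags & FLAG_PCIE_SBDF:
        fields.append("param4 (PCIe seg/bus/dev/fn)")
    return fields


if __name__ == "__main__":
    sample = (
        "0x00000002\tProcessor Uncorrectable non-fatal\n"
        "0x00000008\tMemory Correctable\n"
        "0x00000010\tMemory Uncorrectable non-fatal\n"
    )
    supported = parse_available_error_types(sample)
    print(0x8 in supported, supported.get(0x8))
    print(describe_flags(0x3))
```

Checking the requested type against the parsed table before writing error_type avoids asking the firmware for an injection it does not advertise.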
BIOS versions based on the ACPI 4.0 specification have limited options
in controlling where the errors are injected. Your BIOS may support an
extension (enabled with the param_extension=1 module parameter, or boot
command line einj.param_extension=1). This allows the address and mask
for memory injections to be specified by the param1 and param2 files in
apei/einj.
BIOS versions based on the ACPI 5.0 specification have more control over
the target of the injection. For processor-related errors (type 0x1, 0x2
and 0x4), you can set flags to 0x3 (param3 for bit 0, and param1 and
param2 for bit 1) so that you have more information added to the error
signature being injected. The actual data passed is this:
memory_address = param1;
memory_address_range = param2;
apicid = param3;
pcie_sbdf = param4;
For memory errors (type 0x8, 0x10 and 0x20) the address is set using
param1 with a mask in param2 (0x0 is equivalent to all ones). For PCI
express errors (type 0x40, 0x80 and 0x100) the segment, bus, device and
function are specified using param1:
31 24 23 16 15 11 10 8 7 0
+-------------------------------------------------+
| segment | bus | device | function | reserved |
+-------------------------------------------------+
Anyway, you get the idea, if there's doubt just take a look at the code
in drivers/acpi/apei/einj.c.
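The segment/bus/device/function layout in the diagram above (segment in bits 31..24, bus in 23..16, device in 15..11, function in 10..8, low byte reserved) can be packed and unpacked as follows. This is an illustrative sketch only: `pack_sbdf` and `unpack_sbdf` are hypothetical helper names, and the field widths are read directly off the diagram.

```python
def pack_sbdf(segment, bus, device, function):
    """Pack PCIe segment/bus/device/function into the 32-bit layout above."""
    fields = (
        ("segment", segment, 8),
        ("bus", bus, 8),
        ("device", device, 5),
        ("function", function, 3),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    # Bits 7..0 are reserved and left as zero.
    return (segment << 24) | (bus << 16) | (device << 11) | (function << 8)


def unpack_sbdf(value):
    """Inverse of pack_sbdf: return (segment, bus, device, function)."""
    return (
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 11) & 0x1F,
        (value >> 8) & 0x07,
    )


if __name__ == "__main__":
    packed = pack_sbdf(0, 0x3A, 0x1F, 7)
    print(hex(packed), unpack_sbdf(packed))
```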
An ACPI 5.0 BIOS may also allow vendor-specific errors to be injected.
In this case a file named vendor will contain identifying information
from the BIOS that hopefully will allow an application wishing to use
the vendor-specific extension to tell that they are running on a BIOS
that supports it. All vendor extensions have the 0x80000000 bit set in
error_type. A file vendor_flags controls the interpretation of param1
and param2 (1 = PROCESSOR, 2 = MEMORY, 4 = PCI). See your BIOS vendor
documentation for details (and expect changes to this API if vendors
creativity in using this feature expands beyond our expectations).
An error injection example:
# cd /sys/kernel/debug/apei/einj
# cat available_error_type # See which errors can be injected
0x00000002 Processor Uncorrectable non-fatal
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
# echo 0x12345000 > param1 # Set memory address for injection
# echo $((-1 << 12)) > param2 # Mask 0xfffffffffffff000 - anywhere in this page
# echo 0x8 > error_type # Choose correctable memory error
# echo 1 > error_inject # Inject now
You should see something like this in dmesg:
[22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
[22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
[22715.834759] EDAC sbridge MC3: TSC 0
[22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86
[22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0
[22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
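The same injection sequence can be scripted. The sketch below mirrors the shell example above: it writes param1, param2 and error_type, and writes error_inject last. The function names and the `einj_dir` parameter are hypothetical. The directory is a parameter so the logic can be exercised against a scratch directory before being pointed at the real debugfs path.

```python
import os

U64_MASK = 0xFFFFFFFFFFFFFFFF


def page_mask(page_shift=12):
    """64-bit equivalent of the shell expression $((-1 << 12))."""
    return (~0 << page_shift) & U64_MASK


def inject_memory_error(address, einj_dir, error_type=0x8, page_shift=12):
    """Write the EINJ nodes in the order used by the shell example.

    error_type defaults to 0x8 (Memory Correctable). The write to
    error_inject comes last because it triggers the injection.
    Returns the list of (node, value) pairs that were written.
    """
    writes = [
        ("param1", f"0x{address:x}"),
        ("param2", f"0x{page_mask(page_shift):x}"),
        ("error_type", f"0x{error_type:x}"),
        ("error_inject", "1"),
    ]
    for name, value in writes:
        with open(os.path.join(einj_dir, name), "w") as node:
            node.write(value + "\n")
    return writes


if __name__ == "__main__":
    print(hex(page_mask()))
```

Calling `inject_memory_error(0x12345000, "/sys/kernel/debug/apei/einj")` as root on a machine with EINJ enabled would inject a real error, so it is best tried on a test system only.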
For more information about EINJ, please refer to ACPI specification
version 4.0, section 17.5 and ACPI 5.0, section 18.6.
Memory UE/CE fault injection types
./einj_mem_uc: invalid option -- '-'
Usage: ./einj_mem_uc [-a][-c count][-d delay][-f][-i][j][k] [-m runup:size:align][testname]
Testname Fatal Description
single no Single read in pipeline to target address, generates SRAR machine check
double no Double read in pipeline to target address, generates SRAR machine check
split YES Unaligned read crosses cacheline from good to bad. Probably fatal
THP no Try to inject in transparent huge page, generates SRAR machine check
hugetlb no Try to inject in hugetlb page, generates SRAR machine check
store no Write to target address. Should generate a UCNA/CMCI
prefetch no Prefetch data into L1 cache. Should generate CMCI
memcpy YES Streaming read from target address. Probably fatal
instr no Instruction fetch. Generates SRAR that OS should transparently fix
patrol no Patrol scrubber, generates SRAO machine check
thread no Single read by two threads to target address at the same time, generates SRAR machine check
share no Share memory is read by two tasks to target address, generates SRAR machine check
overflow YES Read to two target addresses at the same time, Probably fatal
llc no Cache write-back, generates SRAO machine check
copyin YES Kernel copies data from user. Probably fatal
copyout YES Kernel copies data to user. Probably fatal
copy-on-write YES Kernel copies user page. Probably fatal
futex YES Kernel access to futex(2). Probably fatal
mlock no mlock target page then inject/read to generates SRAR machine check
core_ce no Core corrected error
core_non_fatal no Core deferred error
core_fatal YES Core uncorrected error. Should fatal
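Term | Explanation |
---|---|
Single read in pipeline | As I understand it, "pipeline" here refers to the CPU pipeline. |
split | An unaligned read crosses a cache line boundary from good data into bad data; probably fatal. |
THP | Transparent Huge Pages are huge pages allocated dynamically at run time, whereas standard HugePages are preallocated in advance (typically at boot) and stay reserved while the system runs. |
hugetlb | hugetlb acts as the manager of huge pages: this module is responsible for both allocating and freeing them. |
prefetch | Prefetches data from memory into the L1 cache. |
patrol | Patrol scrubbing (periodic inspection of memory). |
futex | The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. |
mlock | The mlock family of system calls lets a program lock part or all of its address space into physical memory. This prevents Linux from paging that memory out to swap space, even if the program has not accessed it for some time. |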
MCE
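An MCE (Machine Check Exception) is an error detected by the CPU. There are two main kinds: notices/warnings and fatal exceptions. A warning writes a message such as "Machine Check Event logged" to the system log, and the details can be inspected with Linux tools. A fatal MCE makes the machine stop responding, and the MCE details are printed to the system console.
What causes MCE errors?
- Memory errors or ECC problems
- Insufficient cooling or CPU overheating
- System bus errors
- Errors in the CPU cache, the processor, or other hardware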
EDAC: https://zhuanlan.zhihu.com/p/29013350
RAS feature analysis: https://www.zhihu.com/people/helloxiao-cui/posts
https://blog.csdn.net/leoufung/article/details/48784191?ydreferer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8%3D
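Memory types
File-backed pages
The cache figure reported by free consists of file-backed pages: each of them corresponds to a file or data on disk. Pages with no backing file are called anonymous pages. When memory is short, file-backed pages can be written back to their file on disk (page-out) to free memory, and the data is read from disk again when needed. For example, echo 3 > /proc/sys/vm/drop_caches releases most of the cache.
Anonymous pages
The heap, stack, data segment and similar regions used by applications have no backing file and are called anonymous pages. Because they do not exist as files, they cannot be exchanged with a disk file, but they can be swapped out to a swap partition or swap file on disk (swap-out). Anonymous pages live and die with the user process: they are freed when the process exits, whereas the page cache can keep data cached even after the process exits.
Dirty pages
Memory pages holding data that an application has modified but that has not yet been written to disk are called dirty pages. Before such pages can be freed, they must be written to disk. This usually happens in one of two ways: the application calls fsync to flush the dirty pages, or the kernel flushes them through its writeback threads (pdflush in older kernels).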
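Huge pages (Hugepages)
To reduce the probability of TLB misses, Linux introduced the Hugepages mechanism, which lets the page size be set to 2 MB or 1 GB. With 2 MB huge pages, mapping 256 GB of memory takes 256 GB / 2 MB = 131072 page table entries, which occupy only about 1 MB at 8 bytes per entry, so the page table for huge pages can be held entirely in the CPU cache. Running sysctl -w vm.nr_hugepages=1024 sets the number of huge pages to 1024, for a total of 2 GB with 2 MB pages. Note that configuring huge pages makes the system allocate and reserve contiguous 2 MB blocks, which are then unavailable to normal allocations. Once the system has been running for a while and memory has become fragmented, requesting more huge pages may fail.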
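The huge page arithmetic is easy to check with a few lines. In this sketch the 8-byte size of a page table entry is an assumption (it holds for x86-64 but is not stated in the text), and only the last-level entries that map the pages are counted, ignoring the upper levels of the page table.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

ENTRY_BYTES = 8  # assumption: 8-byte page table entries, as on x86-64


def page_table_entries(memory_bytes, page_bytes):
    """Number of last-level entries needed to map memory_bytes."""
    return memory_bytes // page_bytes


def table_bytes(entries, entry_bytes=ENTRY_BYTES):
    """Memory taken by those entries alone (upper levels ignored)."""
    return entries * entry_bytes


if __name__ == "__main__":
    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        n = page_table_entries(256 * GIB, page)
        print(f"{label} pages: {n} entries, {table_bytes(n) // MIB} MiB of entries")
    # Memory reserved by vm.nr_hugepages=1024 with 2 MiB pages
    print(f"reserved: {1024 * 2 * MIB // GIB} GiB")
```

Under these assumptions, 4 KiB pages need 67108864 entries (512 MiB of entries) for 256 GiB, while 2 MiB pages need 131072 entries (1 MiB).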
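Transparent huge pages (THP)
Because huge pages are hard to manage by hand and usually require significant code changes to use effectively, Transparent Huge Pages (THP) were introduced. THP is an abstraction layer that automatically creates, manages and uses huge pages. Standard huge pages are preallocated, while transparent huge pages are allocated dynamically.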
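LRU
LRU stands for Least Recently Used. The principle: when memory runs short, evict the pages that have gone unused for the longest time, since that costs the least in system performance.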
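The eviction rule can be illustrated with a tiny cache built on `collections.OrderedDict`. This is a textbook sketch of the LRU policy only, not a description of how the kernel implements page reclaim, and the `LRUCache` class is a name chosen here for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        """Return the value for key and mark it as most recently used."""
        if key not in self._items:
            return default
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update key; return the evicted key, or None."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted, _ = self._items.popitem(last=False)
            return evicted
        return None


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")            # "a" is now the most recently used
    print(cache.put("c", 3))  # evicts "b"
```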
vdso page
vDSO is short for virtual dynamic shared object: the mapping actually contains an ELF shared object, commonly known as a .so file.
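Memory error classification
Common memory errors are classified by type into CE (Correctable Error) and UCE (Uncorrectable Error), and by scenario into memory read/write errors and memory patrol errors.
By type
Type | Description |
---|---|
CE | An error occurred while the server was running, but it could be corrected by ECC (Error Checking and Correcting), which is why CE errors are sometimes called ECC errors. Sporadic address/command errors, multi-bit errors within a single chip of x4 memory, and single-bit errors within a single chip of x8 memory can all cause ECC errors. CE errors do not affect the system. |
CE Storm | The BIOS records a timestamp each time it handles an SMI interrupt. Whenever two consecutive SMI interrupts are less than 1 minute apart, a running count is incremented; when the count reaches 10, a CE storm is declared. |
CE Overflow | Correctable memory errors are counted per rank, with a configurable threshold. When the count for a rank reaches the threshold and overflows, an SMI interrupt is triggered. To account for time, a hardware leaky bucket is added: each time a set interval passes with no CE error, the fault counter is decremented by 1. |
UCE | An error occurred while the server was running and could not be corrected by ECC. Multi-bit errors in x8 memory, multi-chip multi-bit errors in x4 memory, persistent address/command errors and a damaged register chip can all cause UCE. |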
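The CE storm rule and the leaky-bucket counter from the table above can be sketched as two small state machines. The details are assumptions, not BIOS behaviour taken from any vendor: the storm counter counts interrupts in a run (restarting at 1 after a gap of a minute or more), and the leaky counter measures its quiet time from the most recent CE. The class names and the example threshold and interval values are hypothetical.

```python
class CEStormDetector:
    """Flag a CE storm after storm_count SMIs each less than max_gap_s apart."""

    def __init__(self, max_gap_s=60.0, storm_count=10):
        self.max_gap_s = max_gap_s
        self.storm_count = storm_count
        self._last = None  # timestamp of the previous SMI
        self._run = 0      # SMIs in the current closely spaced run

    def on_smi(self, timestamp_s):
        """Record one SMI; return True once the run reaches storm_count."""
        if self._last is not None and timestamp_s - self._last < self.max_gap_s:
            self._run += 1
        else:
            self._run = 1  # gap too long (or first SMI): start a new run
        self._last = timestamp_s
        return self._run >= self.storm_count


class LeakyCECounter:
    """Per-rank CE counter that leaks 1 per quiet interval."""

    def __init__(self, threshold, leak_interval_s):
        self.threshold = threshold
        self.leak_interval_s = leak_interval_s
        self.count = 0
        self._last_ce = None

    def on_ce(self, timestamp_s):
        """Record one CE; return True if the threshold is reached."""
        if self._last_ce is not None:
            quiet = int((timestamp_s - self._last_ce) // self.leak_interval_s)
            self.count = max(0, self.count - quiet)
        self.count += 1
        self._last_ce = timestamp_s
        return self.count >= self.threshold


if __name__ == "__main__":
    storm = CEStormDetector()
    print([storm.on_smi(t) for t in range(0, 300, 30)][-1])
```

A gap of a minute or more restarts the storm count, while the leaky counter only decays gradually, so the two mechanisms react to different error patterns.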
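By scenario
Scenario | Description |
---|---|
Memory read/write error (Corrected read/write Error) | While the server is running and exchanging data for its workloads, a memory fault corrupts the data. The Intel CPU detects this during the transfer and raises an alarm. |
Memory patrol error | While the server is running, the Intel CPU patrols memory. If it finds a memory UCE fault, it reports an alarm to the OS. In many cases, however, the memory has not actually failed: a latent bug in the data verification mechanism produces a false alarm, which can be downgraded and handled as a CE. |
Corrected Patrol Scrub Error | Memory contents are read while the system is idle. If the data read contains a correctable error, the corrected data is written back to memory (an unrecoverable error is reported as a Downgraded Uncorrected Patrol Scrubbing Error). |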
Source: https://info.support.huawei.com/compute/docs/zh-cn/kunpeng-knowledge/typical-scenarios-1/zh-cn_topic_0000001137846921.html