
The system log occasionally contains the message "[Hardware Error]: Machine check events logged".

May 29 16:55:19 127.0.0.1 kernel: [Hardware Error]: Machine check events logged
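
To get a feel for how often this notice shows up, the log can be scanned with a few lines of Python. This is only a minimal sketch: it assumes the classic syslog timestamp format seen in the line above, the function name `count_mce_notices` is made up for illustration, and the second sample line is invented filler to show that unrelated lines are ignored.

```python
import re
from collections import Counter

# Matches the kernel notice quoted above. The "Mon DD HH:MM:SS" prefix is an
# assumption based on that single sample line.
MCE_NOTICE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+.*"
    r"\[Hardware Error\]: Machine check events logged"
)

def count_mce_notices(lines):
    """Return a Counter mapping 'Mon DD' to the number of MCE notices seen."""
    per_day = Counter()
    for line in lines:
        m = MCE_NOTICE.match(line)
        if m:
            per_day[f"{m['month']} {m['day']}"] += 1
    return per_day

sample = [
    "May 29 16:55:19 127.0.0.1 kernel: [Hardware Error]: Machine check events logged",
    # Invented line, present only to show that non-MCE lines are skipped.
    "May 29 16:55:20 127.0.0.1 example[1]: unrelated message",
]
print(count_mce_notices(sample))
```

On a real machine the same function could be fed an open log file instead of the `sample` list.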

Machine Check Exceptions (MCE)

What are Machine Check Exceptions (or MCE)?
A machine check exception is an error detected by your system’s processor. There are two major types of MCE: a notice or warning, and a fatal exception. A warning is recorded as a "Machine check events logged" notice in your system logs and can be viewed later with Linux utilities. A fatal MCE causes the machine to stop responding, and the details of the MCE are printed to the system’s console.

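Machine-check architecture

Pentium 4, Intel Xeon, and P6 family processors implement a machine-check architecture (MCA) that provides a mechanism for detecting and reporting hardware (machine) errors, such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. The MCA consists of a set of model-specific registers (MSRs) used to configure machine checking, plus additional banks of MSRs used to record the errors that are detected.

The processor signals an uncorrected machine-check error by generating a machine-check exception, which is an abort-class exception. Implementations of the machine-check architecture generally do not allow the processor to be restarted reliably after a machine-check exception has been generated. The machine-check exception handler can, however, collect information about the error from the machine-check MSRs.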
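
As an illustration of what those MSRs hold, here is a minimal Python sketch that splits a 64-bit per-bank status value into its flag bits. The bit positions follow my reading of the Intel SDM layout for IA32_MCi_STATUS and should be treated as an assumption; the function name and dictionary keys are made up. The input is the first STATUS value from the mcelog output shown later in this post.

```python
def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into fields (layout assumed from the SDM)."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a valid error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: error was not corrected by hardware
        "enabled": bit(60),          # EN: reporting enabled for this error
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV: no address recorded when False
        "context_corrupt": bit(57),  # PCC
        # Bits 52:38; assumed meaningful only when MCG_CAP reports CMCI support.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }

fields = decode_mci_status(0xD00000C00009008F)
for name, value in fields.items():
    print(f"{name:20} {value}")
```

For this value the valid, overflow and enabled bits are set and the uncorrected bit is clear, which lines up with the "Error overflow", "Corrected error" and "Error enabled" lines mcelog prints. The address-valid bit is clear as well.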

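Starting with the 45 nm Intel 64 processors for which CPUID reports DisplayFamily_DisplayModel as 06H_1AH, the processor can report information about corrected machine-check errors and deliver a programmable interrupt so that software can respond to them. This is referred to as the corrected machine-check error interrupt (CMCI).

Intel 64 processors that support the machine-check architecture and CMCI may also support an additional enhancement: software recovery from certain uncorrected but recoverable machine-check errors.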
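
Whether a given CPU advertises these features can be read off its capability register, which mcelog prints as MCGCAP. The sketch below decodes the value 1000c18 that appears in the log later in this post. The bit positions are my understanding of the IA32_MCG_CAP layout in the Intel SDM, so take them as an assumption; the function and key names are invented for the example.

```python
def decode_mcg_cap(cap):
    """Decode IA32_MCG_CAP capability bits (layout assumed from the Intel SDM)."""
    return {
        "bank_count": cap & 0xFF,                          # bits 7:0
        "ctl_present": bool(cap & (1 << 8)),               # MCG_CTL_P
        "ext_present": bool(cap & (1 << 9)),               # MCG_EXT_P
        "cmci_present": bool(cap & (1 << 10)),             # MCG_CMCI_P
        "threshold_status_present": bool(cap & (1 << 11)), # MCG_TES_P
        "software_recovery_present": bool(cap & (1 << 24)),# MCG_SER_P
    }

caps = decode_mcg_cap(0x1000C18)
for name, value in caps.items():
    print(f"{name:28} {value}")
```

If the layout assumption holds, this host reports 24 error banks and advertises both CMCI and software error recovery.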

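In short, a machine-check exception is an error detected by the host's CPU. There are two main types of MCE: notice or warning errors, and fatal exceptions.

A notice or warning error is recorded in the system log as a "Machine check events logged" message and can be inspected afterwards with Linux tools.
A fatal exception causes the host to stop responding, and the MCE details are printed to the system console.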


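What does mcelog do?

mcelog is a tool on x86 Linux systems for checking hardware errors, particularly memory and CPU errors.
How does mcelog run, and what are the pros and cons of each mode?
There are three ways to run it: cron, daemon, and trigger.
cron is the crudest mode, and events can be lost. trigger is the more advanced mode, in which mcelog is run when an error occurs. On el6 and el7 we generally use daemon mode.
In production: how does it run on el6 and el7?
On el6 the default should be cron, running once an hour. Daemon mode is also possible (start it manually with mcelog --daemon). By default, logs go to /var/log/mcelog and /var/log/messages.
On el7 it is started by mcelog.service by default, which is equivalent to daemon mode. However, by default it logs only to /var/log/messages, and /var/log/mcelog does not exist. To get that file, add --logfile=/var/log/mcelog to the start command.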


Example

Below is part of the mcelog output from my host.

[root@read3 ~]# less /var/log/mcelog 
mcelog: failed to prefill DIMM database from DMI data
mcelog: mcelog read: No such device
Hardware event. This is not a software error.
MCE 0
CPU 24 BANK 9 
TIME 1367586416 Fri May  3 21:06:56 2013
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00000c00009008f MCGSTATUS 0
MCGCAP 1000c18 APICID c0 SOCKETID 3 
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 24 BANK 9 
TIME 1369420817 Sat May 25 02:40:17 2013
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d00002800009008f MCGSTATUS 0
MCGCAP 1000c18 APICID c0 SOCKETID 3 
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 24 BANK 9 
TIME 1369420817 Sat May 25 02:40:17 2013
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: MEMORY CONTROLLER GEN_CHANNELunspecified_ERR
Transaction: Generic undefined request
STATUS d0000080000a008f MCGSTATUS 0
MCGCAP 1000c18 APICID c0 SOCKETID 3 
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
MCE 0
CPU 24 BANK 9 
TIME 1370150089 Sun Jun  2 13:14:49 2013
MCG status:
MCi status:
Error overflow
Corrected error
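
With this many near-identical records it helps to tally them by location. The following is a rough sketch that parses the text format shown above and counts records per (socket, CPU, bank). It relies only on the line shapes visible in this excerpt; the function name `parse_mcelog` is made up, and other mcelog versions may print different fields.

```python
import re
from collections import Counter

def parse_mcelog(text):
    """Parse mcelog text output into a list of dicts, one per hardware event."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Hardware event."):
            current = {}
            records.append(current)
            continue
        if current is None:
            continue  # preamble before the first record
        if m := re.match(r"CPU (\d+) BANK (\d+)", line):
            current["cpu"], current["bank"] = int(m[1]), int(m[2])
        elif m := re.match(r"TIME (\d+)", line):
            current["time"] = int(m[1])  # seconds since the epoch
        elif m := re.match(r"STATUS ([0-9a-f]+)", line):
            current["status"] = int(m[1], 16)
        elif m := re.search(r"SOCKETID (\d+)", line):
            current["socket"] = int(m[1])
    return records

# Two records copied from the output above, trimmed to the parsed lines.
sample = """\
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
CPU 24 BANK 9
TIME 1367586416 Fri May  3 21:06:56 2013
STATUS d00000c00009008f MCGSTATUS 0
MCGCAP 1000c18 APICID c0 SOCKETID 3
CPUID Vendor Intel Family 6 Model 47
Hardware event. This is not a software error.
CPU 24 BANK 9
TIME 1369420817 Sat May 25 02:40:17 2013
STATUS d00002800009008f MCGSTATUS 0
MCGCAP 1000c18 APICID c0 SOCKETID 3
CPUID Vendor Intel Family 6 Model 47
"""

records = parse_mcelog(sample)
by_location = Counter(
    (r.get("socket"), r.get("cpu"), r.get("bank")) for r in records
)
for (socket, cpu, bank), count in by_location.items():
    print(f"socket {socket} cpu {cpu} bank {bank}: {count} event(s)")
```

In the excerpt above every record comes from CPU 24, bank 9, socket 3, so the tally collapses to a single location.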

Checking with mcelog --client suggests these are memory errors, but it cannot tell which DIMM is affected.

[root@read3 ~]# mcelog --client
Memory errors
SOCKET 3 CHANNEL any DIMM any
corrected memory errors:
        8 total
        0 in 24h
uncorrected memory errors:
        0 total
        0 in 24h

SOCKET 3 CHANNEL 0 DIMM any
corrected memory errors:
        1 total
        1 in 24h
uncorrected memory errors:
        0 total
        0 in 24h
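
For monitoring, the counters above can be turned into a dictionary keyed by location. This is a minimal sketch based solely on the output format shown here; `parse_client_output` and the key names are invented for the example, and a different mcelog version may format its output differently.

```python
import re

def parse_client_output(text):
    """Map (socket, channel, dimm) to the error counters printed by mcelog --client."""
    result = {}
    location = kind = None
    for line in text.splitlines():
        line = line.strip()
        if m := re.match(r"SOCKET (\S+) CHANNEL (\S+) DIMM (\S+)", line):
            location, kind = (m[1], m[2], m[3]), None
            result[location] = {}
        elif m := re.match(r"(corrected|uncorrected) memory errors:", line):
            kind = m[1]
        elif location and kind and (m := re.match(r"(\d+) (total|in 24h)$", line)):
            # "in 24h" becomes "in_24h" so the keys are easy to type
            result[location][f"{kind}_{m[2].replace(' ', '_')}"] = int(m[1])
    return result

# Output copied from the session above.
sample = """\
Memory errors
SOCKET 3 CHANNEL any DIMM any
corrected memory errors:
        8 total
        0 in 24h
uncorrected memory errors:
        0 total
        0 in 24h

SOCKET 3 CHANNEL 0 DIMM any
corrected memory errors:
        1 total
        1 in 24h
uncorrected memory errors:
        0 total
        0 in 24h
"""

counters = parse_client_output(sample)
for location, values in counters.items():
    print(location, values)
```

The values are kept as strings for socket, channel and DIMM because the tool prints "any" when it cannot narrow the location down.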

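Checking memory with the edac tool fails, possibly because the motherboard is not supported. The hardware management interface shows the server status as normal. The open question is how to locate the DIMM that is producing the errors.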

[root@read3 ~]# edac-util -v
edac-util: Error: No memory controller data found.

Device information:

[root@read3 ~]# dmidecode -t 1
# dmidecode 2.11
# SMBIOS entry point at 0x7eba9000
SMBIOS 2.7 present.

Handle 0x0054, DMI type 1, 27 bytes
System Information
        Manufacturer: IBM
        Product Name: System x3850 X5 -[7143TUF]-
        Version: 06
        Serial Number: 06L3677
        UUID: 0D009FC8-E7F6-11E1-81AA-2440B5B13980
        Wake-up Type: Power Switch
        SKU Number: Not Specified
        Family: System X

References:
MCELOG:
http://mcelog.org/manpage.html
https://www.cnblogs.com/zhangxinglong/p/10886697.html
https://blog.csdn.net/owlcity123/article/details/107016689