Rasdaemon

Rasdaemon is a platform Reliability, Availability and Serviceability (RAS) monitoring tool that can, among other things, monitor ECC memory errors on supported platforms.

Installing

apk add rasdaemon

Logging

Rasdaemon logs to syslog. The syslog can be monitored automatically, for example with logcheck, to send email notifications.
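For a quick manual check, the rasdaemon entries can be filtered out of the syslog file. This is a minimal sketch: the helper name rasdaemon_log is hypothetical, and the default path /var/log/messages is an assumption that depends on the syslog daemon and its configuration.

```shell
#!/bin/sh
# Hypothetical helper: print the lines mentioning rasdaemon from a syslog file.
# The default path is an assumption; pass another file as the first argument
# if your syslog daemon writes elsewhere.
rasdaemon_log() {
    grep 'rasdaemon' "${1:-/var/log/messages}"
}
```

Calling rasdaemon_log without arguments reads the assumed default file; rasdaemon_log /path/to/logfile reads another one.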

Additionally, rasdaemon records events in /var/lib/rasdaemon/ras-mc_event.db, which can be read using ras-mc-ctl (in this example, a faulty memory module has generated a few errors):

# ras-mc-ctl --errors
Memory controller events:
1 2025-01-30 01:42:46 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#0channel#0 location: 0:0:0:-1, addr 0, grain 6, syndrome 355
2 2025-01-30 02:34:53 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#1channel#1 location: 0:1:1:-1, addr 0, grain 6, syndrome 23816

No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No Memory failure errors.

MCE events:
1 2025-01-30 01:42:46 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0xac302a80, misc=0xd01a000401000000, walltime=0x679abcf6, cpuid=0x00a50f00, bank=0x00000011, microcode=0x0a500011
2 2025-01-30 02:34:53 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=1, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x211cca540, misc=0xd01a000801000000, walltime=0x679ac92d, cpuid=0x00a50f00, bank=0x00000012, microcode=0x0a500011
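To see which memory location produces the most events, the report can be summarized per mc#/csrow#/channel# label. This is a rough sketch, assuming the output format shown in the example above; count_mc_events is a hypothetical helper name, and it counts event lines rather than summing any per-event error counter.

```shell
#!/bin/sh
# Hypothetical helper: read `ras-mc-ctl --errors` output on stdin and print
# the number of memory controller events per location label.
# Assumes event lines of the form "... at mc#0csrow#0channel#0 location: ...".
count_mc_events() {
    # Extract the label that sits between " at " and " location:",
    # then tally identical labels. Lines without that pattern are ignored.
    sed -n 's/.* at \(mc#[^ ]*\) location:.*/\1/p' | sort | uniq -c
}

# Usage (as root):
#   ras-mc-ctl --errors | count_mc_events
```

A label whose count keeps growing over time may help narrow down which memory channel or module to inspect first.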