Rasdaemon

From Alpine Linux
Revision as of 15:11, 1 February 2025 by Jarp

Machines with ECC memory and a supported chipset can be monitored for ECC errors using rasdaemon. Rasdaemon also monitors other kinds of hardware events, so it can be useful even without ECC memory.

Installing

apk add rasdaemon

Logging

Rasdaemon logs to syslog. Syslog can be monitored automatically, for example with logcheck sending email alerts.
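As a rough illustration of checking syslog by hand, the sketch below counts the lines that mention rasdaemon in a log file. The helper name ras_log_count is hypothetical, the default path /var/log/messages is an assumption about where your syslog daemon writes, and the match assumes that entries carry the rasdaemon program tag.

```shell
# Hypothetical helper: count syslog lines that mention rasdaemon.
# $1 is the log file; /var/log/messages is only an assumed default.
ras_log_count() {
    log_file="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, so "|| true" keeps the helper from failing.
    grep -c 'rasdaemon' "$log_file" || true
}
```

Calling ras_log_count with no argument reads the assumed default file. A tool such as logcheck is still the better fit for unattended monitoring.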

Rasdaemon also records events in /var/lib/rasdaemon/ras-mc_event.db, which can be read with ras-mc-ctl. In this example a faulty memory module has generated a few errors:

# ras-mc-ctl --errors
Memory controller events:
1 2025-01-30 01:42:46 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#0channel#0 location: 0:0:0:-1, addr 0, grain 6, syndrome 355
2 2025-01-30 02:34:53 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#1channel#1 location: 0:1:1:-1, addr 0, grain 6, syndrome 23816

No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No Memory failure errors.

MCE events:
1 2025-01-30 01:42:46 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0xac302a80, misc=0xd01a000401000000, walltime=0x679abcf6, cpuid=0x00a50f00, bank=0x00000011, microcode=0x0a500011
2 2025-01-30 02:34:53 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=1, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x211cca540, misc=0xd01a000801000000, walltime=0x679ac92d, cpuid=0x00a50f00, bank=0x00000012, microcode=0x0a500011
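To see at a glance which memory location is producing errors, the memory controller section of this output can be totalled per location. The following is a minimal sketch and not part of rasdaemon: the function name ras_mc_summary is hypothetical, and the parsing assumes the line layout shown in the example above, with the error count in the fifth field and a label of the form mc#NcsrowNchannel#N. Other rasdaemon versions or drivers may print different labels.

```shell
# Hypothetical helper: read "ras-mc-ctl --errors" output on stdin and
# print one line per memory location with its total error count.
ras_mc_summary() {
    awk '
        # Track whether we are inside the memory controller section.
        /events:$/ { in_mc = ($0 ~ /^Memory controller/); next }
        # A "No ... errors." line belongs to some other section.
        /^No /     { in_mc = 0 }
        in_mc && match($0, /mc#[0-9]+csrow#[0-9]+channel#[0-9]+/) {
            label = substr($0, RSTART, RLENGTH)
            total[label] += $5   # field 5 is assumed to be the count
        }
        END { for (label in total) print label, total[label] }
    ' | sort
}
```

It would be used as: ras-mc-ctl --errors | ras_mc_summary. A location whose count keeps growing points at the module worth replacing first.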