Rasdaemon

From Alpine Linux

Revision as of 15:15, 1 February 2025

Rasdaemon is a platform Reliability, Availability and Serviceability (RAS) monitoring tool that can, among other things, monitor ECC memory errors on supported platforms.

Installing

apk add rasdaemon

Logging

Rasdaemon logs to syslog. Syslog can be monitored automatically, e.g. using logcheck with automated emails.
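As a rough illustration of what automated monitoring looks for, the following Python sketch filters syslog lines that appear to come from rasdaemon. It is not part of rasdaemon or logcheck. The function name `rasdaemon_lines` is hypothetical, and the sketch assumes that rasdaemon entries carry the program tag `rasdaemon` (optionally followed by a PID), which may differ depending on the syslog daemon and its configuration.

```python
import re
from typing import Iterable, List

# Assumption: rasdaemon entries are tagged "rasdaemon:" or "rasdaemon[PID]:"
# in syslog. Adjust the pattern if your syslog daemon formats tags differently.
RAS_TAG = re.compile(r"\brasdaemon(\[\d+\])?:")


def rasdaemon_lines(lines: Iterable[str]) -> List[str]:
    """Return the syslog lines that appear to have been written by rasdaemon."""
    return [line.rstrip("\n") for line in lines if RAS_TAG.search(line)]
```

The function accepts any iterable of lines, so an open log file can be passed directly. The location of that file depends on the syslog configuration in use.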

Additionally, rasdaemon logs to /var/lib/rasdaemon/ras-mc_event.db, which can be read using ras-mc-ctl (in this example, a faulty memory module has generated a few errors):

# ras-mc-ctl --errors
Memory controller events:
1 2025-01-30 01:42:46 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#0channel#0 location: 0:0:0:-1, addr 0, grain 6, syndrome 355
2 2025-01-30 02:34:53 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#1channel#1 location: 0:1:1:-1, addr 0, grain 6, syndrome 23816
No PCIe AER errors.
No Extlog errors.
No devlink errors.
No disk errors.
No Memory failure errors.
MCE events:
1 2025-01-30 01:42:46 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0xac302a80, misc=0xd01a000401000000, walltime=0x679abcf6, cpuid=0x00a50f00, bank=0x00000011, microcode=0x0a500011
2 2025-01-30 02:34:53 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=1, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x211cca540, misc=0xd01a000801000000, walltime=0x679ac92d, cpuid=0x00a50f00, bank=0x00000012, microcode=0x0a500011
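To see which memory location is producing errors, the memory controller events in the output above can be totalled per location. The Python sketch below is one possible way to do that. The function name `count_errors_by_location` is hypothetical, and the regular expression is derived only from the sample output shown here, so other rasdaemon versions or platforms may need a different pattern.

```python
import re
from collections import Counter

# Matches the "<count> <type> error(s): ... at <label> location:" part of a
# memory controller event. Assumption: the layout matches the sample output
# above; it is not taken from any rasdaemon specification.
EVENT = re.compile(r"\b(\d+) (\w+) error\(s\): .*? at (\S+) location:")


def count_errors_by_location(text: str) -> Counter:
    """Sum the reported error counts per (location label, error type)."""
    totals: Counter = Counter()
    for match in EVENT.finditer(text):
        count, kind, label = match.groups()
        totals[(label, kind)] += int(count)
    return totals
```

The text to analyse could be captured with, for example, `subprocess.run(["ras-mc-ctl", "--errors"], capture_output=True, text=True).stdout`. A label whose total keeps growing points at the channel or row worth inspecting first.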