Rasdaemon

From Alpine Linux

Latest revision as of 04:27, 5 February 2025

Rasdaemon (https://github.com/mchehab/rasdaemon) is a Platform Reliability, Availability and Serviceability (RAS) monitoring tool which can, among other things, monitor ECC memory errors on supported platforms.

Installing

apk add rasdaemon

Logging

Rasdaemon logs to syslog, which can be monitored automatically, for example with logcheck and automated emails.
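
For a quick manual check without logcheck, the syslog can simply be filtered for rasdaemon entries. The following is a minimal sketch, not a replacement for logcheck: the helper name `rasdaemon_lines` is hypothetical, and the log path in the usage comment is an assumption that depends on which syslog daemon is in use.

```shell
#!/bin/sh
# Hypothetical helper: print only the log lines that mention rasdaemon.
# It reads log text on stdin, so it works with any syslog file location.
rasdaemon_lines() {
    # grep exits non-zero when nothing matches; treat "no entries" as success.
    grep -i 'rasdaemon' || true
}

# Example usage (the path is an assumption; adjust it to your syslog setup):
# rasdaemon_lines < /var/log/messages
```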

Additionally, rasdaemon logs to /var/lib/rasdaemon/ras-mc_event.db, which can be read using ras-mc-ctl (in this example, a faulty memory module has generated a few errors):

# ras-mc-ctl --errors
Memory controller events:
1 2025-01-30 01:42:46 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#0channel#0 location: 0:0:0:-1, addr 0, grain 6, syndrome 355
2 2025-01-30 02:34:53 +0200 1 Corrected error(s): Cannot decode normalized address at mc#0csrow#1channel#1 location: 0:1:1:-1, addr 0, grain 6, syndrome 23816

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

MCE events:
1 2025-01-30 01:42:46 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=17), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=0, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0xac302a80, misc=0xd01a000401000000, walltime=0x679abcf6, cpuid=0x00a50f00, bank=0x00000011, microcode=0x0a500011
2 2025-01-30 02:34:53 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci Error_overflow CECC, mca DRAM ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=1, mcgcap=0x0000011c, status=0xdc2040000000011b, addr=0x211cca540, misc=0xd01a000801000000, walltime=0x679ac92d, cpuid=0x00a50f00, bank=0x00000012, microcode=0x0a500011
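
When the list grows, it can help to see which memory location the events cluster on. The sketch below counts the memory controller event lines per mc/csrow/channel location. It assumes the output format shown above, and the helper name `mc_errors_by_location` is hypothetical. It counts event lines, not the per-event error count column.

```shell
#!/bin/sh
# Hypothetical helper: read `ras-mc-ctl --errors` output on stdin and print
# "<number of event lines> <location>" for each mc#/csrow#/channel# location.
# Only memory controller event lines contain the "mc#" location string, so
# MCE event lines and the "No ... errors." lines are ignored.
mc_errors_by_location() {
    awk '
        match($0, /mc#[0-9]+csrow#[0-9]+channel#[0-9]+/) {
            count[substr($0, RSTART, RLENGTH)]++
        }
        END { for (loc in count) print count[loc], loc }
    ' | sort -k2
}

# Example usage (run as root, like the command above):
# ras-mc-ctl --errors | mc_errors_by_location
```

For the sample output above, this would list each of the two locations once, since each location has a single event line. A location whose count keeps increasing is a hint about which module to inspect first.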