Wednesday, May 10, 2006

PSH + SMF = less downtime

Richard's Ranch has posted some information on Memory Page Retirement in Solaris, which is part of Predictive Self Healing. For more details, read "Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults".

What a coincidence: just one day earlier, one of our servers hit an uncorrectable memory error. Fortunately it happened in user space, so Solaris 10 simply cleared the page and killed the affected application, and thanks to SMF the application was restarted automatically. All of this happened not only automatically but also quickly enough that our monitoring detected the problem AFTER Solaris had already taken care of it and everything was working properly again.

Here is the report in /var/adm/messages about the memory problem:

May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 321281 kern.warning] WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x000c303b.ed832017
May 8 22:47:03 syrius.poczta.srv AFSR 0x00000000.00200000 AFAR 0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xffffffff7e7043c8
May 8 22:47:03 syrius.poczta.srv UDBH 0x00a0 UDBH.ESYND 0xa0 UDBL 0x02fc UDBL.ESYND 0xfc
May 8 22:47:03 syrius.poczta.srv UDBL Syndrome 0xfc Memory Module Board 6 J????
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 714160 kern.info] [AFT2] errID 0x000c303b.ed832017 PA=0x00000001.f0733b38
May 8 22:47:03 syrius.poczta.srv E$tag 0x00000000.18c03e0e E$State: Exclusive E$parity 0x0c
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x00): 0x2d002d01.2d022d03
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x08): 0x2d672d68.2d692d6a
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x10): 0x2d6b2d09.2c912c92
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x18): 0x2c932d0d.2d0e2d0f
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x20): 0x2d102d11.2d122d13
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x28): 0x00000000.09040000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 359263 kern.info] [AFT2] E$Data (0x30): 0x000006ea.00002090
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 989652 kern.info] [AFT2] E$Data (0x38): 0x2091001c.2d1d2d1e *Bad* PSYND=0x00ff
May 8 22:47:03 syrius.poczta.srv unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x00000001.f0732000
May 8 22:47:03 syrius.poczta.srv SUNW,UltraSPARC-II: [ID 863414 kern.info] [AFT3] errID 0x000c303b.ed832017 Above Error is in User Mode
May 8 22:47:03 syrius.poczta.srv and is fatal: will SIGKILL process and notify contract
May 8 22:47:20 syrius.poczta.srv unix: [ID 221039 kern.notice] NOTICE: Previously reported error on page 0x00000001.f0732000 cleared
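
As a quick sanity check on the messages above, the page that Solaris scheduled for clearing can be derived from the faulting address in AFAR. The sketch below is only an illustration, not anything Solaris itself runs: it assumes the 8 KB base page size of UltraSPARC, and the helper names (parse_split_hex, page_base, format_split_hex) are made up for this example.

```python
# Assumption: 8 KB base page size, as used on UltraSPARC.
PAGE_SIZE = 8 * 1024


def parse_split_hex(value: str) -> int:
    """Convert the kernel's 'HIGH.LOW' notation (two 32-bit halves) to an int."""
    high, low = value.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def page_base(addr: int, page_size: int = PAGE_SIZE) -> int:
    """Round an address down to the start of the page that contains it."""
    return addr & ~(page_size - 1)


def format_split_hex(value: int) -> str:
    """Format an int back into the 'HIGH.LOW' notation used in the log."""
    return "0x%08x.%08x" % (value >> 32, value & 0xFFFFFFFF)


# AFAR value copied from the log above.
afar = parse_split_hex("0x00000001.f0733b38")
print(format_split_hex(page_base(afar)))  # 0x00000001.f0732000
```

The result matches the page address in the "Scheduling clearing of error on page" and "Previously reported error on page ... cleared" notices.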


Then, just by using 'svcs', I learned which application had been restarted and looked at its SMF log file, which contains the following (XXXXX replaces the application path):

[ May 8 22:47:03 Stopping because process killed due to uncorrectable hardware error. ]
[ May 8 22:47:03 Executing stop method ("XXXXXXXX stop") ]
[ May 8 22:47:04 Method "stop" exited with status 0 ]
bash: line 1: 22242 Killed LD_PRELOAD=libumem.so.1 XXXXXXXX
[ May 8 22:48:44 Executing start method ("XXXXXXXXX start") ]
[ May 8 22:48:46 Method "start" exited with status 0 ]
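
To put a number on "quick enough", the timestamps in the SMF log above can be subtracted. This is a rough sketch rather than a real tool: it assumes the bracketed "[ Mon D HH:MM:SS message ]" layout seen in this excerpt, takes the year 2006 from the date of this post (the log lines carry no year), and the names LINE_RE and parse_events are invented for the example.

```python
import re
from datetime import datetime

# Log excerpt copied from above (application path already masked).
SMF_LOG = """\
[ May 8 22:47:03 Stopping because process killed due to uncorrectable hardware error. ]
[ May 8 22:47:03 Executing stop method ("XXXXXXXX stop") ]
[ May 8 22:47:04 Method "stop" exited with status 0 ]
bash: line 1: 22242 Killed LD_PRELOAD=libumem.so.1 XXXXXXXX
[ May 8 22:48:44 Executing start method ("XXXXXXXXX start") ]
[ May 8 22:48:46 Method "start" exited with status 0 ]
"""

# Matches only the bracketed lines; group 1 is the timestamp, group 2 the message.
LINE_RE = re.compile(r"^\[ (\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (.*) \]$")


def parse_events(text, year=2006):
    """Return (timestamp, message) pairs for the bracketed log lines."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip method output such as the bash "Killed" line
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, match.group(2)))
    return events


events = parse_events(SMF_LOG)
outage = events[-1][0] - events[0][0]
print(outage.total_seconds())  # 103.0
```

So from the moment the process was killed until the start method returned, the service was down for well under two minutes.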

1 comment:

Anonymous said...

SMF+FMA=PSH