Sunday, January 14, 2018

kernel - Debug faulty hardware


My sister has a laptop that crashed regularly under Windows (blue screens), even though the hardware is relatively new and up to date. Back then she sent the Windows dump files to Dell, who sent an engineer to replace the motherboard. Even so, after installing several Ubuntu versions with many different kernels, the kernel panics did not go away.


So I decided to find the exact cause of the problem. I installed and configured the linux-crashdump package (kdump-tools) so that kexec automatically starts a crash kernel, which writes a dump file of the memory and also stores the dmesg output.
I also installed crash, the kernel-image-generic-dbgsym package and mcelog in order to gather as much information as possible.


The laptop then crashed, and the crash kernel successfully generated a dump file and stored the dmesg output.
I also checked /var/log/mcelog, but the file was completely empty even though the daemon had been running before the crash, which is strange. Still, we have the dmesg output, which states:


[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10: {_raw_spin_lock+0x12/0x50}
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285
[ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15
[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5>
[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285
[ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15
[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 3933.364197] Kernel panic - not syncing: Fatal Machine check

So my first question, regarding "Run the above through 'mcelog --ascii'": what exactly should I run there, and how?
I tried, for example:


[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135 | sudo mcelog --ascii

which simply returned nothing. So what am I supposed to do here?


I also ran


crash  /usr/lib/debug/boot/vmlinux /path/to/crashdump/file

which started the program as expected. I typed bt to generate a backtrace, which gave me:


PID: 0      TASK: ffff8804177617f0  CPU: 6   COMMAND: "swapper/6"
#0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732
#1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3
#2 [ffff88042dd89db8] panic at ffffffff8170ec6c
#3 [ffff88042dd89e30] mce_panic at ffffffff8103687a
#4 [ffff88042dd89e70] do_machine_check at ffffffff81038684
#5 [ffff88042dd89f50] machine_check at ffffffff8171e25f
[exception RIP: intel_idle+216]
RIP: ffffffff813dfd78 RSP: ffff88041775de28 RFLAGS: 00000046
RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff81c93220 RDI: 0000000000000006
RBP: ffff88041775de50 R8: ffff88042dd912d0 R9: 000000000000001c
R10: 0000000000000320 R11: 0000000000000249 R12: 0000000000000002
R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff81c932e8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- ---
#6 [ffff88041775de28] intel_idle at ffffffff813dfd78
#7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570
#8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9
#9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae
#10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85
#11 [ffff88041775df30] start_secondary at ffffffff81040fc8

To sum up, I would like to know how to invoke mcelog on the dmesg output, and what other steps you would take to get as much information about the problem as possible and find the faulty component, so that I can contact the hardware vendor with an educated guess about what's wrong.


I know that a memory check can help me establish with high probability that the RAM is not the cause.


EDIT: I have found out how to pass the output to mcelog correctly: put the output lines that precede "Run the above through 'mcelog --ascii'" in a file and invoke mcelog with


sudo mcelog --ascii < file

The "Run the above through 'mcelog --ascii'" message is printed twice in the dmesg output, so I invoked mcelog twice, each time with the lines starting at the "CPU" line and ending on the line before the message (I left out the dmesg prefix, i.e. "[ 3933.364173] mce: [Hardware Error]:").
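The manual preparation described above (drop the dmesg prefix, cut one record per "Run the above through" marker) can be sketched in Python. This is only an illustration, not part of mcelog: the function name split_mce_records and the SAMPLE text are hypothetical, and the regular expression assumes the prefix format seen in this particular dmesg output.

```python
import re

# Assumed prefix format, taken from the dmesg lines in this post:
# "[ 3933.364173] mce: [Hardware Error]: "
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s+mce: \[Hardware Error\]:\s*")
END_MARKER = "Run the above through"

def split_mce_records(dmesg_text):
    """Return one prefix-free text block per machine check record."""
    records, current = [], []
    for line in dmesg_text.splitlines():
        if not PREFIX.match(line):
            continue  # not an mce line (e.g. the kernel panic line)
        body = PREFIX.sub("", line)
        if body.startswith(END_MARKER):
            # The marker closes the record collected so far.
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(body)
    # Lines after the last marker are a summary, not a record, so they are dropped.
    return records

SAMPLE = """
[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285
[ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285
[ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 3933.364197] Kernel panic - not syncing: Fatal Machine check
"""

records = split_mce_records(SAMPLE)
for i, record in enumerate(records):
    print(f"--- record {i} ---")
    print(record)
```

Each printed record could then be saved to its own file and passed to mcelog with the redirect shown above.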


So mcelog tells me:


Hardware event. This is not a software error.
CPU 4 BANK 3 TSC a0255fbd7f7
RIP !INEXACT! 10:ffffffff8171d9c2
MISC d62285 ADDR 42dd14480
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
RIP: _raw_spin_lock+0x12/0x50}
SOCKET 0 APIC 1 microcode 15

and


Hardware event. This is not a software error.
CPU 0 BANK 3 TSC a0255fbd7f0
RIP !INEXACT! 33:45a7992c1b5
MISC d62285 ADDR 42dd14480
TIME 1398357146 Thu Apr 24 18:32:26 2014
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Data CACHE Level-1 Data-Read Error
STATUS be00000000200135 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 58
SOCKET 0 APIC 0 microcode 15

So, assuming the motherboard is okay (as it has been replaced) and the RAM is okay, only the CPU is left as the troublemaker, right?
Is anyone familiar with all the output given?
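One field of the output can be cross-checked by hand: the "PROCESSOR 0:306a9" value in dmesg is the CPUID signature that mcelog reports as "Family 6 Model 58". The sketch below assumes the usual x86 signature layout (stepping in bits 3:0, model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20). The function name decode_cpuid is hypothetical.

```python
def decode_cpuid(signature):
    """Split a CPUID signature into display family, display model and stepping."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended fields only apply for certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping

family, model, stepping = decode_cpuid(0x306A9)  # value from "PROCESSOR 0:306a9"
print(f"Family {family} Model {model} Stepping {stepping}")
```

For 0x306a9 this yields family 6 and model 58, in agreement with the mcelog line above.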



Good on you for the instrumentation you chose; that's exactly how to run down a problem like this.


crash needs the Linux debug symbols, which are about 600MB per kernel; that is why they're not installed by default. Here's how to install them and invoke crash using the symbols:


https://wiki.ubuntu.com/Kernel/CrashdumpRecipe


It's a little late for me at the moment to do an in-depth analysis of your machine check, but my initial impression is that either the cache on the CPU or main memory is compromised.
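The cache suspicion can be checked against the raw STATUS value be00000000200135 from the question. The following is a minimal sketch, assuming the MCi_STATUS layout as described in the Intel architecture manuals (flag bits 63 down to 57, MCA error code in bits 15:0, cache hierarchy codes of the form 000F 0001 RRRR TTLL). The tables are partial and the name decode_status is hypothetical, so verify against the manual before relying on it.

```python
# Assumed flag positions in MCi_STATUS (high bits of the register).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Partial tables for the cache hierarchy error code 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPE = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}

def decode_status(status):
    """Decode the flag bits and, if applicable, the cache hierarchy code."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    decoded = {"flags": flags, "mca_code": mca_code, "cache": None}
    # Ignore the F bit (bit 12) when matching the cache hierarchy pattern.
    if (mca_code & 0xEF00) == 0x0100:
        decoded["cache"] = (
            TYPE.get((mca_code >> 2) & 0x3, "?"),
            LEVEL.get(mca_code & 0x3, "?"),
            REQUEST.get((mca_code >> 4) & 0xF, "?"),
        )
    return decoded

decoded = decode_status(0xBE00000000200135)  # STATUS from the mcelog output
print(decoded["flags"])
print(hex(decoded["mca_code"]), decoded["cache"])
```

Under these assumptions the value decodes to an uncorrected error (UC) with processor context corrupt (PCC) and a data read from the level 1 data cache, which is what mcelog prints as "Data CACHE Level-1 Data-Read Error". The same code is reported on CPU 4 and CPU 0.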


I would demand a full warranty replacement.


If that's not possible, swap out the RAM, which is an inexpensive test. If the problem persists, you can be reasonably confident that the CPU is the source, at which point I would seriously weigh the cost of replacing the CPU against the cost of a new computer.

