Author Topic: Catching Kernel Dumps (Read 3892 times)

llondel · « **on:** March 04, 2011, 10:43:46 AM »

I've progressed with my 486sx stuff to the point where everything boots and runs happily most of the time. However, I have a repeatable hard kernel crash that I'm trying to track down. It appears to be resource-related, in that if I take the bootable memory stick and plug it into my laptop it copes properly. However, that's got 1.5GB of RAM, whereas the 486sx target only has 128MB. The running system only consumes about 30-34MB so it's not a memory issue as such, more a case of how kernel resources are allocated.

The problem is that when the kernel oops occurs, it scrolls all the useful information off the top of the console screen. None of my attempts to catch it with syslog or redirect the console to a serial port have worked so I assume I'm missing some compilation option. I resorted to videoing the screen when provoking the crash, which mostly fails due to persistence and frame rates, but I did catch a frame that clearly mentioned a kernel null pointer.

I even know which call is causing the problem. It's this line:

res = setsockopt(s_, SOL_IP, IP_ADD_MEMBERSHIP, (void*)&req, sizeof(req));

which will execute happily in normal operation, dealing with UDP multicast sockets. The parameters all make sense.

So, does anyone know what I need to do to get the kernel oops output saved somewhere useful? GDB doesn't work either, that blows up with a segmentation fault of its own.

tinypoodle · « **Reply #1 on:** March 04, 2011, 01:58:55 PM »

Quote from: llondel on March 04, 2011, 10:43:46 AM

None of my attempts to catch it with syslog

Care to elaborate which parameters you tried exactly so far?

Rich · « **Reply #2 on:** March 04, 2011, 03:03:17 PM »

Hi llondel
For starters, that (void*) does not belong there. setsockopt expects the address of a structure for
this particular option. I'm guessing that (void*) was put there to make a compiler error or warning
go away. If you define req as "struct ip_mreq req;" the compiler should be happy. There's a nice
write-up at http://tldp.org/HOWTO/Multicast-HOWTO-6.html if you're interested. Be sure to include
<netinet/in.h>, not <linux/in.h>.

llondel · « **Reply #3 on:** March 05, 2011, 10:21:46 AM »

Let's see... I can't use a built-in serial port because they don't appear to exist, and a check of dmesg agrees that it didn't notice any either. With hindsight I should have bought at least one unit with built-in serial ports...

I tried console=/dev/ttyUSB0, console=ttyUSB0 and variants such as console=ttyUSB0,9600n8 with a USB-to-serial adapter plugged in. It had an effect in that the console output no longer appeared on-screen but it didn't appear out of the serial port. Microcom would happily talk to it, so the basic serial link is there.

I've tried redirecting syslog to a remote machine, but it doesn't pick up what I want (but will respond to logger commands). I'm guessing that's because the Busybox version doesn't handle it. Attempting to write syslog to the USB stick (only non-volatile memory in the system) didn't work either.

The crash is not time-critical, in that I can have a system up and running and install anything else necessary before provoking the crash, the trick is knowing what I need.

It won't do it on a virtual machine booting the same code on a Core 2 processor, even when restricted to a single core and 128MB, so it is possible that the problem itself is related to it being a 486 machine.

As for the (void*), that's deep down in library code. Not mine, honest guv!

It's actually a struct ip_mreqn and works happily most times.

curaga · « **Reply #4 on:** March 05, 2011, 10:25:57 AM »

In your tests on bigger machines, you have forced the fpu emulation on right?

Rich · « **Reply #5 on:** March 05, 2011, 11:45:29 AM »

Hi llondel
I wasn't judging, merely making an observation. Sometimes an inappropriate cast to shut up the compiler can lead to
subtle bugs. It's possible the USB stick is too slow to keep up. How about if you have your laptop share a
subdirectory that's located in RAM, have your 486sx mount it, move the syslog to the mountpoint and add a
link from the old location to the new one? Even simpler would be to start nc (netcat) on the laptop listening
on a port and dumping to a file, then redirect (or pipe?) the console on your 486sx through nc to the laptop.

llondel · « **Reply #6 on:** March 05, 2011, 04:03:43 PM »

Quote from: curaga on March 05, 2011, 10:25:57 AM

In your tests on bigger machines, you have forced the fpu emulation on right?

Yes, booted with the no387 option. Doing a cat /proc/cpuinfo also claimed no fpu in that case.

My first thought was that it might be a kernel resource allocated on the basis of available RAM, but the virtual machine will happily run with 64MB, half what the 486 has available. If I could catch the kernel oops trace then I hope to be able to work back from the break point and deduce what's happening and why it doesn't happen with the newer CPU.

The application uses UDP sockets and is based on C++ classes. The primitive end where it sets up four sets of resources in succession but tears them down works fine, but normal operation where it holds open the sockets and presumably holds other resources causes the crash. I'm still working through lots of code trying to understand what's going on and hoping for a short-cut as to where to look, hence the original kernel dump question.

llondel · « **Reply #7 on:** March 05, 2011, 04:50:27 PM »

Quote from: Rich on March 05, 2011, 11:45:29 AM

Even simpler would be to start nc (netcat) on the laptop listening on a port and dumping to a file, then redirect (or pipe?) the console on your 486sx through nc to the laptop.

Just tried this and it picks up routine stuff but still doesn't get the crash dump, which only goes to the console.

What I really need is a 50-line console, any idea how to convince it to give me that?

Rich · « **Reply #8 on:** March 05, 2011, 05:11:48 PM »

Hi llondel
When you say console, do you mean you are running without a GUI? If so, you can try the vga= boot
parameter to push up the screen resolution; that might give you smaller characters and more lines.
http://www.mjmwired.net/kernel/Documentation/svga.txt

llondel · « **Reply #9 on:** March 07, 2011, 02:01:53 AM »

OK, that helped, it gave me 50 lines and actually slows down the scrolling. With a decent camera (set to fast shutter speed) I got a photo of the trap that confirms that now I'm only missing the two top lines when it stops. It's actually a double trap, a kernel null pointer followed by a kernel panic:

Kernel panic - not syncing: Fatal exception in interrupt

Of course, now it's an exercise in getting a debug-enabled set-up so that all the hex numbers have useful text attached. That puts me back more where I think I can make progress, although the full initrd compressed image was rather large, taking up over half the system RAM. Must prune out the unwanted drivers.

Rich · « **Reply #10 on:** March 07, 2011, 10:34:53 AM »

Hi llondel
Here's a long shot you can try to slow the scrolling down. Try using the lpj= boot parameter.
Find the line in dmesg that lists BogoMIPS; it should show what it used for lpj. Double that number
and use it with the boot parameter. Also, be prepared for the possibility that the bug's symptoms
change or disappear when you alter initrd.

llondel · « **Reply #11 on:** March 07, 2011, 12:00:43 PM »

I might be getting closer. The r6040 ethernet driver is the current suspect, right at the bottom of the pile. I'm trying to track down a newer version (or suitable patches) to apply to the r6040.c driver file. There appears to have been a bunch of multicast fixes late last year which may be the answer. However, the version I've got with the Tinycore 2.6.33 (v3.5) source is v0.25 from August 2009, and the fixing patch is for v0.26->v0.27. I dug out a version of the code that claims to be v0.26 but it won't compile as a TC module.

I can see I'll be exercising my Google-fu tonight.

llondel · « **Reply #12 on:** March 07, 2011, 02:50:56 PM »

Quote from: Rich on March 07, 2011, 10:34:53 AM

Also, be prepared for the possibility that the bug's symptoms change or disappear when you alter initrd.

They've been depressingly consistent so far, whether I've compiled with debug symbols or not, and regardless of what ends up in the initrd image.

llondel · « **Reply #13 on:** March 07, 2011, 05:04:34 PM »

I might have solved the problem. I grabbed the 2.6.37.2 kernel sources and built that based on the kernel config I was using with 2.6.33.3. It booted and ran first time, minus ethernet (due to changes to r6040 that moved the phy stuff to a different module that I'd deleted), and when I put that module (libphy) back in, it's come up and seems to be running everything without crashing. I'll have to see if the r6040 patch is required for my application.
