WelcomeWelcome | FAQFAQ | DownloadsDownloads | WikiWiki

Author Topic: Remote logging for Core 4.4  (Read 2776 times)

Offline maro

  • Hero Member
  • *****
  • Posts: 1228
Remote logging for Core 4.4
« on: April 17, 2012, 12:11:11 AM »
My interest in this subject was triggered by this thread, and whilst I usually take note when a new TC-specific boot code gets added (and then "take it for a spin") I somehow missed the addition of 'rsyslog=HOST[:PORT]'.

I've now taken the opportunity to give it a try: As usual I used a QEMU VM for this exercise, and borrowing the choice of wording I came across in one of the blog entries I read during my "discovery" I'd like to refer to the (presumably) "crashy" machine as the patient (in this case the guest VM) and the logging machine as the doctor (in this case the host system).

I therefore ran on the doctor nc -lup 6666 | tee LOG to provide a recipient for the messages and started the VM with boot code 'rsyslog=10.0.2.2:6666' (plus also added the boot code 'debug' to ensure that all kernel messages are being captured, and 'printk.time=1' to obtain some time stamps).

I noticed that 'syslogd' only starts sending out messages via UDP as the boot process is pretty much completed (and all the earlier ones are "lost"). From then on all messages make their way into '/var/log/messages' as well as getting send to the remote listener, but I guess in many cases (where issues arise much earlier, e.g. during the kernel boot phase) this means that the really "interesting" ones are not available to the doctor.

So far so bad, but it just so happens that about a year ago I stumbled over the 'netconsole' boot code, which in theory could be used to log kernel messages over the network (and hence be used to remotely capture the kernel activities from the the very start of the boot process). The TC 3.x kernel at that time was not built with support for it, and I assumed the same would be the case for the latest 3.0.21 one as well, but to my surprise the kernel configuration now includes 'CONFIG_NETCONSOLE=y'.

The configuration parameter for this boot code has the following format:
    netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
(with more details can be found here). The need to supply a target IP (and port, i.e. '10.0.2.2' and '6666' in my test) is pretty obvious. But initially using 'netconsole=@/,6666@10.0.2.2/' just lead to the following few (relevant) lines in the 'dmesg' output:
Quote
[    0.818983] netconsole: local port 6665
[    0.819213] netconsole: local IP 0.0.0.0
[    0.819318] netconsole: interface 'eth0'
[    0.819425] netconsole: remote port 6666
[    0.819531] netconsole: remote IP 10.0.2.2
[    0.819715] netconsole: remote ethernet address ff:ff:ff:ff:ff:ff
[    0.819897] netconsole: eth0 doesn't exist, aborting.
[    0.820269] netconsole: cleaning up
And of course no messages were sent to the doctor, so the standard Core 3.0.21 kernel is only half of what is needed when one wants to use this boot code. The critical message makes sense, as at this early point in the boot process no NIC module is loaded. Therefore the only way I could think of to overcome this hurdle is to not provide the NIC driver as a module, but rather include it directly as part of the kernel.

To identify the required driver I used grep DRIVER /sys/class/net/*/device/uevent (which turned out to be 'e1000' in my case), and then went on to build a new 'bzImage' (with an altered kernel configuration) by executing:
    sed -i "/^CONFIG_E1000=/s#=m$#=y#" .config && make bzImage

Using the same boot code as above lead to:
Quote
[    1.357148] netconsole: local port 6665
[    1.357370] netconsole: local IP 0.0.0.0
[    1.357478] netconsole: interface 'eth0'
[    1.357584] netconsole: remote port 6666
[    1.357688] netconsole: remote IP 10.0.2.2
[    1.357871] netconsole: remote ethernet address ff:ff:ff:ff:ff:ff
[    1.357963] netconsole: device eth0 not up yet, forcing it
[    1.397939] netconsole: carrier detect appears untrustworthy, waiting 4 seconds
[    5.401680] netconsole: no IP address for eth0, aborting
[    5.401927] netconsole: cleaning up
Which was clearly an improvement, but not yet the intended result. The NIC was presumably up, but without an IP address. This makes sense as the 'udhcpc' process (which typically performs this task) would run only much later.

After a bit of poking around I managed to succeed by specifying a source IP for the NIC directly in the 'netconsole' parameter (i.e. 'netconsole=@10.0.2.15/,6666@10.0.2.2/'), which lead to:
Quote
[    1.530105] netconsole: local port 6665
[    1.530378] netconsole: local IP 10.0.2.15
[    1.530486] netconsole: interface 'eth0'
[    1.530594] netconsole: remote port 6666
[    1.530707] netconsole: remote IP 10.0.2.2
[    1.530891] netconsole: remote ethernet address ff:ff:ff:ff:ff:ff
[    1.531074] netconsole: device eth0 not up yet, forcing it
[    1.549573] netconsole: carrier detect appears untrustworthy, waiting 4 seconds
[    5.597734] console [netcon0] enabled
[    5.598085] netconsole: network logging started
PLUS the kernel messages also showed up at the doctor.

Well, I have to qualify the last statement insofar as I've have seen different results depending on the host system and QEMU version: on my WinXP system (with QEMU v0.11.1) there were some kernel messages that were not sent out over the network. Whilst when using TC 4.4 as host with 'qemu.tcz' (which is v1.0) ALL messages that were in the 'dmesg' output were also received by the doctor. My assumption is that at the moment that the netconcole becomes available flushing out all existing messages via UDP can lead to the loss of some of them.

Nevertheless I'm now keen enough after this exploration that I'll go ahead and figure out all the drivers for all the NICs in my possession (the "real" ones as well as the virtual ones), with a view to build a custom kernel containing all these drivers.


Offline curaga

  • Administrator
  • Hero Member
  • *****
  • Posts: 11056
Re: Remote logging for Core 4.4
« Reply #1 on: April 17, 2012, 06:07:00 AM »
Inspired by your post, I tried the rsyslog bootcode too. It seems the start messages aren't getting around because they are sent at the time of getting them (for kernel messages already in the buffer, it's at the time klogd is started), which is before the network is up.

If you start syslogd and klogd after the network is up, everything is received.
The only barriers that can stop you are the ones you create yourself.