Tiny Core Linux

Tiny Core Base => Micro Core => Topic started by: adrianallen on November 13, 2014, 10:43:43 PM

Title: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 13, 2014, 10:43:43 PM
Sorry that my first post has to be somewhat complex - forgive me if the post seems overlong, I'm trying to include any details that might be relevant.

I've built a Microcore-based recovery environment that installs into the boot partition of a large number (several thousand) of geographically distributed Red Hat-based servers, for remote management and recovery purposes - i.e. if the main OS gets hosed somehow, this gives us a chance to fix things. If you're interested in implementation details, let me know and we can discuss them in another topic.

In GRUB legacy (Redhat Enterprise distros aren't moving to GRUB2 until 7.x which we haven't updated to yet) the primary OS is option 0, and microcore is option 1.

GRUB is configured (in grub.conf) to boot Microcore by default. When the main OS boots successfully, it runs a script from rc.sysinit which essentially does this:

echo "savedefault --default=0 --once" | grub --batch

which causes the next boot attempt to again use the primary OS.

The idea here is that if the main OS fails to boot sometime between GRUB loading and completing all the init scripts, then all you have to do is power cycle the box and it will come back up in microcore, from where you can do disk diagnostics etc. - this all works fine and as expected.
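
The fallback scheme described above can be sketched as a small shell helper. The savedefault line is the one quoted in this post; the function names and the DRY_RUN switch are hypothetical additions so the command can be inspected without touching GRUB. Entry numbers 0 (primary OS) and 1 (Microcore) follow the menu order given above.

```shell
#!/bin/sh
# Build the GRUB legacy batch command quoted in the post for a given menu entry.
grub_once_command() {
    echo "savedefault --default=$1 --once"
}

# Send the command to grub, or only print it when DRY_RUN=1 (hypothetical switch).
set_next_boot() {
    entry="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        grub_once_command "$entry"
    else
        grub_once_command "$entry" | grub --batch
    fi
}
```

With DRY_RUN=1, "set_next_boot 1" only prints the command that would request Microcore for the next boot. Whether the manual switch to Microcore is done exactly this way is an assumption.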

However - for reasons unknown to me, if I manually set GRUB to boot microcore on the next boot attempt from within the primary OS, the "noautologin" option seems to have no effect. Microcore comes up normally and does everything it's supposed to do from within bootsync and bootlocal, but root is logged in instead of displaying a login prompt.

If I reboot from there, allowing the box to come up again in Microcore, noautologin works on the second attempt. So far this is very consistent, though not 100%, at least on my Intel motherboard test system; on a test box with an MSI motherboard it doesn't seem to happen (???). I have no idea why anything to do with hardware would affect how GRUB passes kernel options to Microcore. It's possible the hardware difference is in my imagination and the problem is actually random, and just happens to have occurred in nearly all my tests on the Intel system and in nearly none of the tests on the MSI board, but...

The fact that it works correctly on the second reboot seems to imply that the option is there and correctly configured. I have also added a line like echo "booting" > /etc/sysconfig/noautologin to /opt/bootlocal.sh. On a system that has booted into a root shell instead of a login prompt, I have verified that "cat /proc/cmdline" shows noautologin and that /etc/sysconfig/noautologin exists (even without my change noted above), but somehow the system still comes up in a root shell.
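
The two checks described above can be scripted roughly as follows. This is a sketch only: the function names are hypothetical, and the paths default to the ones mentioned in this post but can be overridden for testing.

```shell
#!/bin/sh
# Return 0 if the option appears as a whole word in a kernel command line string.
cmdline_has_option() {
    option="$1"
    cmdline="$2"
    for word in $cmdline; do
        # Match bare flags (noautologin) as well as key=value forms.
        case "$word" in
            "$option"|"$option"=*) return 0 ;;
        esac
    done
    return 1
}

# Report whether noautologin is on the command line and whether the flag file exists.
report_noautologin() {
    cmdline_file="${1:-/proc/cmdline}"
    flag_file="${2:-/etc/sysconfig/noautologin}"
    if cmdline_has_option noautologin "$(cat "$cmdline_file")"; then
        echo "cmdline: noautologin present"
    else
        echo "cmdline: noautologin missing"
    fi
    if [ -e "$flag_file" ]; then
        echo "flag file: present"
    else
        echo "flag file: missing"
    fi
}
```

Calling report_noautologin with no arguments from the unexpected root shell would show both results at once.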

Any and all suggestions would be welcome - I'm extensively experienced with Redhat/Fedora variants but not as much with Gentoo (although I did run it as my desktop OS for a few months to get a feel for it).

I'm running Core 5.2 and the host OS is Scientific Linux 6.5.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: gerald_clark on November 13, 2014, 11:05:10 PM
It sounds like you have made some changes, as even without noautologin it would boot up and end up in a tc shell, not a root shell.
Also, I have no idea why you brought up Gentoo.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: curaga on November 14, 2014, 01:24:50 AM
Doublecheck what is in root's .profile if you've modified it, and perhaps try adding some debug statements in there to see what happens.
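
A minimal sketch of what such debug statements could look like. The helper names (profile_debug, run_logged) and the log path are hypothetical, not taken from any stock .profile.

```shell
#!/bin/sh
# Hypothetical log location; override PROFILE_DEBUG_LOG to write elsewhere.
PROFILE_DEBUG_LOG="${PROFILE_DEBUG_LOG:-/tmp/profile-debug.log}"

# Append a timestamped line so each run of .profile can be traced afterwards.
profile_debug() {
    echo "$(date '+%H:%M:%S') pid=$$ $*" >> "$PROFILE_DEBUG_LOG"
}

# Run a command, record its exit status, and always return 0 so that a
# failing command cannot propagate its status to whatever sourced .profile.
run_logged() {
    "$@"
    rc=$?
    profile_debug "command '$*' exited with status $rc"
    return 0
}
```

In .profile, a line such as "profile_debug entered" at the top and "run_logged <your script>" around each existing command would show when the file runs and what each command returned.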
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 14, 2014, 10:31:21 AM
Thanks for the replies.

Sorry about the Gentoo confusion, for some reason I had it in my head that Core was based on a stripped down version of Gentoo but I think that's actually the System Rescue CD which was the first distro I investigated for this project.

In response to the questions:

I have added a regular user as well as the root user. The tc user still exists, as do the other default users.

My bootsync script has very minor modifications to do with setting the hostname based on the MAC address, updating resolv.conf and /etc/issue, and copying some files into place (I haven't incorporated these files into the base image because they change from time to time, so I keep them in /boot/opt on the host OS where I can use Puppet to update them, and then the updated versions get put into place during the boot process of Microcore - it's stuff like an authorized_keys file for ssh,  etc).

The bootlocal script is somewhat extensively modified as it loads kernel modules and drivers for some custom hardware like a small LCD screen which displays the unit's status to the users at the remote site, starts iptables with a predefined set of rules, brings up an openvpn tunnel so that I can remotely access the unit and then starts sshd, activates the local volume groups and performs an automated fsck on the main OS's volumes, etc. - but I don't understand how any of those actions would cause the system to log in as root?

The .profile file contains only two lines - one runs a script displaying the health of the box upon user login (with data gathered from smartctl etc.) and the other echoes an information statement to the user. In any case, it's my understanding that .profile should not be executed until a user logs in, so that specific file shouldn't be able to influence the outcome of a boot attempt, should it?

The main thing I do not understand is the inconsistent behavior. This morning when I began testing, the Intel system booted just as expected - to a login prompt. On every subsequent boot attempt it has dumped me to a root shell.  Yesterday it booted to a root shell eight or nine times in a row and never booted correctly.  Each time, if I allow it to boot to the root shell and then simply reboot it again, it boots correctly (to a login prompt).

One thing I wonder - since I install Microcore on the target system via Puppet, as well as maintaining the state, ownership, and permissions of the config files and scripts in /boot/opt, is it possible that some file there has an incorrect permission or ownership which gets restored on the first Microcore boot attempt so that it subsequently works correctly, until the next time I boot to Scientific where Puppet runs and breaks it again?

This doesn't seem likely since the noautologin option appears to be passed to Microcore correctly (see my first post) and I don't think that scripts or files inside the OS would affect how the OS is loaded?
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: gerald_clark on November 14, 2014, 10:42:58 AM
Since bootsync.sh and bootlocal.sh run before /etc/inittab is scanned, I suspect your problem is there.
Something is not running correctly and you land in a shell before login is executed.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 14, 2014, 03:40:59 PM
Thanks for the suggestion.

To cut a long story short, I spent all day today experimenting with different contents of bootsync.sh and bootlocal.sh.  The issue apparently stems from something in bootlocal, but I have been unable to determine what after approximately 75 reboots with different parts of the script commented out.  It seems like something as simple as copying a file somewhere and altering its permissions triggers the problem.

I am also completely unable to reproduce the issue on the system with an MSI motherboard which is just nonsensical to me.  Same CPU, same specs, same model hard disk, precisely the same operating system with exactly the same settings (I even reimaged both systems today to be sure, although they're both managed entirely by Puppet so everything should be the same in any case), identical bootlocal/bootsync scripts - and one of them behaves as expected while the other does not. The only difference is the motherboard. 

I don't get it :/
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: gerald_clark on November 14, 2014, 03:46:52 PM
Run memtest86+ overnight on the problem machine.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 14, 2014, 07:55:29 PM
Ok, after more hair-pulling, I am 90% sure I have found the issue - your comment about bootsync and bootlocal getting run before /etc/inittab pointed me in the right direction.

There's nothing in either bootsync or bootlocal that is causing the problem. Instead, I reasoned from your statement that the system was actually *logging in* as the root user to process bootsync and bootlocal, and then processing inittab. There is a script in .profile for the root user that outputs information about the hardware, including some parsed info from dmidecode about the BIOS. On the Intel system, although the output seems correct, for some reason this script has a nonzero return code - so when the root user logs in, this script runs and, although it doesn't seem to have a problem, the error return code apparently causes the boot process to fail at that point.

Commenting out the script from root's .profile caused three consecutive normal boots. I will test more thoroughly on Monday and report back - many thanks for your suggestions (and I do realize that you initially pointed me toward .profile in the first place, I just couldn't see how creating a formatted output of some system variables would break boot at the time).
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: gerald_clark on November 14, 2014, 08:14:21 PM
Glad to hear you have it working, and good luck with your project.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 17, 2014, 11:11:38 AM
Well, unfortunately it turns out that was not the issue.

After a few more hours of experimenting, I believe I have now resolved this by the simple expedient of editing /sbin/autologin to never automatically log in and always just do a getty 38400 tty1.
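
The stock /sbin/autologin is not shown in this thread, so the following is only a sketch of a replacement that matches the description above. The helper name, the /sbin/getty path, and the use of exec are assumptions; the getty arguments are the ones quoted in this post.

```shell
#!/bin/sh
# Write a replacement autologin script that always hands tty1 to getty.
# The target path is passed in so the result can be inspected before use.
write_getty_only_autologin() {
    target="$1"
    cat > "$target" <<'EOF'
#!/bin/sh
# Never log in automatically; always present a login prompt on tty1.
exec /sbin/getty 38400 tty1
EOF
    chmod 755 "$target"
}
```

Writing to a scratch path first (for example "write_getty_only_autologin /tmp/autologin.new") allows a comparison against the existing script before anything is replaced.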

I don't understand why the "noautologin" option works on the second attempt to boot Microcore but never on the first attempt - the GRUB configuration isn't changing. The issue isn't related to defective hardware, as any of our Intel systems will exhibit this behavior. The fact that the problem doesn't happen 100% of the time leads me to believe there is some kind of race condition during boot, but since I lack the time and knowledge to trace that down, making the system incapable of automatically logging in seems to have worked.
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: gerald_clark on November 17, 2014, 11:42:27 AM
If you think you are experiencing race conditions, consider this.
Bootlocal.sh runs in the background. You cannot be sure it has finished running before the login occurs.
Move everything in bootlocal.sh to bootsync.sh and see if the problem disappears.
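
The difference can be illustrated with a small stand-alone demo. Everything here is hypothetical (the marker path, the function names, the one-second delay); it only mimics a backgrounded script and is not the actual boot code.

```shell
#!/bin/sh
# Hypothetical marker file that the stand-in script writes when it is done.
MARKER="${MARKER:-/tmp/bootlocal-demo.done}"

# Stand-in for bootlocal.sh: slow setup work that ends by writing the marker.
fake_bootlocal() {
    sleep 1
    echo done > "$MARKER"
}

# Report whether the stand-in has finished yet.
marker_state() {
    if [ -e "$MARKER" ]; then echo "finished"; else echo "still running"; fi
}

demo_race() {
    rm -f "$MARKER"
    fake_bootlocal &      # backgrounded: the caller continues immediately
    echo "right after launch: $(marker_state)"
    wait                  # block until the background job has completed
    echo "after wait: $(marker_state)"
}
```

Running demo_race will normally report "still running" right after launch, which is the window in which a login could start before the backgrounded work is complete. Commands placed in bootsync.sh run in the foreground and do not have that window.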
Title: Re: Noautologin apparently doesn't work sometimes?
Post by: adrianallen on November 17, 2014, 12:01:41 PM
Yeah, I tried that also and got inconsistent results - it seemed to reduce the problem, but since booting to Scientific, setting the next boot target to Microcore, then rebooting again takes about 5 minutes per cycle (there are some PCI cards in these boxes that take a while to init on boot) I'm never sure if I just got lucky a few times in a row or if the problem is actually behaving differently.
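
The "did I just get lucky" question can be put into rough numbers. The sketch below assumes each boot fails independently with the same probability, which may well not hold here; the function name is hypothetical.

```shell
#!/bin/sh
# Probability of n clean boots in a row if each boot fails with probability p.
# Assumes independent boots; prints (1 - p)^n with four decimal places.
clean_streak_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}
```

For example, "clean_streak_probability 0.2 10" estimates how often ten clean boots in a row would occur by chance if the problem still hit one boot in five.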

In this case, although my fix is hacky, I don't *ever* want the box to autologin, so this works out fine - I performed ten consecutive boots and never saw the issue.