While setting up a raw disk image for use in a QEMU-based virtual machine, I became frustrated because QEMU would load GRUB, but GRUB would not then show a menu of OSes to boot into. I've come to the conclusion that GRUB must not be able to locate the grub.cfg file, which leads me to believe it has encoded something incorrectly in the post-MBR gap. Do any tools exist to inspect the contents of this gap?
Here's how I'm installing GRUB into the VM image:
# Disk Image
fallocate -l $((4*1024*1024*1024)) "$file"
DEV=$(sudo losetup --show --nooverlap --find "$file")
# Partition Table
sudo parted "$DEV" mklabel msdos
sudo parted "$DEV" mkpart primary fat16 1MiB 101MiB
sudo parted "$DEV" mkpart primary ext4 102MiB 100%
sudo parted "$DEV" set 1 boot on
sudo mkfs.vfat "${DEV}p1"
sudo mkfs.ext4 -E lazy_journal_init=1 -E lazy_itable_init=1 -E discard "${DEV}p2"
# Mounting, installing base packages, configuration, etc..
# ...
# Bootloader
sudo mkdir "$mountpoint/boot/grub"
sudo install "$grub_default_file" "$mountpoint/etc/default/grub"
sudo arch-chroot "$mountpoint" grub-install --boot-directory="/boot/grub" --target=i386-pc "$DEV"
sudo arch-chroot "$mountpoint" grub-mkconfig -o "/boot/grub/grub.cfg"
The $grub_default_file has only minor modifications to turn on serial output so that I can look around from QEMU's serial console:
GRUB_CMDLINE_LINUX="quiet console=tty0 console=ttyS0,38400n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=38400 --unit=0 --word=8 --parity=no --stop=1"
Here's what I've verified so far:
- The MBR contains the string "GRUB", suggesting that grub has installed itself onto the disk image
- The partition table has a sufficiently large post-MBR gap
- The first partition has the boot flag set
- The first partition is a vfat filesystem
- The first partition contains /grub/grub.cfg and related files
- The grub.cfg file contains the correct UUIDs for the first and second partition
The only link I can't really verify is grub locating the partition containing the configuration file. Maybe I set the wrong boot flag. Maybe I chose the wrong filesystem type. Maybe grub encoded the location of that partition incorrectly in the MBR/post-MBR gap. It's quite hard to debug.
$ sudo dd if=zonemanager bs=$((2048*512)) count=1 | strings | grep -i grub
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00742499 s, 141 MB/s
GRUB
$ sudo parted /dev/loop1
GNU Parted 3.2
Using /dev/loop1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) unit s
(parted) p
Model: Loopback device (loopback)
Disk /dev/loop1: 8388608s
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number Start End Size Type File system Flags
1 2048s 206847s 204800s primary fat16 boot, lba
2 208896s 8388607s 8179712s primary ext4
$ sudo file -s /dev/loop1
/dev/loop1: DOS/MBR boot sector
$ sudo file -s /dev/loop1p1
/dev/loop1p1: DOS/MBR boot sector, code offset 0x3c+2, OEM-ID "mkfs.fat", sectors/cluster 4, reserved sectors 4, root entries )
$ sudo lsblk -f
NAME FSTYPE LABEL UUID FSAVAIL FSUSE% MOUNTPOINT
loop1
├─loop1p1 vfat 58D5-B48F 45.1M 55% /mnt/boot
└─loop1p2 ext4 0014f737-33b7-4dba-be4a-2b186e2e46a0 2.1G 39% /mnt
$ grep 58D5-B48F /mnt/boot/grub/grub.cfg
search --no-floppy --fs-uuid --set=root 58D5-B48F
search --no-floppy --fs-uuid --set=root 58D5-B48F
search --no-floppy --fs-uuid --set=root 58D5-B48F
search --no-floppy --fs-uuid --set=root 58D5-B48F
search --no-floppy --fs-uuid --set=root 58D5-B48F
search --no-floppy --fs-uuid --set=root 58D5-B48F
$ grep 0014f737-33b7-4dba-be4a-2b186e2e46a0 /mnt/boot/grub/grub.cfg
menuentry 'Arch Linux' --class arch --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-0014f737-33b7-4dba-be4a-2b186e2e46a0' {
linux /vmlinuz-linux root=UUID=0014f737-33b7-4dba-be4a-2b186e2e46a0 rw quiet console=tty0 console=ttyS0,38400n8 quiet
submenu 'Advanced options for Arch Linux' $menuentry_id_option 'gnulinux-advanced-0014f737-33b7-4dba-be4a-2b186e2e46a0' {
menuentry 'Arch Linux, with Linux linux' --class arch --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-linux-advanced-0014f737-33b7-4{
linux /vmlinuz-linux root=UUID=0014f737-33b7-4dba-be4a-2b186e2e46a0 rw quiet console=tty0 console=ttyS0,38400n8 quiet
menuentry 'Arch Linux, with Linux linux (fallback initramfs)' --class arch --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-linux-fal{
linux /vmlinuz-linux root=UUID=0014f737-33b7-4dba-be4a-2b186e2e46a0 rw quiet console=tty0 console=ttyS0,38400n8 quiet
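On the question of inspecting the post-MBR gap: the raw bytes can at least be pulled out with standard tools. The following is a minimal sketch, not a dedicated GRUB inspector. It assumes 512-byte sectors and a first partition starting at sector 2048, as in the parted output above, and the function names are made up for illustration. Note that the earlier dd command reads the whole first MiB, so the "GRUB" match there does not by itself show whether the string sits in the MBR or in the gap. Also, as far as I know, most of the image GRUB embeds in the gap is compressed, so strings may reveal little beyond confirming that something is there.

```shell
# Print the two-byte boot signature at offset 510 of the image as hex.
# A valid MBR ends in "55aa".
mbr_signature() {
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Write the raw post-MBR gap to stdout: sector 1 up to (not including)
# the first partition. The default start sector of 2048 is an assumption
# taken from the parted output; pass another value as the second argument.
dump_mbr_gap() {
    dd if="$1" bs=512 skip=1 count=$(( ${2:-2048} - 1 )) 2>/dev/null
}

# Count the non-zero bytes in the gap. Zero means nothing was embedded there.
gap_nonzero_bytes() {
    dump_mbr_gap "$@" | LC_ALL=C tr -d '\000' | wc -c
}

# Example usage against the image from the question:
#   mbr_signature zonemanager
#   gap_nonzero_bytes zonemanager
#   dump_mbr_gap zonemanager | hexdump -C | less
```

Reading from the image file rather than the loop device avoids needing sudo, provided the file itself is readable.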
Mistake #1: grub-install --boot-directory should have been /boot and not /boot/grub.
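The reason this matters is that grub-install puts its files in a grub directory underneath whatever --boot-directory names, so passing /boot/grub makes it install into /boot/grub/grub while grub-mkconfig was writing to /boot/grub/grub.cfg. A small sketch of that path logic and the corrected commands follows; grub_dir_for and install_grub are hypothetical helper names, and $mountpoint and $DEV are the variables from the install script above.

```shell
# grub-install places its files in "<boot-directory>/grub", so the value
# passed to --boot-directory is the parent of the grub directory.
grub_dir_for() {
    # ${1%/} drops a single trailing slash, if any
    printf '%s/grub\n' "${1%/}"
}

grub_dir_for /boot/grub   # prints /boot/grub/grub (what the original command used)
grub_dir_for /boot        # prints /boot/grub      (where grub-mkconfig wrote grub.cfg)

# Corrected invocation. It is wrapped in a function so that nothing is
# installed until install_grub is called explicitly.
install_grub() {
    sudo arch-chroot "$mountpoint" grub-install --boot-directory=/boot --target=i386-pc "$DEV"
    sudo arch-chroot "$mountpoint" grub-mkconfig -o /boot/grub/grub.cfg
}
```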
Current Investigation #1: I compared the grub.cfg to the one from a known working QEMU VM. Other than the UUIDs of the devices, there are only two differences:
- set root=(hd0,1)
- Extra arguments to search
WORKING
set root='hd0,msdos1'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-ieee1275='ieee1275//disk@0,msdos1' --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 0959-F5DD
NON-WORKING
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root ECB4-BE7A
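A comparison like the one above is easier to read if the UUIDs are masked first, so that diff reports only structural differences. This is a rough sketch under the assumption that the only UUID shapes present are the ext4-style 8-4-4-4-12 form and the FAT-style XXXX-XXXX form; the function names and the example file names are hypothetical.

```shell
# Replace filesystem UUIDs with a placeholder. The long (ext4-style)
# pattern runs first so the short (FAT-style) one cannot match inside it.
normalize_uuids() {
    sed -E \
        -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<UUID>/g' \
        -e 's/[0-9A-F]{4}-[0-9A-F]{4}/<UUID>/g' \
        "$1"
}

# Diff two grub.cfg files with UUIDs masked.
# Returns 0 if they are otherwise identical, non-zero if they differ.
compare_grub_cfgs() {
    left=$(mktemp)
    right=$(mktemp)
    normalize_uuids "$1" > "$left"
    normalize_uuids "$2" > "$right"
    rc=0
    diff "$left" "$right" || rc=$?
    rm -f "$left" "$right"
    return "$rc"
}

# Example usage (file names are placeholders):
#   compare_grub_cfgs working-grub.cfg broken-grub.cfg
```

The FAT-style pattern is deliberately loose and could mask an unrelated token that happens to look like XXXX-XXXX, so the output is a starting point rather than a definitive diff.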
Current investigation #2: When in the grub shell, I can locate and boot into early userspace.
GNU GRUB version 2.02
Minimal BASH-like line editing is supported. For the first word, TAB
lists possible command completions. Anywhere else TAB lists possible
device or file completions.
grub> set pager=1
grub> echo $feature_platform_search_hint
y
grub> ls
(hd0) (hd0,msdos2) (hd0,msdos1) (fd0)
grub> ls (hd0,msdos1)/
vmlinuz-linux initramfs-linux.img initramfs-linux-fallback.img grub/
grub> set root=(hd0,msdos1)
grub> linux /vmlinuz-linux root=UUID=0014f737-33b7-4dba-be4a-2b186e2e46a0 rw quiet console=tty0 console=ttyS0,38400n8 quiet
grub> initrd /initramfs-linux.img
grub> boot
Starting version 242.32-2-arch
Further, using the configfile command actually loads the menu!
grub> configfile /grub/grub.cfg
So this leads me to believe that grub can't find the config file for some reason.
Mistake #2: Boot the right damn image file.
To eliminate a variable between the working and non-working VMs, I had converted the non-working VM from a raw disk image to a qcow2 image. I had been booting that image for the last few hours because I never reverted the systemd unit. Everything after "Mistake #1" was a red herring. I'm going to leave it up, though, as it's a good learning aid.