Diskless Linux

by Charles M. "Chip" Coldwell

Last updated: Sep 17, 2002
NEW: RPM packages for RedHat Linux 7.3!

Overview

Why Diskless?

I first became interested in building a diskless Linux box when I was involved with a project to build a 64-node, 128 processor Beowulf cluster. The thought of 64 independent disk drives spinning away and consuming a few dozen watts apiece was bad enough, but I was going to be in charge of administering this beast, and it has always been my philosophy that the less there is, the less there is to go wrong. Getting rid of as much hardware as possible and keeping the node configurations as identical as possible was my goal.

It seemed to me that the best way to run a diskless node was to put the root file system (which must be mounted read-write) in a RAM disk, and to use NFS to mount the remaining file systems read-only. Nonetheless, after extensive searching I was not able to find any single article that completely described this technique, although all the information in this article was gathered from various disparate sources on the Internet, in the Linux kernel documentation and in print publications. I have tried to provide pointers to the original sources where I could.

In fact, what my searches revealed to me is that there is a plethora of different ways to network boot, and once booted to run diskless. There are even compelling reasons for network booting if you do have local hard drives. The following table shows some of the possibilities.

Function              this article describes   some alternatives
network boot ROM      PXE                      Etherboot, Netboot
network boot loader   pxelinux                 bpbatch
root file system      RAM disk                 NFS root, local hard drive

If you choose one option from each row of the table above, there are at least eighteen different combinations. Even though only a subset of these will actually work (nine of them, I think), that is still far more than I want to cover in this article, so I have settled on the options shown in the second column. I'll try to provide pointers to the others in the resources section below. An excellent article by Richard Ferri, "Remote Linux Explained" which appeared in the January, 2002 issue of the Linux Journal, covers some of the other possibilities, notably NFS roots and Etherboot ROMs.

Most of the discussion in this article is aimed at the Beowulf community (e.g. I refer to the network booting computers as "nodes" in what follows), although the technique is described in enough generality that I hope it will find wider applications. One of my personal favorites is to turn old PCs into X terminals, which can usually be done for the cost of a NIC with boot ROMs and some additional memory for a RAM disk to hold the root file system.

Network booting

If your box has no disk drive, then it is going to have to use some other device as the source of the boot loader (e.g. "LILO" or "GRUB") and bootable image (e.g. "vmlinuz"). In the Beowulf biz, folks have tried lots of alternatives. For example, Scyld's Beowulf product supports booting from floppy disk, CDROM, Linux BIOS, flash disk or PXE (see below) in addition to booting from the hard disk we are trying to eliminate. But properly speaking, diskless computers lack any kind of local persistent storage including hard disk drives, floppies and CD-ROMs. This leaves network booting as the only alternative.

Back in the Good Old Days of low-numbered RFCs, companies like NCD worked out that the best way to boot a diskless machine (in the case of NCD, an X-Terminal) was to have it run BOOTP to obtain an IP address and the name of a boot image file from a BOOTP server, then use TFTP to download the boot image from a possibly different server and launch itself from there. Not a whole lot has changed since then except that BOOTP has been expanded and renamed DHCP, and the process of getting a bootable image to the client is done in two steps on PCs: first a boot loader is downloaded and started, which will then download and start the operating system image.

In the case of a PC, the important thing is that the NIC must be able to identify itself as a bootable device to the motherboard BIOS, and if chosen as the boot device, it must be able to download the boot loader and start it. This is not typically something that a run-of-the-mill NIC can do (see the resources section below for a list of some that do). In particular, if you want to boot from the network, you must either choose a NIC that has some sort of boot ROM built in, burn your own boot ROM and install it on the NIC, or even boot from a floppy or a local hard disk which contains the boot ROM image.

This article discusses network booting using a PXE boot ROM. PXE stands for "Preboot eXecution Environment", and is a result of Intel's "Wired for Management" initiative. It is a boot ROM standard that is becoming increasingly popular, and several vendors, including Intel and 3Com, are offering products which implement it on their NICs. There is also a project to develop an open-source PXE implementation called NILO, for "Network Interface Loader".

The way PXE works is this: if the NIC is chosen by the motherboard BIOS as the boot device, it broadcasts DHCP requests and waits for a response from a server that contains PXE extensions. If it receives such a response, then the NIC assumes that the boot loader file specified in the response can run under PXE, and it will download the boot loader and start it. Before transferring control to the boot loader, the PXE ROM will also put the network parameters from the DHCP response into a known location in memory where the boot loader has access to them. The boot loader will use this information to start a second round of TFTP to download a bootable image from the server and start it. If the bootable image is a Linux kernel, then you have successfully booted from the network.

Kernel-Level IP Configuration

Diskless computers are creatures of the network they live on, and so one of the most critical steps of the boot process is to get the network interfaces configured. The Linux kernel provides a facility called "Kernel-level configuration" which allows the kernel to configure its network interface at a very early stage of the boot process, even before the root file system is mounted. This is essential if you want to use an NFS root, since the kernel will not have access to any configuration information stored in files before any file systems have been mounted. If you want to use a RAM disk root instead, kernel-level configuration is still very convenient but not absolutely necessary.

There are actually four different ways the kernel can configure its interface. Three of them are the familiar autoconfiguration protocols, namely DHCP, BOOTP and RARP. The fourth is to have the boot loader pass the IP parameters directly to the kernel as a kernel parameter. It should be noted that the kernel knows absolutely nothing about PXE, and therefore it does not have access to the data structures in memory where the PXE ROM stashed its network configuration parameters from the DHCP response (those were lost when the boot loader passed control to the kernel proper). Therefore, if you choose an autoconfiguration protocol such as DHCP for kernel-level configuration, it will cause the kernel to start a second round of DHCP requests (which should get the same response as the first since the MAC address is the same for the kernel as it was for the PXE ROM).

RAM disk root file system

After the kernel-level IP configuration is finished, the next thing the kernel is going to do is look for a root file system. There are basically two schemes that people use to avoid using local storage for the root file system: mounting the root file system over NFS, or keeping it in a RAM disk. Of these two I greatly prefer the latter, although there are, of course, also disadvantages to using a RAM disk.

The way you get the kernel to use a RAM disk root file system is to exploit the "initial RAM disk" feature of the kernel. This feature is fully described in /usr/src/linux/Documentation/initrd.txt. Briefly, the way it works is the boot loader hands the kernel a compressed file system image which the kernel expands and mounts as root. Then the kernel looks for a script called /linuxrc and will run it if it exists. This script would normally be used to load kernel modules, although it could be used for anything. Once the script finishes, the kernel would unmount the RAM disk and then proceed with the normal boot up process. If this script is missing, the RAM disk will remain mounted as root and the kernel will continue with the boot up procedure from there. This is how you can get your box to run out of a RAM disk: if there is no /linuxrc script in the initial RAM disk then it will become a permanent RAM disk.

Implementation

OK, so much for the overview, let's see how this is actually done in practice. The examples below assume a RedHat-like distribution, but it should be clear from the text how to apply the technique to other distributions (most of the differences are in the startup scripts run on boot and found in /etc and subdirectories thereof).

To start with, you will need a server. The server doesn't boot from the network, but it provides services that allow other computers to do so. These services are a PXE-extended DHCP and a TFTP that understands the TSIZE option.

PXE-extended DHCP

The best free implementation available is provided by the Internet Software Consortium. You will need to get version 3.0 or later. The latest version is probably the best, and installing it is the usual routine: download the compressed tarball, uncompress and untar it, run the configure script, run make and then make install. dhcpd runs as a stand-alone daemon; you do not configure your meta-daemon (inetd or xinetd) to run it. You do have to configure your startup scripts to launch it at boot up; on RedHat systems this is traditionally done by the script /etc/rc.d/init.d/dhcpd. If you already have this script (because, e.g. it came with your standard distro) just make sure it launches the daemon you just compiled and that the script itself is run on boot up (try chkconfig(8)).

In addition, you must configure dhcpd to use the PXE extensions and pass the boot loader to the clients. Here's an example /etc/dhcpd.conf configuration file that does exactly this:

# DHCP configuration file for DHCP ISC 3.0

ddns-update-style none;

# Definition of PXE-specific options
# Code 1: Multicast IP address of boot file server
# Code 2: UDP port that client should monitor for MTFTP responses
# Code 3: UDP port that MTFTP servers are using to listen for MTFTP requests
# Code 4: Number of seconds a client must listen for activity before trying
#         to start a new MTFTP transfer
# Code 5: Number of seconds a client must listen before trying to restart
#         a MTFTP transfer

option space PXE;
option PXE.mtftp-ip               code 1 = ip-address;  
option PXE.mtftp-cport            code 2 = unsigned integer 16;
option PXE.mtftp-sport            code 3 = unsigned integer 16;
option PXE.mtftp-tmout            code 4 = unsigned integer 8;
option PXE.mtftp-delay            code 5 = unsigned integer 8;
option PXE.discovery-control      code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr   code 7 = ip-address;

subnet 192.168.1.0 netmask 255.255.255.0 {

  class "pxeclients" {
    match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
    option vendor-class-identifier "PXEClient";
    vendor-option-space PXE;

    # At least one of the vendor-specific PXE options must be set in
    # order for the client boot ROMs to realize that we are a PXE-compliant
    # server.  We set the MCAST IP address to 0.0.0.0 to tell the boot ROM
    # that we can't provide multicast TFTP (address 0.0.0.0 means no
    # address).

    option PXE.mtftp-ip 0.0.0.0;

    # This is the name of the file the boot ROMs should download.
    filename "pxelinux.0";
    # This is the name of the server they should get it from.
    next-server 192.168.1.1;
  }

  pool {
    max-lease-time 86400;
    default-lease-time 86400;
    range 192.168.1.2 192.168.1.254;
    deny unknown clients;
  }

  host node1 {
    hardware ethernet fe:ed:fa:ce:de:ad;
    fixed-address 192.168.1.2;
  }

  host node2 {
    hardware ethernet be:ef:fe:ed:fa:ce;
    fixed-address 192.168.1.3;
  }

  [...]
  
}
The above configuration assumes that you want your computers always to come up with the same IP address. If you are completely indifferent (e.g. you have a cluster of identical nodes and you just don't care which is which), you can replace the line "deny unknown clients" with "allow unknown clients" and remove all of the "host" entries. In this case, you should make sure that the "range" of IP addresses is somewhat larger than the number of nodes, since nodes may go down without releasing their leases, and the server won't reap them again until they expire (after one day in the configuration above). In addition, all the nodes must run a daemon to manage the network interface (e.g. pump) so that their DHCP leases will be renewed before they expire.
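
Writing one host entry per node by hand gets tedious on a large cluster. Here is a minimal sketch of a shell function that turns a list of "name MAC IP" lines into host entries in the format used above. The function name gen_hosts and the input format are my own assumptions for illustration, not part of dhcpd.

```shell
#!/bin/sh
# gen_hosts (hypothetical helper): read "name mac ip" triples from stdin
# and print matching dhcpd.conf host entries on stdout.
gen_hosts() {
    while read -r name mac ip; do
        # Skip blank lines and comment lines in the input.
        case "$name" in
            ''|'#'*) continue ;;
        esac
        printf '  host %s {\n' "$name"
        printf '    hardware ethernet %s;\n' "$mac"
        printf '    fixed-address %s;\n' "$ip"
        printf '  }\n\n'
    done
}

# Example input: the two nodes from the configuration file above.
gen_hosts <<EOF
node1 fe:ed:fa:ce:de:ad 192.168.1.2
node2 be:ef:fe:ed:fa:ce 192.168.1.3
EOF
```

The output can be pasted into the subnet declaration in place of the hand-written host entries.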

TFTP server that understands the TSIZE option

Pxelinux (our preferred boot loader, see the next section) requires a TFTP server that understands the TSIZE ("transfer size") option. The TSIZE option is defined by RFC2349; it is basically a mechanism which allows pxelinux to determine how large a file is before it gets transferred.

tftp-hpa, a modified version of the standard BSD TFTP daemon, works just fine (the "-hpa" suffix stands for H. Peter Anvin, who is also the author of the syslinux and pxelinux programs). The latest version is probably the best. Installing it is the usual routine: download the compressed tarball, uncompress and untar it, run the configure script, run make and then make install. In addition, in.tftpd is always run by a meta-daemon, either inetd (e.g. RedHat Linux versions 6.2 and earlier) or xinetd (e.g. RedHat Linux versions 7.0 and later). You will need to make sure that your meta-daemon is configured to provide the TFTP service and to use the right version of the in.tftpd daemon (i.e. the one you just compiled). The relevant files are /etc/inetd.conf for inetd, and /etc/xinetd.d/tftp for xinetd. Whichever meta-daemon you use, it should be configured to invoke the TFTP daemon as follows

in.tftpd -s /tftpboot
which tells in.tftpd to look in the /tftpboot directory for the files that clients try to download.
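
For xinetd, the invocation above might be set up roughly as in the following sketch. The helper name write_tftp_xinetd is hypothetical, and the path /usr/sbin/in.tftpd is an assumption: make install may well have put the daemon somewhere else on your system. The "wait = yes" setting matches the behaviour described in the section on multicast TFTP, where the meta-daemon waits for the parent tftpd process to exit.

```shell
#!/bin/sh
# write_tftp_xinetd (hypothetical helper): write an xinetd service entry
# for in.tftpd to the file given as $1.
# Assumption: the daemon was installed as /usr/sbin/in.tftpd; adjust the
# "server" line to match your system.
write_tftp_xinetd() {
    cat > "$1" <<'EOF'
service tftp
{
    socket_type = dgram
    protocol    = udp
    wait        = yes
    user        = root
    server      = /usr/sbin/in.tftpd
    server_args = -s /tftpboot
    disable     = no
}
EOF
}

# Example (as root): write_tftp_xinetd /etc/xinetd.d/tftp
# and then tell xinetd to reread its configuration.
```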

The boot loader: pxelinux

The boot loader of choice for booting Linux from a PXE NIC is pxelinux, a part of H. Peter Anvin's Syslinux project. You can download the binary, pxelinux.0, or you can build it from the assembly sources if you have the Netwide Assembler, better known as nasm. Either way, since the boot loader is going to be downloaded by the clients using TFTP, you'd better put it in the directory /tftpboot.

The PXE boot ROM passes control to pxelinux after it has already obtained an IP address for itself and the boot server (otherwise, how could it have downloaded pxelinux in the first place?). After it is started by the boot ROM, pxelinux has access to these values in a data structure left in a known location in memory by the boot ROM. Therefore, the first thing pxelinux tries to do is download a configuration file corresponding to the boot client's IP address from the boot server. This configuration file contains the name of the boot image (i.e. the Linux kernel) that pxelinux should download and any kernel parameters that pxelinux should give to it. In addition, if one of these kernel parameters specifies an initial RAM disk for the kernel, pxelinux will download the compressed file system image before starting the kernel.

The pxelinux configuration file for a boot client is found in the directory /tftpboot/pxelinux.cfg and given a name which is the client's IP address in upper-case hexadecimal. If the file does not exist when pxelinux tries to download it, it will remove the last hexadecimal digit and try again, repeating until it runs out of digits, and finally it will try a file named default. For example, if the client was assigned address 192.168.1.2, then it will try to download the following configuration files from the boot server

/tftpboot/pxelinux.cfg/C0A80102
/tftpboot/pxelinux.cfg/C0A8010
/tftpboot/pxelinux.cfg/C0A801
/tftpboot/pxelinux.cfg/C0A80
/tftpboot/pxelinux.cfg/C0A8
/tftpboot/pxelinux.cfg/C0A
/tftpboot/pxelinux.cfg/C0
/tftpboot/pxelinux.cfg/C
/tftpboot/pxelinux.cfg/default
stopping at the first one that succeeds, and giving up if the last one fails. In the case of a network of identical nodes, this allows you to set up a single configuration file for the whole lot of them.
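
The full file name for a given node is just the four octets of its address, each written as two hexadecimal digits. A small sketch of the conversion follows; the function name ip_to_hex is hypothetical and the function is not part of pxelinux.

```shell
#!/bin/sh
# ip_to_hex (hypothetical helper): print a dotted-quad IPv4 address as
# eight upper-case hexadecimal digits, the form used for the name of a
# client's pxelinux configuration file.
ip_to_hex() {
    # Split the address on dots into four positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 192.168.1.2    # prints C0A80102
```

Write the octets without leading zeros (1, not 01), since printf may otherwise read them as octal numbers.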

The contents of the pxelinux configuration files look something like the following:

DEFAULT linux
APPEND initrd=rootfs.gz root=/dev/ram rw ip=192.168.1.2:192.168.1.1:192.168.1.1:255.255.255.0:node1:eth0:off
The DEFAULT line gives the name of the bootable image file, in this case the client will expect the file /tftpboot/linux on the server to contain a compressed linux kernel image. The APPEND line is a list of parameters passed to the kernel when it boots. The example above deserves some scrutiny.
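
Taking the parameters in turn: initrd= names the compressed file system image that pxelinux downloads along with the kernel, root=/dev/ram makes the RAM disk the root file system, rw mounts it read-write, and ip= carries the network configuration. According to the kernel's Documentation/nfsroot.txt, the fields of ip= are, in order, client address, server address, gateway address, netmask, host name, device and autoconfiguration method. The sketch below builds the same two lines from those fields; the function name pxe_config is hypothetical.

```shell
#!/bin/sh
# pxe_config (hypothetical helper): print a pxelinux configuration for one
# node.  Arguments: client IP, server IP, gateway IP, netmask, host name.
# The kernel and RAM disk names (linux, rootfs.gz) and the device (eth0)
# are taken from the example above.
pxe_config() {
    client=$1 server=$2 gateway=$3 netmask=$4 host=$5
    echo "DEFAULT linux"
    # ip= fields, in order: client, server, gateway, netmask, host name,
    # device, autoconfiguration method ("off" means use these values).
    echo "APPEND initrd=rootfs.gz root=/dev/ram rw ip=$client:$server:$gateway:$netmask:$host:eth0:off"
}

pxe_config 192.168.1.2 192.168.1.1 192.168.1.1 255.255.255.0 node1
```

Replacing the whole value with ip=dhcp asks the kernel to run DHCP itself instead, provided that protocol was compiled into the kernel.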

Digression on NFS root

If you prefer to keep your root file systems (one per node) on an NFS server instead of serving out the same RAM disk root file system to all nodes, then all you need to do is modify the kernel parameters in the APPEND line in the pxelinux configuration file (and make sure that the NFS server is exporting the volumes, of course). Something like
APPEND root=/dev/nfs nfsroot=192.168.1.1:/export/root/node1,rw ip=[...]
ought to do it (the ip= parameter is just the same as above). Warning: make sure the NFS server isn't "root squashing" on this volume or else it will be worthless as a root file system.
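
On a Linux NFS server, root squashing is turned off per export with the no_root_squash option in /etc/exports. A minimal sketch, assuming the directory layout from the APPEND line above (the function name exports_line is hypothetical):

```shell
#!/bin/sh
# exports_line (hypothetical helper): print an /etc/exports entry that
# exports one node's root directory read-write with root squashing off.
# Arguments: directory to export, client host name or address.
exports_line() {
    printf '%s %s(rw,no_root_squash)\n' "$1" "$2"
}

exports_line /export/root/node1 192.168.1.2
# prints: /export/root/node1 192.168.1.2(rw,no_root_squash)
```

Append the line to /etc/exports on the server and run exportfs -ra to make it take effect.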

Building the kernel

There are a number of things you must do when building the kernel that will run on the diskless nodes. First, it's probably a good idea for the network device driver to be compiled into the kernel instead of being a loadable module. This is a prerequisite for using kernel-level IP configuration, since it happens before any file systems are mounted and therefore there is no way for the kernel to load a module that is stored on disk. Kernel-level configuration is itself required for an NFS root file system, since the kernel has no way of mounting any file systems before the network interface is up, and in fact menuconfig will not even allow you to configure a kernel with an NFS root unless you configure it for kernel-level IP configuration first. On the other hand, if you use a RAM disk root file system, it is possible to put a loadable module in the RAM disk and then load it before the network interface has to come up, but personally, I find it more convenient to use the kernel-level configuration facility.

To configure the kernel-level configuration facility, the "menuconfig" option to select is in the "Networking options" submenu and is called "IP: kernel-level configuration support". You also get to select which protocols to support (DHCP, BOOTP and RARP). If you plan to use the ip= kernel parameter to specify an autoconfiguration protocol then make sure you have enabled the same protocol in the kernel configuration before you compile. If you plan to use the ip= kernel parameter to fully specify the networking parameters, then you can choose to enable none of the autoconfiguration protocols.

To use a RAM disk root file system, you must enable "RAM disk support" in "Block devices", and you must increase the default RAM disk size to something reasonable like 64 megabytes (remember, it has to hold the whole root file system). An alternative to increasing the default RAM disk size is to add the ramdisk= or ramdisk_size= kernel parameter to the APPEND line, giving it a value which is the actual size of the RAM disk in kilobytes. Also, be sure to enable "Initial RAM disk (initrd) support". This option only becomes visible after you have selected RAM disk support to be compiled into the kernel.
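
If you go the kernel parameter route, the value can be read off the uncompressed image file. A small sketch follows; the function name image_size_kb is hypothetical.

```shell
#!/bin/sh
# image_size_kb (hypothetical helper): print the size of an uncompressed
# file system image in kilobytes, rounded up, for use as the value of the
# ramdisk_size= kernel parameter.
image_size_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Example: echo "ramdisk_size=$(image_size_kb rootfs)"
```

For a 64 megabyte image this works out to ramdisk_size=65536.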

If you plan to use an NFS root file system, then you should also enable "Root file system on NFS" in the "Network File Systems" menu under the "File Systems" menu. This option only becomes visible after you have enabled "IP: kernel-level configuration support".

By and large, I figure you might as well turn off loadable module support completely, since there really won't be room in the tiny root file system that fits in the RAM disk for a lot of them anyway. But I'm sure there are folks out there just waiting to prove me wrong, so I won't tempt them by making a stronger statement on the subject. Suffice it to say that when I boot a node from the network, it gets a bare bones kernel with just the drivers it needs and no loadable module support.

Once you have your kernel configured, build the compressed bootable image with make bzImage and copy this file into /tftpboot. In the examples above I have assumed this file is called /tftpboot/linux.

Building the RAM disk root file system

Now comes the real art: paring down your root file system to the bare essentials. Most Linux distros come with so much crap these days that the root file system is running into hundreds of megabytes. But the kernel itself is happy in a much more confined space, and the process of removing the cruft is a pleasure analogous to pulling the weeds out of your garden. You'll probably discover lots of stuff you weren't using in there, and certainly a lot of stuff that your cluster node or X terminal will never use. If rpm is your package manager, you might find the --root option helpful. For example
rpm --root /loop -e cruft-0.95.647-4.7pl1407beta
will remove the package named "cruft" from the file system rooted at /loop.

To make life easier for myself, I wrote a makefile as an aid for developing a bare bones root file system:

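# All targets except newfs name actions, not files.
.PHONY: mount copyfs umount check compress

# Mount the current root file system image on /loop (ignore the error
# if it is already mounted).
mount: rootfs
	mount -o loop rootfs /loop || true

# Create a zero-filled 64 megabyte file and build an ext2 file system on it.
newfs:
	dd if=/dev/zero of=newfs bs=1k count=65536
	mke2fs -F -L ROOT newfs < /dev/null

# Copy everything from the old file system into the fresh one.
copyfs: newfs mount
	mount -o loop newfs /loop1
	cp -a /loop/* /loop1
	umount /loop1

umount:
	umount /loop

# Replace the old image with the fresh copy and check it.
check: copyfs umount
	mv newfs rootfs
	e2fsck -f rootfs

# Compress the image and put it where the TFTP server can find it.
compress: check
	gzip -c rootfs | dd of=/tftpboot/rootfs.gz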
This makefile assumes the directories /loop and /loop1 exist, and /loop will be the mount point of the new root file system during development. The makefile implements the following strategy:
  1. To create a file system, fill a file with zeros and then use mke2fs to build a file system on the file. By filling it with zeros first, we make sure that it will compress efficiently.
  2. To mount the file system, use a loopback device.
  3. After a file system has been mounted a while, it probably has had things added to and removed from it, so it will no longer compress as well as a "fresh" one will. So before compressing, create a new file system and copy everything from the old one into it. This should create a file system that will compress as well as possible.
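
The point about zeros is easy to check for yourself without touching a real file system. The following sketch compares a zero-filled file with a file of the same size filled with random data, which here stands in for the stale blocks that deleted files leave behind. The sizes are arbitrary and the file names are temporary.

```shell
#!/bin/sh
# Compare how well a zero-filled file compresses against a file of the
# same size filled with random data.
zeros=$(mktemp)
noise=$(mktemp)

# 1 megabyte of zeros, as "make newfs" does on a larger scale.
dd if=/dev/zero of="$zeros" bs=1k count=1024 2>/dev/null
# 1 megabyte of random bytes, standing in for stale data blocks.
dd if=/dev/urandom of="$noise" bs=1k count=1024 2>/dev/null

zeros_gz=$(gzip -c "$zeros" | wc -c)
noise_gz=$(gzip -c "$noise" | wc -c)
echo "zeros compress to $zeros_gz bytes, random data to $noise_gz bytes"

rm -f "$zeros" "$noise"
```
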
To get started, run
make newfs
mv newfs rootfs
make
and you will have a pristine 64 megabyte ext2 file system mounted on /loop. You should cd into it and make a few directories
cd /loop
mkdir bin dev etc lib mnt proc root sbin tmp usr var
and then start copying in the files you will need. Determining exactly which files these are is a bit of an art and depends a lot on your application. Some guidelines can be found in the Linux Bootdisk HOWTO. You should plan on mounting /usr read-only from an NFS server (e.g. the cluster frontend). Since most of the important executables are located there, this considerably reduces the size required for the root file system. I have quite usable cluster nodes and X terminals with 64 megabyte root file systems that are half empty.
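
One thing that helps when deciding what goes into /lib is ldd, which lists the shared libraries an executable needs. Here is a sketch of a filter that pulls the library paths out of ldd's output so they can be copied under /loop; the function name lib_paths is hypothetical.

```shell
#!/bin/sh
# lib_paths (hypothetical helper): read the output of ldd on stdin and
# print the absolute path of each shared library it mentions, one per line.
lib_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }'
}

# Example: list the libraries /bin/sh needs.
# ldd /bin/sh | lib_paths
```

Only the executables that run before /usr is mounted, and the libraries they need, have to live in the RAM disk.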

Once the root file system is fully populated with everything you absolutely need and nothing else (or you're just ready to try another iteration), cd back to the directory containing the makefile and run

make compress
which will create the file /tftpboot/rootfs.gz. If you want to modify the root file system, just run make mount again and it will reappear under /loop.

Some words about startup scripts

The startup scripts that come with most distros make the strong assumption that your computer will keep its file systems on some kind of local hard disk drive. Since this is probably true in over 99% of installs, it makes sense for them to do this. But for this application, we can do without almost all of the hard work these scripts do to validate the file systems.

For example, on RedHat derived distros, the script /etc/rc.d/rc.sysinit will want to start swapping and fsck all the file systems before mounting them. Since we have no disk drives attached, we have no swapping and nothing to fsck. All of these operations need to be removed from this script before you put it in your RAM disk root file system. It's a chance to really read these scripts and understand everything that happens on boot up. Here's a brief outline of what RedHat runs on startup:

  1. /etc/inittab instructs init to run /etc/rc.d/rc.sysinit on startup. This is the script you need to make the most changes to; generally speaking you want to remove the majority of it.
  2. /etc/inittab tells init the default runlevel, and then init calls /etc/rc.d/rc with the runlevel number as its only argument.
  3. /etc/rc.d/rc finds the directory /etc/rc.d/rc${runlevel}.d and first runs all scripts starting with "K" in order (kill scripts), then runs all scripts starting with "S" in order (start scripts). These scripts are all usually symbolic links to scripts in /etc/rc.d/init.d/, and the symbolic links are maintained by running the program chkconfig, which uses the contents of comments at the beginning of a script to determine which runlevels it should be started and killed on.
  4. Depending on the contents of /etc/inittab, mingettys on virtual consoles or gettys on real serial ports can be started up. This is the point at which init would normally launch xdm on an X terminal.

Multicast TFTP

TFTP is a protocol built on UDP, which means there are not really any reliable connections at all. In order to read a file, a client sends a request to a server on UDP port 69, and this request contains the port number where the client expects replies to be sent. After receiving the read request, the server forks and the child process sends responses from some high-numbered UDP port on the server to the specified client port, and the parent process exits. The meta-daemon (inetd or xinetd) should wait for the parent tftpd process to exit before re-opening the listening socket on port 69.

Multicast TFTP is an attempt to allow several clients to download the same file simultaneously through the use of multicast packets. Instead of forking a process for every TFTP client, a multicast server would accept requests from many clients and send file blocks to all of them at once by grouping them into a single multicast address. PXE includes support for multicast TFTP to allow for the possibility of many nodes booting simultaneously; however, tftp-hpa does not.

To avoid overloading the server in the absence of multicast TFTP, one can automate the process of booting nodes sequentially. The etherwake program by Donald Becker can be very helpful by allowing you to use the "Wakeup On LAN" feature available in most modern NICs.
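
A sequential wake-up can be as simple as a loop with a pause in it. The sketch below makes some assumptions: the function name wake_nodes is hypothetical, the 30 second delay in the example is a guess you will want to tune, and WAKE_CMD must be set to however the etherwake program is invoked on your system (it is often installed as ether-wake).

```shell
#!/bin/sh
# wake_nodes (hypothetical helper): wake nodes one at a time, pausing
# between them so that they do not all hit the TFTP server at once.
# Arguments: delay in seconds, then one MAC address per node.
# WAKE_CMD is an assumption about how etherwake is installed.
WAKE_CMD=${WAKE_CMD:-ether-wake}

wake_nodes() {
    delay=$1
    shift
    for mac in "$@"; do
        echo "waking $mac"
        $WAKE_CMD "$mac"
        sleep "$delay"
    done
}

# Example: wake_nodes 30 fe:ed:fa:ce:de:ad be:ef:fe:ed:fa:ce
```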

Putting it all together

OK, once you've got all the bits together:
  1. PXE-enabled DHCP server
  2. TSIZE-understanding TFTP server
  3. Linux kernel that has
    1. Ethernet device driver
    2. IP: Kernel-level configuration support (possibly including some of DHCP, BOOTP or RARP)
    3. (RAM disk support+Initial RAM disk support) or (NFS root support)
  4. (a compressed root file system image) or (a working NFS export)
you're ready to boot from the network. The boot sequence is as follows
  1. PXE ROMs are selected as the boot device.
  2. PXE ROMs DHCP an IP address, the address of a boot server, and the name of a boot loader.
  3. PXE ROMs stash the IP configuration into a known location in memory, download the boot loader (pxelinux) and start it.
  4. pxelinux reads the IP configuration and downloads a corresponding configuration file.
  5. pxelinux gets the name of the compressed Linux kernel and compressed RAM disk image from the configuration file, downloads those and starts the Linux kernel with parameters from the configuration file.
  6. The Linux kernel configures its network interface from the ip= parameter, possibly starting another round of DHCP in the process.
  7. The Linux kernel uncompresses the RAM disk image and mounts it as root, then looks for the script /linuxrc.
  8. When it doesn't find /linuxrc, the kernel continues the boot up process with the RAM disk still mounted as root.
  9. The kernel tries to mount a "real" root file system, which turns out to be the same RAM disk because of the root= kernel parameter from the pxelinux configuration file.
  10. The kernel NFS mounts /usr read-only, and boot up completes normally.
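
The NFS mount of /usr in the last step is normally driven by /etc/fstab in the RAM disk. The following sketch prints a minimal fstab for a node. The function name node_fstab, the export 192.168.1.1:/usr and the mount options are all assumptions, and whether you need the root entry at all depends on what is left of your rc.sysinit.

```shell
#!/bin/sh
# node_fstab (hypothetical helper): print a minimal /etc/fstab for a
# diskless node.  Argument: the NFS server and path that holds /usr,
# for example 192.168.1.1:/usr (an assumed export).
node_fstab() {
    # printf reuses the format for each group of five arguments.
    printf '%-20s %-8s %-6s %-10s %s\n' \
        /dev/ram0 / ext2 defaults '0 0' \
        none /proc proc defaults '0 0' \
        "$1" /usr nfs ro '0 0'
}

node_fstab 192.168.1.1:/usr
```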

Resources

Boot ROMs

Boot loaders

HOWTOs

There are no fewer than five HOWTOs on the subject of diskless computers and remote booting at the Linux documentation project.

Software

RPM packages for RedHat Linux 7.3

These are some RPMs I rolled from the source distributions of the software listed above. (N.B.: the version of the ISC DHCP server that ships with RedHat Linux 7.3 is 2.0, which does not support the PXE extensions, so you'll need this one even if you already have that one.) The sample dhcpd.conf file that comes with the dhcp package is an example of how to configure the server for a four-node diskless cluster.
Package                Source RPM                i386 binary
ISC DHCP server v3.0   dhcp-3.0pl1-1.src.rpm     dhcp-3.0pl1-1.i386.rpm
tftp-hpa               tftp-hpa-0.30-1.src.rpm   tftp-hpa-0.30-1.i386.rpm

Print magazine articles

Miscellaneous links