Virtualisation

Warning

This chapter is a work in progress and may be vastly incomplete.

This section will provide a guide on how to offer virtual, isolated environments to users. But first, we will discuss the different types of environments that exist and their respective advantages and disadvantages.

Types of Virtualisation

What is Virtualisation?

Roughly speaking, the idea of virtualisation is to provide an isolated environment (from now on called the Guest) in which code can run whose trust level is less than or equal to that of the surrounding system (from now on called the Host). More abstractly, virtualisation provides a (more or less) well-defined Interface for a Guest to operate on. This interface may be PC hardware, the POSIX API or the System Call interface of the Linux kernel.

Seen under this light, one could consider the Linux kernel as a type of virtualisation itself: for each process running under the kernel, a well defined interface is provided. The processes cannot (except by well-defined means) interact with each other, and they do not need to care about the hardware specifics.

Full Virtualisation

See also

The Wikipedia article on Full Virtualisation. Some of the content in this section is based on that article.

A naive (but already very complex) type of virtualisation is Full Virtualisation. Here, the interface exposed to the Guest is that of actual hardware: CPUs, along with memory and peripherals like hard drives, are emulated in their entirety by the Host system. Naturally, this type of virtualisation is very slow: the Host system needs to spend several cycles for each Guest operation which would normally be handled directly by hardware, which incurs a significant slowdown.

The costs of Full Virtualisation can be mitigated with hardware support on the Host system. On modern x86_64 architectures, for example, there are processor extensions which allow Full Virtualisation of other x86-based CPUs at relatively low cost. This means that the Host system only needs to emulate peripherals. This is also called Hardware-assisted virtualisation.

The costs for emulating peripherals can be reduced further by emulating not standard hardware, but hardware specialised for virtualisation. For example, there is no need to emulate a PCI bus for a network interface card if the Guest system knows it runs in a virtualised environment. A much simpler interface would be “Write each Ethernet frame you want to send to a specific page and then trigger interrupt number X”. While that is not exactly how, for example, virtio devices are implemented, it is the general idea: remove hardware specifics and reduce the interface to the absolute minimum (overhead-wise). virtio is very common in the QEMU/KVM-Host-Linux-Guest world (which we will deal with).

This is already crossing the separation to Paravirtualisation, so let’s go on there.

Paravirtualisation

See also

The Wikipedia article on Paravirtualisation. Some of the content in this section is based on that article.

As hinted at in the previous section, Paravirtualisation is the type of virtualisation you get when you can make certain agreements with the Guest. The Guest cannot be an unmodified thing (e.g. an Operating System) which would run on the interface the virtualisation aims to provide (e.g. PC hardware): it needs to know about the virtualisation for this to work.

Specialised virtual hardware, as mentioned in the previous section, is a common way to do this. It also means that an unmodified Guest OS may still work partially, but will not be able to interface with some of the hardware.

Containers

Containers are really a special form of Paravirtualisation where the interface which is virtualised is the kernel of the Host operating system. In the case of Linux containers, this is obviously Linux.

I will go into a bit more detail on Linux here. On a Linux system, we already have well-separated processes, so some argue that is all the isolation you need. However, there are shared resources which benefit from being isolated too, which is why Linux Namespaces were invented. A Namespace is basically an isolated version of a resource space, for use only by processes which live in that Namespace.

There are namespaces available for Cgroups (which are by themselves a powerful tool to control resource usage of processes), Inter-Process Communication, Networking, the Filesystem tree (the Mount namespace), Process IDs, User- and Group IDs, and finally basic system information like the Hostname (UTS namespace).

By starting a process in a new namespace for each of these, we essentially get an isolated Guest Linux system which cannot interact with the Host system much. It is not that simple though, which is why there are tools which manage this type of container.
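
To get a feeling for what container tooling does under the hood, here is a minimal sketch using the unshare tool from util-linux; the hostname and shell chosen below are arbitrary examples.

## create new UTS, PID, network and mount namespaces and run a shell in them
## (--fork and --mount-proc make sure the shell becomes PID 1 with a matching /proc)
# unshare --uts --pid --net --mount --fork --mount-proc /bin/bash
## inside: changing the hostname only affects the new UTS namespace
# hostname in-a-namespace
## only processes in the new PID namespace are visible,
## and no Host network interfaces show up
# ps aux
# ip link show

Container managers essentially automate this, and additionally set up a root filesystem, Cgroups and the network plumbing.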

See also

The namespaces manpage (section 7)
It documents the behaviour of Linux Namespaces in more detail, but also more technically. An online version is available on man7.org. It also has links to documentation of the individual namespaces.

Trade-offs, or Deciding for a Type of Virtualisation

We have now discussed multiple forms of Virtualisation. Each has its merits and purpose in existence, but for a specific use-case, we have to decide which type of virtualisation to use. So here is a quick summary of what you can do with each type of virtualisation.

Feature                                     Full Virtualisation   Paravirtualisation   Linux Containers
Can support “any” Guest OS                  yes                   no (1)               no
Free choice of supported Guest OS by user   yes                   yes                  no
Overhead                                    very high             medium               low
Isolation level                             highest (3)           high                 it depends (2)

Notes:

  1. Paravirtualisation can support many OSes, but it requires explicit support in the Guest. So it is unlikely that you will be able to run your copy of MS-DOS from the ’90s on virtio.
  2. The isolation level highly depends on the configuration of the individual container.
  3. Even with Full virtualisation, the isolation between Guest and Host, as well as between different Guests may be broken by (often performance-enhancing) technologies employed on the Host, for example Kernel Samepage Merging.

So as a rule of thumb: If you have the resources and you need strong isolation (for example to host virtual machines for arbitrary third-party users), use Full or Paravirtualisation. If you need to isolate individual services or simply need to provide a specific environment for a service to run in, Containers are the tool of choice. For anything in between, for example if you need strong isolation but don’t have the resources for Paravirtualisation, you will have to weigh accepting less isolation against saving resources elsewhere.

In general, the trade-off is between resources saved and isolation achieved. More isolation means more resources need to be expended, simply because more things need to be emulated instead of re-used.

The remainder of this section will deal with setting up virtual machines and containers on a modern Linux system.

Introduction to libvirt

libvirt is, in my opinion, the tool to manage both Containers and QEMU/KVM Virtual Machines on Linux. Before we go into details, let us first describe the low-level tools which actually create the different types of virtualisation on Linux.

Tools for Virtualisation on Linux

For Full and Paravirtualisation, there is QEMU/KVM. KVM stands for Kernel-based Virtual Machine and, as you might guess from the name, uses Kernel and hardware support to achieve fast virtualisation where the specific Host and Guest platform combination supports it. QEMU is a frontend and also a set of implementations for Full and Paravirtualisation of different platforms. There are full emulators for MIPS, PowerPC, SPARC, ARM and of course x86, but in general you’re better off using KVM where possible. KVM makes use of the platform’s hardware-assisted virtualisation extensions, if available; as discussed, this greatly reduces the overhead induced by virtualisation.
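
As a quick illustration (not using libvirt yet), the following checks whether the CPU exposes virtualisation extensions and then boots a Guest directly with QEMU/KVM; the image name guest.img and the sizing are placeholders, not anything this guide sets up.

## a non-zero count means the CPU advertises Intel VT-x (vmx) or AMD-V (svm)
# egrep -c '(vmx|svm)' /proc/cpuinfo
## boot an existing disk image with KVM acceleration, using virtio for disk and network
# qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -drive file=guest.img,if=virtio,format=raw -netdev user,id=net0 -device virtio-net-pci,netdev=net0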

For Containers, there is LXC. It manages the creation of namespaces and Cgroups to isolate the Linux-based operating systems running in your containers. There also is tooling to bootstrap different Distributions.
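
For illustration, creating and entering a container with the plain LXC tools could look roughly like this; the container name testct is arbitrary, and the exact template names and arguments vary between LXC versions.

## bootstrap a Debian container using the generic "download" template
# lxc-create --name testct --template download -- --dist debian --release jessie --arch amd64
# lxc-start --name testct --daemon
## get a shell inside the running container
# lxc-attach --name testct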

While the LXC user interface is slightly more convenient than QEMU/KVM, I prefer to have all my virtualisation managed by a single entity. This helps with setting up coherent and reusable networking and firewalling, which is why I am advocating the use of libvirt.

By the way, tools like LXC and QEMU which effectively run the virtualised Guest operating system are called Hypervisors. There are other Hypervisors for Linux, but they are at least partially proprietary or simply based on LXC or QEMU, which is why I will not go into detail on those.

What is libvirt?

libvirt is a set of tools and libraries which manage different Hypervisors through a single, XML-based interface. Virtual Machines (or Domains in the libvirt language), Networks between those and the Host, as well as everything else related to virtualisation are managed through XML definitions. Those are well-documented on the libvirt website.
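
This single interface is most visible in the virsh command line client: the same commands work against different Hypervisor drivers, selected via a connection URI. A small sketch (the lxc:/// driver is only available if libvirt was built with LXC support):

## list Domains known to the QEMU/KVM driver
# virsh -c qemu:///system list --all
## the same command against the LXC driver
# virsh -c lxc:/// list --all
## show what the Host and the Hypervisor support
# virsh capabilities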

Networking with libvirt

For basic use-cases like simple NAT or routing, the network capabilities of libvirt will be sufficient for you. However, port forwarding in a NAT scenario is not available with libvirt: it needs to be implemented manually with iptables rules, as discussed in the section on Firewalling.
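
To give an idea of what such manual rules look like, here is a minimal sketch assuming the default libvirt NAT network and a Guest at 192.168.122.10 with an SSH server on port 22; the addresses and ports are examples only, and the Firewalling section covers this properly.

## redirect TCP port 2222 on the Host to the Guest's SSH port
# iptables -t nat -A PREROUTING -p tcp --dport 2222 -j DNAT --to-destination 192.168.122.10:22
## insert an ACCEPT rule before libvirt's REJECT rules so the forwarded traffic reaches the Guest
# iptables -I FORWARD -d 192.168.122.10 -p tcp --dport 22 -j ACCEPT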

Example Use Cases

Note

On Debian 8 (Jessie), the packages required for libvirt with QEMU/KVM can be installed via:

apt install libvirt0 libvirt-bin

So now that we have discussed the available tooling, let us go through a few “simple” use cases.

Setting Up a Network with libvirt

Requirements

  • You have the 203.0.113.177/29 IPv4 network (8 addresses) and the 2001:db8:e2f3:e12d::/64 IPv6 network (see The Internet Protocol) routed to your Host.
  • You want to add virtual machines and/or containers to that network and assign them addresses.

Required steps

First, you would create a libvirt network in which the machines can live. We don’t need NAT here, so forwarding is trivial. A network could look like this (we assume that eth0 is the network interface to the internet):

<network>
  <name>for-all-vms</name>
  <!-- use forwarding mode here, no NAT required -->
  <forward dev='eth0' mode='route'>
    <interface dev='eth0'/>
  </forward>
  <!-- this defines the name for the bridge interface
       used for the guests -->
  <bridge name='guests1' stp='on' delay='0'/>
  <ip address='203.0.113.177' prefix='29'>
    <dhcp>
      <range start='203.0.113.178' end='203.0.113.182'/>
    </dhcp>
  </ip>
  <ip family='ipv6' address='2001:db8:e2f3:e12d::1' prefix='64'>
  </ip>
  <ip family='ipv6' address='fe80::1' prefix='64'>
  </ip>
</network>

We can define, start and autostart this network with libvirt by running (assuming the network is in a file called for-all-vms.xml):

# virsh net-define 'for-all-vms.xml'
## note: `for-all-vms` is the name defined in the <name/> element,
## it has nothing to do with the filename!
# virsh net-start 'for-all-vms'
# virsh net-autostart 'for-all-vms'

You should now see the new bridge, along with its addresses and routes:

# brctl show guests1
bridge name  bridge id               STP enabled     interfaces
guests1              8000.52540025b71a       yes             guests1-nic

# ip address show dev guests1
3: guests1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 52:54:00:25:b7:1a brd ff:ff:ff:ff:ff:ff
    inet 203.0.113.177/29 brd 203.0.113.183 scope global guests1
       valid_lft forever preferred_lft forever
    inet6 2001:db8:e2f3:e12d::1/64 scope global tentative
       valid_lft forever preferred_lft forever
    inet6 fe80::1/64 scope link tentative
       valid_lft forever preferred_lft forever

# ip route show dev guests1
203.0.113.176/29  proto kernel  scope link  src 203.0.113.177

As you can see, libvirt took care of setting up the bridge interface for us, including addresses and routes. It has even defined iptables rules for forwarding:

# iptables-save
*mangle
:PREROUTING ACCEPT [389:27040]
:INPUT ACCEPT [389:27040]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [239:26500]
:POSTROUTING ACCEPT [239:26500]
-A POSTROUTING -o guests1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
COMMIT
*filter
:INPUT ACCEPT [389:27040]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [248:27444]
-A INPUT -i guests1 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -i guests1 -p tcp -m tcp --dport 53 -j ACCEPT
-A INPUT -i guests1 -p udp -m udp --dport 67 -j ACCEPT
-A INPUT -i guests1 -p tcp -m tcp --dport 67 -j ACCEPT
-A FORWARD -d 203.0.113.176/29 -i eth0 -o guests1 -j ACCEPT
-A FORWARD -s 203.0.113.176/29 -i guests1 -o eth0 -j ACCEPT
-A FORWARD -i guests1 -o guests1 -j ACCEPT
-A FORWARD -o guests1 -j REJECT --reject-with icmp-port-unreachable
-A FORWARD -i guests1 -j REJECT --reject-with icmp-port-unreachable
-A OUTPUT -o guests1 -p udp -m udp --dport 68 -j ACCEPT
COMMIT

# ip6tables-save
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [6:636]
-A INPUT -i guests1 -p udp -m udp --dport 547 -j ACCEPT
-A INPUT -i guests1 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -i guests1 -p tcp -m tcp --dport 53 -j ACCEPT
-A FORWARD -d fe80::/64 -i eth0 -o guests1 -j ACCEPT
-A FORWARD -s fe80::/64 -i guests1 -o eth0 -j ACCEPT
-A FORWARD -d 2001:db8:e2f3:e12d::/64 -i eth0 -o guests1 -j ACCEPT
-A FORWARD -s 2001:db8:e2f3:e12d::/64 -i guests1 -o eth0 -j ACCEPT
-A FORWARD -i guests1 -o guests1 -j ACCEPT
-A FORWARD -o guests1 -j REJECT --reject-with icmp6-port-unreachable
-A FORWARD -i guests1 -j REJECT --reject-with icmp6-port-unreachable
COMMIT

This is quite handy, because now we have a network we can use for any virtual machine or container.
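
In a Domain definition, such a network can be referenced either by its bridge name or, more conveniently, by its libvirt name, in which case libvirt picks the bridge and a tap device for you. A minimal interface snippet of the latter kind could look like this (the Domain template later in this chapter references the bridge guests1 directly instead):

<interface type='network'>
  <source network='for-all-vms'/>
  <model type='virtio'/>
</interface>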

Host Virtual Machines for Third-Party Users

Requirements

  • You have the network from the previous use-case set up.
  • Each individual machine should be addressable by a single IPv4 and a single IPv6 address.
  • Users should be able to pick their own operating system, as long as it runs on x86_64.
  • Users should be able to start/stop/reset their virtual machine at will.
  • Users should be able to get a virtual console to debug when the network setup on their machine is broken or during install.

Required steps

First, you should talk to your users about what they need. Important questions to ask your users and then yourself (do you want to allow that specific behaviour on your infrastructure?):

  • Which Operating System do they want to run?
  • How many virtual machines? And for each virtual machine:
    • How many CPU cores?
    • How much Memory?
    • How much Disk Space?
    • How many network interfaces?
      • How shall these network interfaces be connected?
  • Do they need Nested Virtualisation (i.e. do they want to run Paravirtualised Guests inside their Guests)? A quick way to check and enable this on the Host is sketched after this list.
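
Whether Nested Virtualisation works depends on the nested parameter of the kvm_intel (or kvm_amd) module on the Host. A minimal sketch for an Intel machine; the file name kvm-nested.conf is an arbitrary choice:

## Y (or 1 on older kernels) means nested virtualisation is enabled
# cat /sys/module/kvm_intel/parameters/nested
## enable it persistently and reload the module
## (all Guests using KVM must be shut down before the reload)
# echo 'options kvm_intel nested=1' > /etc/modprobe.d/kvm-nested.conf
# modprobe -r kvm_intel && modprobe kvm_intel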

Once you have answers to these questions, you can start writing a template for the virtual machines. It could look like this:

<domain type='kvm'>
  <name>MACHINENAME</name>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <vcpu placement='static'>8</vcpu>
  <resource>
    <partition>/machine/USERNAME</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.1'>hvm</type>
    <bios useserial='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model'>
    <model fallback='allow'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled='yes'/>
    <suspend-to-disk enabled='yes'/>
  </pm>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/vg_main/LOGICAL_VOLUME'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/CDIMAGE'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <boot order='2'/>
    </disk>
    <interface type='bridge'>
      <mac address='52:54:00:0e:7c:MACSUFFIX'/>
      <source bridge='guests1'/>
      <target dev='INTERFACENAME'/>
      <model type='virtio'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <graphics type='spice' port='PORT' autoport='no' passwd='PASSWORD'>
      <listen type='address' address='127.0.0.1'/>
    </graphics>
  </devices>
</domain>

A few notes:

  • The amount of memory in the template is fixed at 8 GiB (8388608 KiB); you can choose any value you like.
  • The ALL CAPS things need to be replaced for each VM.
  • The template uses LVM volumes for each guest as backing store; you may use any type of device supported by libvirt you like.
  • The CDIMAGE is supposed to be something bootable which boots the installer.
  • For installation, you will need to make it possible for users to connect to 127.0.0.1:PORT on the VM host. You can achieve that by granting them restricted SSH access for port forwarding (a sketch follows after this list). I would not trust SPICE or VNC to run unencrypted over the public internet, so this seems like a viable option. SSH access will come in handy for users to manage their VMs anyway.
  • The network definition as-is does not prevent any kind of ARP or DHCP spoofing. Do not use with malicious guests.
  • This is a Work-in-Progress.
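
Once the placeholders are filled in, defining and starting a Domain works just like for the network; the file name machine1.xml and the Domain name machine1 below are examples for a filled-in template.

# virsh define 'machine1.xml'
# virsh start 'machine1'
# virsh autostart 'machine1'
## attach to the serial console defined in the template
# virsh console 'machine1'

As for the SPICE access mentioned in the notes, a user could reach the graphics console through an SSH tunnel roughly like this, with PORT as chosen in the template and vmhost.example.org standing in for your Host; any SPICE client, for example remote-viewer, can then connect to 127.0.0.1:PORT.

ssh -N -L PORT:127.0.0.1:PORT user@vmhost.example.org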