Lately, I have been digging into the OCI Runtime specification and
runc, its reference implementation written in Go. Although I have been
working with containers for as long as Kubernetes has existed, I must now admit
that the runtime aspect of the standardization effort, which Linux containers
underwent throughout the existence of the Open Container Initiative (OCI), went
largely unnoticed by me.
Although I see this as a testament to how skillfully the standardization was handled, without massive disruptions to the platforms that rely on container technology, it was about time for me to fill that knowledge gap.
In case you aren’t familiar with the separation of concerns between a container manager and a container runtime, I highly recommend reading Journey From Containerization To Orchestration And Beyond, by Ivan Velichko. It is a clear and unambiguous entry point into this rabbit hole, with a bunch of external links to additional high-quality resources.
Operations of an OCI Runtime
A container runtime must implement the following five self-explanatory operations to be considered compliant with the OCI Runtime specification:
- State: query the current state of the container
- Create: create the container from a bundle, without running the user-specified program yet
- Start: run the user-specified program inside a created container
- Kill: send a signal to the container’s process
- Delete: delete a stopped container and the resources it holds
These operations are abstract. The specification does not mandate any particular command-line API for a CLI runtime like runc to implement, although some efforts exist to specify what compliance means in terms of command-line interface. This felt confusing to me at first.
Here is a sample visualization of a container lifecycle that leverages all five of these operations:
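In rough textual form, and leaving error paths aside, such a lifecycle could be sketched as follows (a simplified rendition; the exact transitions are described in the specification’s lifecycle section):

```
             create            start             exit / kill            delete
  (bundle) ─────────► created ────────► running ──────────────► stopped ────────► (removed)

  state: can be queried at any point while the container exists
```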
In the rest of this article, I will focus solely on what exactly happens in the Create and Start phases.
Step by step
As seen in the previous illustration, running a containerized process through
an OCI runtime happens in two steps: create, start. For someone who typically
interacts with high-level container runtimes such as Docker, it is tempting to
conflate these OCI operations with commands such as docker
[create|start]. This can be deceptive because those aren’t equivalent, despite
having similar semantics. To understand why, we have to inspect what happens to
OS processes when both of these operations are executed on the Linux host.
All experiments below are performed against an OCI bundle I called
mybundle. It was generated from the contents of the Docker Hub image
docker.io/library/nginx, whose rootfs can be exported using an OCI image
manipulation tool like
crane. The directory structure of the bundle looks
like the following, where
config.json is the OCI runtime configuration file for
the container, as generated by
runc spec or a container runner/engine.
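A minimal sketch of that layout, assuming the nginx image was unpacked into a rootfs/ subdirectory (the directory that the default config.json points its root.path at):

```
mybundle
├── config.json
└── rootfs
    ├── bin
    ├── etc
    ├── usr
    └── ...
```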
runc requires a root directory, and a subdirectory matching the id of the
container (here mycontainer), to store its state between each operation. These are
usually managed by a container manager like
containerd, but we are operating
outside of the supervision of such a manager here.
Let’s now invoke runc create on our bundle:
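Assuming process.terminal was set to false in config.json (a terminal would otherwise require passing a --console-socket to runc create), the invocation looks like this and returns immediately:

```
$ sudo runc create --bundle ./mybundle mycontainer
```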
Now let’s query its state:
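The state operation prints a JSON document along these lines (paths, pid and timestamp are illustrative):

```
$ sudo runc state mycontainer
{
  "ociVersion": "1.0.2-dev",
  "id": "mycontainer",
  "pid": 19374,
  "status": "created",
  "bundle": "/root/mybundle",
  "rootfs": "/root/mybundle/rootfs",
  "created": "2023-08-17T19:05:00.000000000Z"
}
```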
One intriguing detail is that the container state already includes a process
id (19374), although its status is
created and not
running. This is the
first major difference with
docker container create or its clones, which
create a container from a specified image without starting it.
Now let’s check the running processes. Inside a WSL instance based on Ubuntu, the process tree looks like below:
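Something along these lines should appear (only the relevant entry is shown, with the pid matching the one reported above):

```
$ ps -o pid,command -p 19374
    PID COMMAND
  19374 runc init
```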
This output includes a second intriguing but significant detail: the pid
corresponds to a
runc init process, which has nothing to do with the nginx process we would expect the container to run.
Let’s see what happens to this process when we invoke runc start:
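The start operation only needs the container id:

```
$ sudo runc start mycontainer
```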
For a very brief instant, we can witness the container’s entrypoint command being executed with the same process id, before it immediately disappears.
Let’s query its state one more time:
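Abridged to the relevant fields (the full JSON document has the same shape as before):

```
$ sudo runc state mycontainer
{
  "id": "mycontainer",
  "status": "stopped",
  ...
}
```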
The container now has the
stopped status and its process id is no longer reported.
Before we dive a little further, let’s try to explain the behaviour we just
observed. For this, we will be using one of
runc’s higher level commands:
runc run. This command doesn’t directly map to any of the OCI runtime
operations previously enumerated, but can be roughly described as a create
immediately followed by a start, with subtle differences.
First, we must delete the terminated container:
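Deleting only works here because the container is already stopped (a running container would require --force):

```
$ sudo runc delete mycontainer
```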
Then, let’s invoke runc run the same way we invoked
runc create earlier:
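Same bundle and container id as before, but through the combined command:

```
$ sudo runc run --bundle ./mybundle mycontainer
```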
This time around, the command doesn’t return. Instead, the standard output of
the container’s init process (
nginx) is being printed to the standard output
of our terminal:
/docker-entrypoint.sh: Configuration complete; ready for start up
2023/08/17 19:12:34 [notice] 1#1: using the "epoll" event method
2023/08/17 19:12:34 [notice] 1#1: nginx/1.25.1
2023/08/17 19:12:34 [notice] 1#1: built by gcc 12.2.1 20220924 (Alpine 12.2.1)
2023/08/17 19:12:34 [notice] 1#1: OS: Linux 220.127.116.11-microsoft-standard-WSL2
2023/08/17 19:12:34 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1024:1024
2023/08/17 19:12:34 [notice] 1#1: start worker processes
In a separate terminal window, let’s inspect the state of the container, as well as the process tree:
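Something along these lines should be observed (pids are illustrative and will differ between runs):

```
$ sudo runc state mycontainer | grep -E '"pid"|"status"'
  "pid": 19452,
  "status": "running",

$ ps -o pid,ppid,command -p 19452
    PID    PPID COMMAND
  19452   19448 nginx: master process nginx -g daemon off;
```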
The process id reported by
runc state is now shown with the running status.
The corresponding process is visible in the process tree, and its command is
nginx: master process, per the container’s configuration.
As you might have already guessed, the differences we observed during this experiment are related to standard I/O streams (stdin, stdout, stderr):
- With runc run, the container’s init process (nginx) remains parented to runc, which was forked from the shell, and therefore inherited the I/O streams of that shell (foreground mode).
- With runc start, the container’s init process had already been orphaned and re-parented to the closest “init” process on the host, then reaped by it immediately after exiting due to a failed write to its closed stdout.
With a container manager such as
containerd in the picture, the latter is
circumvented by interposing a “container runtime shim” process between the
manager and the container process. Again, the role of the runtime shim is very
well described by Ivan in Journey From Containerization To Orchestration And
Beyond so I am not going to expand further on that topic.
But explaining why the process exited in detached mode isn’t what I’m interested in here. Let’s move on and see what actually happens in these two distinct phases of an OCI container’s startup (Create, Start).
Multiple init phases
One might wonder what this two-phase container startup flow is good for, if
the container can technically be started in one step; the OCI Runtime specification
doesn’t expand on the intentions behind this design. I believe that this
question can be answered by drawing a parallel with some of the UNIX APIs for process creation.
Interlude: UNIX processes
In UNIX systems, running a program that is different from the calling program requires two system calls:
- fork() to create an (almost) identical copy of the calling (parent) process
- exec() to transform the currently running program into a different running program, without creating a new process
This distinction allows the calling program to run code between fork() and exec().
This is essential for a program like a UNIX shell, as it enables features such
as redirection of I/O streams and other process manipulations.
For instance, when shell commands such as
echo 'hi' >out.txt or
echo 'hi' |
wc are executed, the shell performs the following actions under the hood (a minimal C sketch follows the list):
- Creates a child process with fork()
- Closes the child’s stdout, and obtains a file descriptor for whichever destination stdout should be redirected to (in the example above: a file or a pipe)
- Assigns this file descriptor to the child’s stdout
- Runs the echo command by calling exec()
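Here is what that sequence may look like in C, reduced to the bare minimum (error handling for open() omitted, hypothetical output file out.txt):

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();                 /* (1) duplicate the calling process */
    if (pid < 0) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        /* Child: this code runs between fork() and exec(). */
        close(STDOUT_FILENO);           /* (2) close the inherited stdout */
        /* (3) open() reuses the lowest free descriptor, i.e. 1 (stdout) */
        open("out.txt", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        /* (4) replace the running program; the redirection survives exec() */
        execlp("echo", "echo", "hi", (char *)NULL);
        perror("execlp");               /* only reached if exec() failed */
        exit(EXIT_FAILURE);
    }
    waitpid(pid, NULL, 0);              /* parent: wait for the child to exit */
    return 0;
}
```

The key point is that steps (2) and (3) are only possible because the child gets to run arbitrary code after fork() and before exec().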
This concept is very powerful, and concisely explained inside the fifth chapter “Process API” of the (free) book Operating Systems: Three Easy Pieces, by Remzi and Andrea Arpaci-Dusseau.
OCI runtime init
With these essential concepts clarified, we now have enough context to understand what may happen between the Create and Start phases of an OCI container’s lifecycle.
In the first section of this article, we saw that
runc create had eventually
given birth to a new process which wasn’t the expected container application,
but runc itself. A careful study of libcontainer reveals that
this phase, called bootstrap, starts a new runc process with the command
/proc/self/exe init (the
runc executable itself). This process
inherits the Linux cgroups and namespaces of the future container init process
(here nginx), receives the OCI runtime configuration of the container
process to be executed from its parent, and remains in that state until the
Start phase is initiated1.
This is conceptually very similar to the
fork() UNIX syscall, in the context of a
fork() + exec() sequence.
At the end of the Create phase,
runc writes the state file state.json
inside the root directory referenced by the
--root CLI flag. This file
contains information such as the process id of the bootstrap process. It is read
by all subsequent
runc commands, as a source of truth to be able to determine
the current status of the container based on the state of its init process.
/run/runc/mycontainer
└── state.json
Additionally—and I deliberately omitted to mention it until now—a named pipe
exec.fifo was created inside that same root directory.
/run/runc/mycontainer
├── exec.fifo
└── state.json
This named pipe is what enables the host to communicate its intention to start
the container to the bootstrap process, and initiate the next phase: Start.
The write end of this named pipe is attached to the bootstrap process, while
its read end remains closed for now. The bootstrap process remains stuck on a
blocking write to exec.fifo, until
runc start gives its “go” by
reading from it.
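The blocking behaviour is easy to reproduce in isolation with a throwaway named pipe (hypothetical path, completely unrelated to runc’s own exec.fifo):

```
# terminal 1: plays the role of the bootstrap process
$ mkfifo /tmp/start.fifo
$ echo go > /tmp/start.fifo    # blocks until a reader opens the pipe...

# terminal 2: plays the role of runc start
$ cat /tmp/start.fifo
go                             # ...which immediately unblocks terminal 1
```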
At this stage, the caller—for instance a container manager like containerd—is
free to perform whatever additional step(s) it sees fit before starting the
container. In the wild, this often translates into setting up the container’s
network interface(s) by invoking a chain of CNI plugins2. After such
operations are complete, the caller may initiate the Start phase.
What happens in this second and final phase of the container’s startup is
essentially an exec() syscall that transforms the bootstrap process into the
container’s init process, triggered by a read from the aforementioned
exec.fifo named pipe. The bootstrap process already received the container’s
runtime configuration from the parent
runc process during the Create phase,
so there are no additional steps to be performed here.
Finally, the exec.fifo named pipe is deleted.
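Listing the container’s root directory again after a successful start should confirm it; only the state file remains:

```
$ sudo ls /run/runc/mycontainer
state.json
```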
Appendix: bootstrap process uncut
The entirety of the
initProcess struct—with a few irrelevant attributes
omitted for clarity—is exposed below for reference:
Without going into the detail of each nested attribute, it is worth highlighting a few of its properties:
- cmd describes the bootstrap command, i.e. the runc binary re-invoked as /proc/self/exe init:
  - The named pipe exec.fifo, which is used for communicating the beginning of the Start phase to the bootstrap process, has its file descriptor referenced in the environment variable _LIBCONTAINER_FIFOFD. It references the cmd.ExtraFiles item with the address (*os.File)(0xc000014be8), which incidentally is also referenced elsewhere in the struct.
- A de-serialized version of the OCI runtime configuration (config.json) is visible in the config field: container cmd, environment variables, working directory, etc.
  - A large part of this configuration is in fact hidden behind fields such as config.Capabilities, but expanding them here would add a lot of noise to the sample, and they are well described inside the OCI runtime specification.
  - This entire container configuration is communicated in JSON format to the bootstrap process over the UNIX socket pair messageSockPair, as soon as the bootstrap process is started. This data is critical as it allows the bootstrap process to exec() the container’s command with the expected environment during the Start phase.
- The container and process fields have some overlap with the config field. These are respectively the internal representation of the container (as seen by libcontainer), and a representation of the container’s init process specifically.
- The cgroups assigned to the bootstrap process, and therefore to the container’s init process, are visible in the struct’s cgroup manager field.
runc create actually forks twice during the bootstrap sequence to allow the second child to be started inside the final Linux namespaces, while the first child eventually exits. The detail of this flow would require a dedicated article. ↩
The CNI specification is not part of the Open Container Initiative (OCI). It is however a widely adopted standard within the Cloud Native Computing Foundation (CNCF) ecosystem, predominantly through the Kubernetes project. ↩