Containers and Pods 101

Authors: Chris Collins, Daniel J Walsh, Nived Velayudhan, Seth Kenlon

License: CC-BY-SA-4.0

We are Opensource.com

Opensource.com is a community website publishing stories about creating, adopting, and
sharing open source solutions. Visit Opensource.com to learn more about how the open
source way is improving technologies, education, business, government, health, law,
entertainment, humanitarian efforts, and more.

Do you have an open source story to tell? Submit a story idea at opensource.com/story

Email us at open@opensource.com
Table of Contents

A sysadmin's guide to containers
3 steps to start running containers today
How I build my personal website using containers with a Makefile
Podman: A more secure way to run containers
How to SSH into a running container
Run containers on Linux without sudo in Podman
What is a container image?
4 Linux technologies fundamental to containers
What are container runtimes?
A sysadmin's guide to containers

By Daniel J Walsh

The term "containers" is heavily overused. Also, depending on the context, it can
mean different things to different people.

Traditional Linux containers are really just ordinary processes on a Linux system. These groups
of processes are isolated from other groups of processes using resource constraints (control
groups [cgroups]), Linux security constraints (Unix permissions, capabilities, SELinux,
AppArmor, seccomp, etc.), and namespaces (PID, network, mount, etc.).

If you boot a modern Linux system and take a look at any process with cat
/proc/PID/cgroup, you see that the process is in a cgroup.

If you look at /proc/PID/status, you see capabilities. If you look at
/proc/self/attr/current, you see SELinux labels. If you look at /proc/PID/ns, you
see the list of namespaces the process is in. So, if you define a container as a process with
resource constraints, Linux security constraints, and namespaces, by definition every process
on a Linux system is in a container. This is why we often say Linux is containers, containers are
Linux. Container runtimes are tools that modify these resource constraints, security, and
namespaces and launch the container.
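
You can check these attributes for yourself on any Linux machine. For example (a sketch; the
exact output varies by distribution, kernel version, and whether SELinux is enabled):

$ cat /proc/self/cgroup            # resource constraints (cgroup membership)
0::/user.slice/user-1000.slice/session-2.scope
$ grep ^Cap /proc/self/status      # capability sets
CapInh: 0000000000000000
[...]
$ cat /proc/self/attr/current      # SELinux label, on SELinux systems
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ ls /proc/self/ns                 # namespaces the process belongs to
cgroup ipc mnt net pid pid_for_children time user uts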

Docker introduced the concept of a container image, which is a standard TAR file that
combines:

    • Rootfs (container root filesystem): A directory on the system that looks like the
      standard root (/) of the operating system. For example, a directory with /usr, /var,
      /home, etc.
    • JSON file (container configuration): Specifies how to run the rootfs; for example,
      what command or entrypoint to run in the rootfs when the container starts;
      environment variables to set for the container; the container's working directory;
      and a few other settings.
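
You can peek at this configuration for any image you have pulled. For example, with Podman (a
sketch; the format string and image name are illustrative):

$ podman image inspect --format '{{.Config.Cmd}}' docker.io/library/alpine
[/bin/sh]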

Docker "tar's up" the rootfs and the JSON file to create the base image. This enables you to
install additional content on the rootfs, create a new JSON file, and tar the difference
between the original image and the new image with the updated JSON file. This creates a
layered image.

The definition of a container image was eventually standardized by the Open Container
Initiative (OCI) standards body as the OCI Image Specification.

Tools used to create container images are called container image builders. Sometimes
container engines perform this task, but several standalone tools are available that can build
container images.

Docker took these container images (tarballs) and moved them to a web service from which
they could be pulled, developed a protocol to pull them, and called the web service a
container registry.

Container engines are programs that can pull container images from container registries
and reassemble them onto container storage. Container engines also launch container
runtimes (see below).




Linux container internals. Illustration by Scott McCarty. CC BY-SA 4.0

Container storage is usually a copy-on-write (COW) layered filesystem. When you pull down
a container image from a container registry, you first need to untar the rootfs and place it on
disk. If you have multiple layers that make up your image, each layer is downloaded and stored
on a different layer on the COW filesystem. The COW filesystem allows each layer to be
stored separately, which maximizes sharing for layered images. Container engines often
support multiple types of container storage, including overlay, devicemapper, btrfs,
aufs, and zfs.
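
For example, you can ask Podman which storage driver it uses and then view the stored layers
of a pulled image (a sketch; output varies by system):

$ podman info --format '{{.Store.GraphDriverName}}'
overlay
$ podman history docker.io/library/httpd    # prints one row per image layer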

After the container engine downloads the container image to container storage, it needs to
create a container runtime configuration. The runtime configuration combines input from
the caller/user along with the content of the container image specification. For example, the
caller might want to specify modifications to a running container's security, add additional
environment variables, or mount volumes to the container.

The layout of the container runtime configuration and the exploded rootfs have also been
standardized by the OCI standards body as the OCI Runtime Specification.
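
If you have a low-level runtime such as runc installed, you can generate a skeleton of this
runtime configuration yourself (a minimal sketch; the directory name is arbitrary, and the
ociVersion value depends on your runc release):

$ mkdir -p mycontainer/rootfs && cd mycontainer
$ runc spec        # writes a default config.json into the current directory
$ head -3 config.json
{
        "ociVersion": "1.0.2",
        "process": {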

Finally, the container engine launches a container runtime that reads the container runtime
specification; modifies the Linux cgroups, Linux security constraints, and namespaces; and
launches the container command to create the container's PID 1. At this point, the container
engine can relay stdin/stdout back to the caller and control the container (e.g., stop, start,
attach).

Note that many new container runtimes are being introduced to use different parts of Linux to
isolate containers. People can now run containers using KVM separation (think mini virtual
machines) or they can use other hypervisor strategies (like intercepting all system calls from
processes in containers). Since we have a standard runtime specification, these tools can all
be launched by the same container engines. Even Windows can use the OCI Runtime
Specification for launching Windows containers.

At a much higher level are container orchestrators. Container orchestrators are tools used
to coordinate the execution of containers on multiple different nodes. Container
orchestrators talk to container engines to manage containers. Orchestrators tell the container
engines to start containers and wire their networks together. Orchestrators can monitor the
containers and launch additional containers as the load increases.




3 steps to start running containers today

By Seth Kenlon

Whether you're interested in them as part of your job, for future job opportunities, or just out
of interest in new technology, containers can seem pretty overwhelming to even an
experienced systems administrator. So how do you actually get started with containers? And
what's the path from containers to Kubernetes? Also, why is there a path from one to the
other at all? As you might expect, the best place to start is the beginning.


1. Understanding containers
On second thought, starting at the beginning arguably dates back to early BSD and its
special chroot jails, so skip ahead to the middle instead.

Not so very long ago, the Linux kernel introduced cgroups and namespaces, which enable you
to "tag" processes and partition them off from the rest of the system. When you group
processes together into a namespace, those processes act as if nothing outside that
namespace exists. It's as if you've put those processes into a sort of container. Of course, the
container is virtual, and it exists inside your computer. It runs on the same kernel, RAM, and
CPU that the rest of your operating system is running on, but you've contained the processes.

Pre-made containers get distributed with just what's necessary to run the application they
contain. With a container engine, like Podman, Docker, or CRI-O, you can run a containerized
application without installing it in any traditional sense. Container engines are often cross-
platform, so even though containers run Linux, you can launch containers on Linux, macOS, or
Windows.

More importantly, you can run more than one container of the same application when there's
high demand for it.


Now that you know what a container is, the next step is to run one.


2. Run a container
Before running a container, you should have a reason for running a container. You can make up
a reason, but it helps if that reason genuinely interests you, so you're inspired to actually use
the container you run. After all, running a container but never using the application it provides
only proves that you're not noticing any failures, while using the container demonstrates that
it works.

I recommend WordPress as a start. It's a popular web application that's easy to use, so you
can test drive the app once you've got the container running. While you can easily set up a
WordPress container, there are many configuration options, which can lead you to discover
more container options (like running a database container) and how containers communicate.

I use Podman, which is a friendly, convenient, and daemonless container engine. If you don't
have Podman available, you can use the Docker command instead. Both are great open
source container engines, and their syntax is identical (just type docker instead of podman).
Because Podman doesn't run a daemon, it requires more setup than Docker, but the ability to
run rootless daemonless containers is worth it.

If you're going with Docker, you can skip down to the WordPress subheading. Otherwise, open
a terminal to install and configure Podman:

$ sudo dnf install podman


Containers spawn many processes, and each process inside the container needs its own user
ID. Normally, only the root user has permission to allocate thousands of user IDs. Add some
extra subordinate user IDs to your user by creating a file called /etc/subuid and defining a
suitably high start UID with a suitably large number of permitted subordinate IDs:

seth:200000:165536


Do the same for your group in a file called /etc/subgid. In this example, my primary group is
staff (it may be users for you, or the same as your username, depending on how you've
configured your system).

staff:200000:165536




Finally, confirm that your user is also permitted to manage thousands of namespaces:

$ sysctl --all --pattern user_namespaces
user.max_user_namespaces = 28633


If your user doesn't have permission to manage at least 28,000 namespaces, increase the
number by creating the file /etc/sysctl.d/userns.conf and entering:

user.max_user_namespaces=28633
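
Then load the new setting:

$ sudo sysctl -p /etc/sysctl.d/userns.conf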



Running WordPress as a container
Now, whether you're using Podman or Docker, you can pull a WordPress container from a
container registry online and run it. You can do all this with a single Podman command:

$ podman run --name mypress -p 8080:80 -d wordpress

Give Podman a few moments to find the container, copy it from the internet, and start it up.

Start a web browser once you get a terminal prompt back and navigate to localhost:8080.
WordPress is running, waiting for you to set it up.




                                    (Seth Kenlon, CC BY-SA 4.0)




It doesn't take long to reach your next hurdle, though. WordPress uses a database to keep
track of data, so you need to provide it with a database where it can store its information.

Before continuing, stop and remove the WordPress container:

$ podman stop mypress
$ podman rm mypress




3. Run containers in a pod
Containers are, by design and as their name suggests, self-contained. An application running
in a container isn't supposed to interact with applications or infrastructure outside of its
container. So when one container requires another container to function, one solution is to
put those two containers inside a bigger container called a pod. A pod ensures that its
containers can share important namespaces to communicate with one another.

Create a new pod, providing a name for the pod and which ports you want to be able to
access:

$ podman pod create --name wp_pod --publish 8080:80

Confirm that the pod exists:

$ podman pod list
POD ID        NAME     STATUS    INFRA ID      # OF CONTAINERS
100e138a29bd  wp_pod   Created   22ace92df3ef  1



Add a container to a pod
Now that you have a pod for your interdependent containers, you can launch each container
by specifying the pod for it to run in.

First, launch a database. You can make up your own credentials as long as you use those same
credentials when connecting to the database from WordPress.

$ podman run --detach \
--pod wp_pod \
--restart=always \
-e MYSQL_ROOT_PASSWORD="badpassword0" \
-e MYSQL_DATABASE="wp_db" \
-e MYSQL_USER="tux" \
-e MYSQL_PASSWORD="badpassword1" \
--name=wp_db mariadb



Next, launch the WordPress container into the same pod:

$ podman run --detach \
--restart=always --pod=wp_pod \
-e WORDPRESS_DB_NAME="wp_db" \
-e WORDPRESS_DB_USER="tux" \
-e WORDPRESS_DB_PASSWORD="badpassword1" \
-e WORDPRESS_DB_HOST="127.0.0.1" \
--name mypress wordpress


Now launch your favorite web browser and navigate to localhost:8080.

This time, the setup goes as expected. WordPress connects to the database because you've
passed those environment variables while launching the container.




                                   (Seth Kenlon, CC BY-SA 4.0)


After you've created a user account, you can log in to see the WordPress dashboard.




                                    (Seth Kenlon, CC BY-SA 4.0)



Next steps
You've created two containers, and you've run them in a pod. You know enough now to run
services in containers on your own server. If you want to move to the cloud, containers are, of
course, well-suited for that. With tools like Kubernetes and OpenShift, you can automate the
process of launching containers and pods on a cluster.




How I build my personal website using containers with a Makefile

By Chris Collins

The make utility and its related Makefile have been used to build software for a long time. The
Makefile defines a set of commands to run, and the make utility runs them. It is similar to a
Dockerfile or Containerfile—a set of commands used to build container images.

Together, a Makefile and Containerfile are an excellent way to manage a container-based
project. The Containerfile describes the contents of the container image, and the Makefile
describes how to manage the project itself: kicking off the image build, testing, and
deployment, among other helpful commands.


Make targets
The Makefile consists of "targets": one or more commands grouped under a single name.
You can run each target by running the make command followed by the target you want to
run. This command runs a target called image_build, defined in a Makefile:

$ make image_build


This is the beauty of the Makefile. You can build a collection of targets for each task that
needs to be performed manually. In the context of a container-based project, this includes
building the image, pushing it to a registry, testing the image, and even deploying the image
and updating the service running it. I use a Makefile for my personal website to do all these
tasks in an easy, automated way.




Build, test, deploy
I build my website using Hugo, a static website generator that builds static HTML from YAML
files. I use Hugo to build the HTML files for me, then build a container image with those files
and Caddy, a fast and simple web server, and run that image as a container. (Both Hugo and
Caddy are open source, Apache-licensed projects.) I use a Makefile to make building and
deploying that image to production much easier.

The first target in the Makefile is appropriately the image_build command:

image_build:
     podman build --format docker -f Containerfile -t $(IMAGE_REF):$(HASH) .


This target invokes Podman to build an image from the Containerfile included in the project.
There are some variables in the command above—what are they? Variables can be specified in
the Makefile, similarly to Bash or a programming language. I use them for a variety of things
within the Makefile, but the most useful is building the image reference to be pushed to
remote container image registries:

# Image values
REGISTRY := "us.gcr.io"
PROJECT := "my-project-name"
IMAGE := "some-image-name"
IMAGE_REF := $(REGISTRY)/$(PROJECT)/$(IMAGE)
# Git commit hash
HASH := $(shell git rev-parse --short HEAD)


Using these variables, the image_build target builds an image reference like
us.gcr.io/my-project-name/some-image-name:abc1234, using the short Git revision
hash as the image tag so that it can easily be tied to the code that built it.

The Makefile then tags that image as :latest. I don't generally use :latest for anything in
production, but further down in this Makefile, it will come in useful for cleanup:

image_tag:
     podman tag $(IMAGE_REF):$(HASH) $(IMAGE_REF):latest


So, now the image has been built and needs to be validated to make sure it meets some
minimum requirements. For my personal website, this is honestly just, "does the webserver
start and return something?" This could be accomplished with shell commands in the Makefile,
but it was easier for me to write a Python script that starts a container with Podman, issues an
HTTP request to the container, verifies it receives a reply, and then cleans up the container.
Python's "try, except, finally" exception handling is perfect for this and considerably easier
than replicating the same logic from shell commands in a Makefile:

#!/usr/bin/env python3
import time
import argparse
from subprocess import check_call, CalledProcessError
from urllib.request import urlopen, Request
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--image', action='store', required=True,
                    help='image name')
args = parser.parse_args()
print(args.image)
try:
     check_call("podman rm smk".split())
except CalledProcessError as err:
     pass
check_call(
     "podman run --rm --name=smk -p 8080:8080 -d {}".format(args.image).split()
)
time.sleep(5)
r = Request("http://localhost:8080", headers={'Host': 'chris.collins.is'})
try:
     print(str(urlopen(r).read()))
finally:
     check_call("podman kill smk".split())


This could be a more thorough test. For example, during the build process, the Git revision
hash could be built into the response, and the test could check that the response included the
expected hash. This would have the benefit of verifying that at least some of the expected
content is there.
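
As a sketch of what such a check could look like, assuming the hash were served at a
hypothetical /version path (neither the path nor the endpoint exists in my project as written):

$ curl -s -H 'Host: chris.collins.is' http://localhost:8080/version \
    | grep -q "$(git rev-parse --short HEAD)" && echo "expected hash found"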

If all goes well with the tests, then the image is ready to be deployed. I use Google's Cloud
Run service to host my website, and like any of the major cloud services, there is an excellent
command-line interface (CLI) tool that I can use to interact with the service. Since Cloud Run
is a container service, deployment consists of pushing the images built locally to a remote
container registry, and then kicking off a rollout of the service using the gcloud CLI tool.

You can do the push using Podman or Skopeo (or Docker, if you're using it).




My push target pushes the $(IMAGE_REF):$(HASH) image and also the :latest tag:

push:
        podman push --remove-signatures $(IMAGE_REF):$(HASH)
        podman push --remove-signatures $(IMAGE_REF):latest


After the image has been pushed, use the gcloud run deploy command to roll out the
newest image to the project and make the new image live. Once again, the Makefile comes in
handy here. I can specify the --platform and --region arguments as variables in the
Makefile so that I don't have to remember them each time. Let's be honest: I write so
infrequently for my personal blog, there is zero chance I would remember these variables if I
had to type them from memory each time I deployed a new image:

rollout:
     gcloud run deploy $(PROJECT) --image $(IMAGE_REF):$(HASH) \
     --platform $(PLATFORM) --region $(REGION)




More targets
There are additional helpful make targets. When writing new stuff or testing CSS or code
changes, I like to see what I'm working on locally without deploying it to a remote server. For
this, my Makefile has a run_local command, which spins up a container with the contents of
my current commit and opens my browser to the URL of the page hosted by the locally
running webserver:

.PHONY: run_local
run_local:
     podman stop mansmk ; podman rm mansmk ; podman run --name=mansmk --rm \
     -p $(HOST_ADDR):$(HOST_PORT):$(TARGET_PORT) -d $(IMAGE_REF):$(HASH) && \
     $(BROWSER) $(HOST_URL):$(HOST_PORT)

I also use a variable for the browser name, so I can test with several if I want to. By default, it
will open in Firefox when I run make run_local. If I want to test the same thing in Google
Chrome, I run make run_local BROWSER="google-chrome".

When working with containers and container images, cleaning up old containers and images is
an annoying chore, especially when you iterate frequently. I include targets in my Makefile for
handling these tasks, too. When cleaning up a container, if the container doesn't exist,
Podman or Docker will return with an exit code of 125. Unfortunately, make expects each
command to return 0 or it will stop processing, so I use a wrapper script to handle that case:



#!/usr/bin/env bash
ID="${@}"
podman stop ${ID} 2>/dev/null
RC=$?
if [[ $RC == 125 ]]
then
   # No such container; nothing to clean up
   exit 0
elif [[ $RC == 0 ]]
then
   podman rm ${ID} 2>/dev/null
else
   exit $RC
fi


Cleaning images requires a bit more logic, but it can all be done within the Makefile. To do this
easily, I add a label (via the Containerfile) to the image when it's being built. This makes it easy
to find all the images with these labels. The most recent of these images can be identified by
looking for the :latest tag. Finally, all of the images, except those pointing to the image
tagged with :latest, can be deleted:

clean_images:
     $(eval LATEST_IMAGES := $(shell podman images --filter "label=my-project.purpose=app-image" --no-trunc | awk '/latest/ {print $$3}'))
     podman images --filter "label=my-project.purpose=app-image" --no-trunc --quiet | grep -v $(LATEST_IMAGES) | xargs --no-run-if-empty --max-lines=1 podman image rm

This is the point where using a Makefile for managing container projects really comes
together into something cool. To this point, the Makefile includes commands for building and
tagging images, testing, pushing images, rolling out a new version, cleaning up a container,
cleaning up images, and running a local version. Running each of these with make
image_build && make image_tag && make test… etc. is considerably easier than
running each of the original commands, but it can be simplified even further.

A Makefile can group commands into a target, allowing multiple targets to run with a single
command. For example, my Makefile groups the image_build and image_tag targets
under the build target, so I can run both by simply using make build. Even better, these
targets can be further grouped into the default make target, all, allowing me to run all of
them in order by executing make all or more simply, make.

For my project, I want the default make action to include everything from building the image
to testing, deploying, and cleaning up, so I include the following targets:



.PHONY: all

all: build test deploy clean

.PHONY: build image_build image_tag

build: image_build image_tag

.PHONY: deploy push rollout

deploy: push rollout

.PHONY: clean clean_containers clean_images

clean: clean_containers clean_images


This does everything I've talked about in this article, except the make run_local target, in a
single command: make.


Conclusion
A Makefile is an excellent way to manage a container-based project. By combining all the
commands necessary to build, test, and deploy a project into make targets within the
Makefile, all the "meta" work—everything aside from writing the code—can be simplified and
automated. The Makefile can even be used for code-related tasks: running unit tests,
maintaining modules, compiling binaries and checksums. While it can't yet write code for you,
using a Makefile combined with the benefits of a containerized, cloud-based service can make
(wink, wink) managing many aspects of a project much easier.




Podman: A more secure way to run containers

By Daniel J Walsh

Before I get into the main topic of this article, Podman and containers, I need to get a little
technical about the Linux audit feature.


What is audit?
The Linux kernel has an interesting security feature called audit. It allows administrators to
watch for security events on a system and have them logged to the audit.log, which can be
stored locally or remotely on another machine to prevent a hacker from covering their
tracks.

The /etc/shadow file is a common security file to watch, since adding a record to it could
allow an attacker to gain unauthorized access to the system. Administrators want to know if any
process modified the file. You can do this by executing the command:

# auditctl -w /etc/shadow

Now let's see what happens if I modify the /etc/shadow file:

# touch /etc/shadow
# ausearch -f /etc/shadow -i -ts recent
type=PROCTITLE msg=audit(10/10/2018 09:46:03.042:4108) : proctitle=touch
/etc/shadow
type=SYSCALL msg=audit(10/10/2018 09:46:03.042:4108) : arch=x86_64 syscall=openat
success=yes exit=3 a0=0xffffff9c a1=0x7ffdb17f6704 a2=O_WRONLY|O_CREAT|O_NOCTTY|
O_NONBLOCK a3=0x1b6 items=2 ppid=2712 pid=3727 auid=dwalsh uid=root gid=root
euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts1 ses=3
comm=touch
exe=/usr/bin/touch subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
key=(null)




There's a lot of information in the audit record, but I highlighted that it recorded that root
modified the /etc/shadow file and that the audit UID (auid) of the process' owner was dwalsh.

How did the kernel know that?

Tracking the login UID
There is a field called loginuid, stored in /proc/self/loginuid, that is part of the proc struct
of every process on the system. This field can be set only once; after it is set, the kernel will
not allow any process to reset it.

When I log into the system, the login program sets the loginuid field for my login process.

My UID, dwalsh, is 3267.

$ cat /proc/self/loginuid
3267

Now, even if I become root, my login UID stays the same.

$ sudo cat /proc/self/loginuid
3267

Note that every process that's forked and executed from the initial login process
automatically inherits the loginuid. This is how the kernel knew that the person who logged
in was dwalsh.


Containers
Now let's look at containers.

$ sudo podman run fedora cat /proc/self/loginuid
3267

Even the container process retains my loginuid. Now let's try with Docker.

$ sudo docker run fedora cat /proc/self/loginuid
4294967295


Why the difference?
Podman uses a traditional fork/exec model for the container, so the container process is an
offspring of the Podman process. Docker uses a client/server model. The docker command I
executed is the Docker client tool, and it communicates with the Docker daemon via a
client/server operation. Then the Docker daemon creates the container and handles
communications of stdin/stdout back to the Docker client tool.

The default loginuid of processes (before their loginuid is set) is 4294967295. Since the
container is an offspring of the Docker daemon and the Docker daemon is a child of the init
system, we see that systemd, Docker daemon, and the container processes all have the same
loginuid, 4294967295, which audit refers to as the unset audit UID.

$ cat /proc/1/loginuid
4294967295


How can this be abused?
Let's look at what would happen if a container process launched by Docker modifies the
/etc/shadow file.

$ sudo docker run --privileged -v /:/host fedora touch /host/etc/shadow
$ sudo ausearch -f /etc/shadow -i
type=PROCTITLE msg=audit(10/10/2018 10:27:20.055:4569) :
proctitle=/usr/bin/coreutils
--coreutils-prog-shebang=touch /usr/bin/touch /host/etc/shadow
type=SYSCALL msg=audit(10/10/2018 10:27:20.055:4569) : arch=x86_64 syscall=openat
success=yes exit=3 a0=0xffffff9c a1=0x7ffdb6973f50 a2=O_WRONLY|O_CREAT|O_NOCTTY|
O_NONBLOCK a3=0x1b6 items=2 ppid=11863 pid=11882 auid=unset uid=root gid=root
euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none)
ses=unset
comm=touch exe=/usr/bin/coreutils subj=system_u:system_r:spc_t:s0 key=(null)

In the Docker case, the auid is unset (4294967295); this means the security officer might
know that a process modified the /etc/shadow file but the identity was lost.

If that attacker then removed the Docker container, there would be no trace on the system of
who modified the /etc/shadow file.

Now let's look at the exact same scenario with Podman.

$ sudo podman run --privileged -v /:/host fedora touch /host/etc/shadow
$ sudo ausearch -f /etc/shadow -i
type=PROCTITLE msg=audit(10/10/2018 10:23:41.659:4530) :
proctitle=/usr/bin/coreutils
--coreutils-prog-shebang=touch /usr/bin/touch /host/etc/shadow
type=SYSCALL msg=audit(10/10/2018 10:23:41.659:4530) : arch=x86_64 syscall=openat
success=yes exit=3 a0=0xffffff9c a1=0x7fffdffd0f34 a2=O_WRONLY|O_CREAT|O_NOCTTY|
O_NONBLOCK a3=0x1b6 items=2 ppid=11671 pid=11683 auid=dwalsh uid=root gid=root
euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=3
comm=touch
exe=/usr/bin/coreutils subj=unconfined_u:system_r:spc_t:s0 key=(null)

Everything is recorded correctly with Podman since it uses traditional fork/exec.

This was just a simple example of watching the /etc/shadow file, but the auditing system
is very powerful for watching what processes do on a system. Using a fork/exec container
runtime for launching containers (instead of a client/server container runtime) allows you to
maintain better security through audit logging.


Final thoughts
There are many other nice features about the fork/exec model versus the client/server model
when launching containers. For example, systemd features include:

    • SD_NOTIFY: If you put a Podman command into a systemd unit file, the container
      process can return notice up the stack through Podman that the service is ready to
      receive tasks. This is something that can't be done in client/server mode.
    • Socket activation: You can pass down connected sockets from systemd to Podman
      and onto the container process to use them. This is impossible in the client/server
      model.
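
As a sketch of the first point, a systemd unit along these lines lets Podman notify systemd
when the container is up (the service and container names are illustrative; --sdnotify
controls which process sends the readiness notification):

[Unit]
Description=Example containerized web service

[Service]
Type=notify
ExecStart=/usr/bin/podman run --rm --name example-svc --sdnotify=conmon -p 8080:80 docker.io/library/nginx
ExecStop=/usr/bin/podman stop example-svc

[Install]
WantedBy=multi-user.target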

The nicest feature, in my opinion, is running Podman and containers as a non-root user.
This means you never have to give a user root privileges on the host, while in the client/server
model (like Docker employs), you must open a socket to a privileged daemon running as root
to launch the containers. There you are at the mercy of the security mechanisms
implemented in the daemon versus the security mechanisms implemented in the host
operating systems—a dangerous proposition.




How to SSH into a running container

By Seth Kenlon

Containers have shifted the way we think about virtualization. You may remember the days (or
you may still be living them) when a virtual machine was the full stack, from virtualized BIOS,
operating system, and kernel up to each virtualized network interface controller (NIC). You
logged into the virtual box just as you would your own workstation. It was a very direct and
simple analogy.

And then containers came along, starting with LXC and culminating in the Open Container
Initiative (OCI), and that's when things got complicated.


Idempotency
In the world of containers, the "virtual machine" is only mostly virtual. Everything that doesn't
need to be virtualized is borrowed from the host machine. Furthermore, the container itself is
usually meant to be ephemeral and idempotent, so it stores no persistent data, and its state is
defined by configuration files on the host machine.

If you're used to the old ways of virtual machines, then you naturally expect to log into a
virtual machine in order to interact with it. But containers are ephemeral, so anything you do
in a container is forgotten, by design, should the container need to be restarted or respawned.

The commands controlling your container infrastructure (such as oc, crictl, lxc, and docker)
provide an interface to run important commands to restart services, view logs, confirm the
existence and permissions modes of an important file, and so on. You should use the tools
provided by your container infrastructure to interact with your application, or else edit
configuration files and relaunch. That's what containers are designed to do.

For instance, the open source forum software Discourse is officially distributed as a container
image. The Discourse software is stateless, so its installation is self-contained within
/var/discourse. As long as you have a backup of /var/discourse, you can always restore the
forum by relaunching the container. The container holds no persistent data, and its
configuration file is /var/discourse/containers/app.yml.

Were you to log into the container and edit any of the files it contains, all changes would be
lost if the container had to be restarted.

LXC containers you're building from scratch are more flexible, with configuration files (in a
location defined by you) passed to the container when you launch it.

A build system like Jenkins usually has a default configuration file, such as jenkins.yaml,
providing instructions for a base container image that exists only to build and run tests on
source code. After the builds are done, the container goes away.

Now that you know you don't need SSH to interact with your containers, here's an overview of
what tools are available (and some notes about using SSH in spite of all the fancy tools that
make it redundant).


OpenShift web console
OpenShift 4 offers an open source toolchain for container creation and maintenance,
including an interactive web console.




When you log into your web console, navigate to your project overview and click the
Applications tab for a list of pods. Select a (running) pod to open the application's Details
panel.



Click the Terminal tab at the top of the Details panel to open an interactive shell in your
container.




If you prefer a browser-based experience for Kubernetes management, you can learn more
through interactive lessons available at learn.openshift.com.


OpenShift oc
If you prefer a command-line interface experience, you can use the oc command to interact
with containers from the terminal.

First, get a list of running pods (or refer to the web console for a list of active pods). To get
that list, enter:

$ oc get pods


You can view the logs of a resource (a pod, build, or container). By default, oc logs returns the
logs from the first container in the pod you specify. To select a single container, add the
--container option:

$ oc logs --follow=true example-1-e1337 --container app


You can also view logs from all containers in a pod with:


$ oc logs --follow=true example-1-e1337 --all-containers



Execute commands
You can execute commands remotely with:

$ oc exec example-1-e1337 --container app hostname
example.local

This is similar to running SSH non-interactively: you get to run the command you want to run
without an interactive shell taking over your environment.

Remote shell
You can attach to a running container. This still does not open a shell in the container, but it
connects your terminal directly to the container's running process. For example:

$ oc attach example-1-e1337 --container app


If you need a true interactive shell in a container, you can open a remote shell with the oc rsh
command as long as the container includes a shell. By default, oc rsh launches /bin/sh:

$ oc rsh example-1-e1337 --container app




Kubernetes
If you're using Kubernetes directly, you can use the kubectl exec command to run a Bash
shell in your pod.

First, confirm that your pod is running:

$ kubectl get pods


As long as the pod containing your application is listed, you can use the exec command to
launch a shell in the container. Using the name example-pod as the pod name, enter:

$ kubectl exec --stdin=false --tty=false example-pod -- /bin/bash
root@example.local:/# ls
bin   core  etc   lib    root  srv
boot  dev   home  lib64  sbin  tmp  var




Docker and Podman
The docker and podman commands are similar to kubectl. The Podman project doesn't
require a daemon, while Docker requires dockerd. To get the name of a running container
(you may have to use sudo to escalate privileges if you're not in the appropriate group), use
either the podman or docker command (depending on which you've got installed):

$ podman ps
CONTAINER ID      IMAGE         COMMAND        NAME
678ac5cca78e       centos       "/bin/bash"     example-centos


Using the container name, you can run a command in the container:

$ docker exec example-centos cat /etc/os-release
CentOS Linux release 7.6
NAME="CentOS Linux"
VERSION="7"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
[...]


Or you can launch a Bash shell for an interactive session:

$ podman exec -it example-centos /bin/bash




Containers and appliances
The important thing to remember when dealing with the cloud is that containers
are essentially runtimes rather than virtual machines. While they have much in common with a
Linux system (because they are a Linux system!), they rarely translate directly to the
commands and workflow you may have developed on your Linux workstation. However, like
appliances, containers have an interface to help you develop, maintain, and monitor them, so
get familiar with the front-end commands and services until you're happily interacting with
them just as easily as you interact with virtual (or bare-metal) machines. Soon, you'll wonder
why everything isn't developed to be ephemeral.




Run containers on Linux without sudo in Podman

By Seth Kenlon

Containers are an important part of modern computing, and as the infrastructure around
containers evolves, new and better tools have started to surface. It used to be that you could
run containers with just LXC, and then Docker gained popularity, and things started getting
more complex. Eventually, we got the container management system we all deserved with
Podman, a daemonless container engine that makes containers and pods easy to build, run,
and manage.

Containers interface directly with Linux kernel abilities like cgroups and namespaces, and they
spawn lots of new processes within those namespaces. In short, running a container is literally
running a Linux system inside a Linux system. From the operating system's viewpoint, it looks
very much like an administrative and privileged activity. Normal users don't usually get to have
free rein over system resources the way containers demand, so by default, root or
sudo permissions are required to run Podman. However, that's only the default setting, and
it's by no means the only setting available or intended. This article demonstrates how to
configure your Linux system so that a normal user can run Podman without the use of sudo
("rootless").


Namespace user IDs
A kernel namespace is essentially an imaginary construct that helps Linux keep track of what
processes belong together. It's the red queue ropes of Linux. There's not actually a difference
between processes in one queue and another, but it's helpful to cordon them off from one
another. Keeping them separate is the key to declaring one group of processes a "container"
and the other group of processes your OS.



Linux tracks what user or group owns each process by User ID (UID) and Group ID (GID).
Normally, a user has access to a thousand or so subordinate UIDs to assign to child processes
in a namespace. Because Podman runs an entire subordinate operating system assigned to
the user who started the container, you need a lot more than the default allotment of subuids
and subgids.

You can grant a user more subuids and subgids with the usermod command. For example, to
grant more subuids and subgids to the user tux, choose a suitably high UID that has no user
assigned to it (such as 200,000) and increment it by several thousand:

$ sudo usermod \
--add-subuids 200000-265536 \
--add-subgids 200000-265536 \
tux
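
You can confirm the new ranges were recorded (the last field is the size of the range you
granted):

$ grep tux /etc/subuid /etc/subgid
/etc/subuid:tux:200000:65537
/etc/subgid:tux:200000:65537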




Namespace access
There are limits on namespaces, too. The limit is usually set very high, but you can verify your
user's allotment of namespaces with sysctl, the kernel parameter tool:

$ sysctl --all --pattern user_namespaces
user.max_user_namespaces = 28633


That's plenty of namespaces, and it's probably what your distribution has set by default. If
your distribution doesn't have that property or has it set very low, then you can create it by
entering this text into the file /etc/sysctl.d/userns.conf:

user.max_user_namespaces=28633


Load that setting:

$ sudo sysctl -p /etc/sysctl.d/userns.conf




Run a container without root
Once you've got your configuration set, reboot your computer to ensure that the changes to
your user and kernel parameters are loaded and active.

After you reboot, try running a container image:



$ podman run -it busybox echo "hello"
hello




Containers like commands
Containers may feel mysterious if you're new to them, but actually, they're no different than
your existing Linux system. They are literally processes running on your system, without the
cost or barrier of an emulated environment or virtual machine. All that separates a container
from your OS are kernel namespaces, so they're really just native processes with different
labels on them. Podman makes this more evident than ever, and once you configure Podman
to be a rootless command, containers feel more like commands than virtual environments.
Podman makes containers and pods easy, so give it a try.




What is a container image?

By Nived Velayudhan

Containers are a critical part of today's IT operations. A container image contains a packaged
application, along with its dependencies, and information on what processes it runs when
launched.

You create container images by providing a set of specially formatted instructions, either as
commits of a modified container or, more commonly, as a Dockerfile. For example, this
Dockerfile creates a container image for a PHP web application:

FROM registry.access.redhat.com/ubi8/ubi:8.1
RUN yum --disableplugin=subscription-manager -y module enable php:7.3 \
  && yum --disableplugin=subscription-manager -y install httpd php \
  && yum --disableplugin=subscription-manager clean all
ADD index.php /var/www/html
RUN sed -i 's/Listen 80/Listen 8080/' /etc/httpd/conf/httpd.conf \
  && sed -i 's/listen.acl_users = apache,nginx/listen.acl_users =/' /etc/php-
fpm.d/www.conf \
  && mkdir /run/php-fpm \
  && chgrp -R 0 /var/log/httpd /var/run/httpd /run/php-fpm \
  && chmod -R g=u /var/log/httpd /var/run/httpd /run/php-fpm
EXPOSE 8080
USER 1001
CMD php-fpm & httpd -D FOREGROUND


Each instruction in this file adds a layer to the container image. Each layer only adds the
difference from the layer below it, and then, all these layers are stacked together to form a
read-only container image.
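
For example, building the Dockerfile above and listing the image's history shows one entry
per instruction (a sketch; the tag name is arbitrary):

$ podman build -t php-app .
$ podman history php-app    # prints one row per layer/instruction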


How does that work?
You need to know a few things about container images, and it's important to understand the
concepts in this order:


    1. Union file systems
    2. Copy-on-Write
    3. Overlay File Systems
    4. Snapshotters


Union File Systems (Aufs)
A union file system (UnionFS) allows contents from one file system to be merged with the
contents of another, while keeping the "physical" content separate; Linux implements this
idea in the kernel with filesystems such as OverlayFS. The result is a unified file system, even
though the data is actually structured in branches.

The idea here is that if you have multiple images with some identical data, instead of having
this data copied over again, it's shared by using something called a layer.




                               Image CC BY-SA opensource.com

Each layer is a file system that can be shared across multiple containers. For example, the
httpd base layer is the official Apache image and can be used across any number of containers.
Imagine the disk space saved by using the same base layer for all our containers.

These image layers are always read-only, but when we create a new container from this image,
we add a thin writable layer on top of it. This writable layer is where you create/modify/delete
or make other changes required for each container.


Copy-on-write
When you start a container, it appears as if the container has an entire file system of its own.
That means every container you run in the system needs its own copy of the file system.
Wouldn't this take up a lot of disk space and also take a lot of time for the containers to boot?
No—because every container does not need its own copy of the filesystem!

Containers and images use a copy-on-write mechanism to achieve this. Instead of copying
files, the copy-on-write strategy shares the same instance of data to multiple processes and
copies only when a process needs to modify or write data. All other processes would continue
to use the original data. Before any write operation is performed in a running container, a copy
of the file to be modified is placed on the writeable layer of the container. This is where the
write takes place. Now you know why it's called copy-on-write.

This strategy optimizes both image disk space usage and the performance of container start
times and works in conjunction with UnionFS.
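
You can watch copy-on-write happen with podman diff, which lists what a container has added
(A) or changed (C) relative to its image (a sketch; the container name is illustrative):

$ podman run --name cow-demo docker.io/library/alpine touch /tmp/hello
$ podman diff cow-demo
C /tmp
A /tmp/hello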


Overlay File System
An overlay sits on top of an existing filesystem, combines an upper and lower directory tree,
and presents them as a single directory. These directories are called layers. The lower layer
remains unmodified. Each layer adds only the difference (the diff, in computing terminology)
from the layer below it, and this unification process is referred to as a union mount.

The lowest directory, or image layer, is called lowerdir, and the upper directory is called
upperdir. The final overlaid, unified layer is called merged.
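
You can try a union mount directly with the kernel's overlay filesystem (a minimal sketch; the
directory names are arbitrary):

$ mkdir lower upper work merged
$ echo "hello from the lower layer" > lower/file.txt
$ sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
$ cat merged/file.txt
hello from the lower layer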




Common terminology consists of these layer definitions:

    • Base layer is where the files of your filesystem are located. In terms of container
       images, this layer would be your base image.
    • Overlay layer is often called the container layer, as all the changes that are made to a
       running container, such as adding, deleting, or modifying files, are written to this
       writable layer. The changes themselves are stored in the Diff layer; the Overlay layer
       is a union view of the Base and Diff layers.
    • Diff layer contains all changes made in the Overlay layer. If you write something that's
       already in the Base layer, then the overlay file system copies the file to the Diff layer
       and makes the modifications you intended to write. This is called a copy-on-write.


Snapshotters
Containers can build, manage, and distribute changes as part of their container filesystem
using layers and graph drivers. But working with graph drivers is complicated and error-prone.
Snapshotters are different from graph drivers in that they have no knowledge of images or
containers.

Snapshotters work much like Git: they use the concept of trees and track changes to those
trees for each commit. A snapshot represents a filesystem state. Snapshots have
parent-child relationships using a set of directories. A diff can be taken between a parent and
its snapshot to create a layer.

The Snapshotter provides an API for allocating, snapshotting, and mounting abstract, layered
file systems.


Wrap up
You now have a good sense of what container images are and how their layered approach
makes containers portable. Next up, I'll cover container runtimes and internals.




4 Linux technologies fundamental to containers

By Nived Velayudhan

In previous articles, I have written about container images and runtimes. In this article, I look at
how containers are made possible by a foundation of some special Linux technologies,
including namespaces and control groups.




                               (Nived Velayudhan, CC BY-SA 4.0)




Linux technologies make up the foundations of building and running a container process on
your system. Technologies include:

    1. Namespaces
    2. Control groups (cgroups)
    3. Seccomp
    4. SELinux


Namespaces
Namespaces provide a layer of isolation for containers by giving each container a view of what
appears to be its own Linux system. This limits what a process can see and therefore
restricts the resources available to it.

There are several namespaces in the Linux kernel that are used by Docker or Podman and
others while creating a container:


$ docker container run alpine ping 8.8.8.8
$ sudo lsns -p 29413
        NS TYPE   NPROCS   PID USER COMMAND
4026531835 cgroup    299     1 root /usr/lib/systemd/systemd --switched...
4026533105 mnt         1 29413 root ping 8.8.8.8
4026533106 uts         1 29413 root ping 8.8.8.8
4026533105 ipc         1 29413 root ping 8.8.8.8
[...]



User
The user namespace isolates users and groups within a container. This is done by allowing
containers to have a different view of UID and GID ranges compared to the host system. The
user namespace enables the software to run inside the container as the root user. If an
intruder attacks the container and then escapes to the host machine, they're confined to only
a non-root identity.

Mnt
The mnt namespace allows the containers to have their own view of the system's file system
hierarchy. You can find the mount points for each container process in the
/proc/<PID>/mounts location in your Linux system.


UTS
The Unix Timesharing System (UTS) namespace allows containers to have a unique hostname
and domain name. When you run a container, a random ID is used as the hostname even when
using the --name flag. You can use the unshare command to get an idea of how this works.

$ docker container run -it --name nived alpine sh
# hostname
9c9a5edabdd6
#
$ sudo unshare -u sh
# hostname isolated.hostname
# hostname
isolated.hostname
# exit
$ hostname
homelab.redhat.com



IPC
The Inter-Process Communication (IPC) namespace allows different container processes to
communicate by accessing a shared range of memory or using a shared message queue.

# ipcmk -M 10M
Shared memory id: 0
# ipcmk -M 20M
Shared memory id: 1
# ipcs
------ Message Queues --------
key        msqid   owner   perms   used-bytes   messages

------ Shared Memory Segments --------
key        shmid   owner   perms   bytes        nattch   status
0xd1df416a 0       root    644     10485760     0
0xbd487a9d 1       root    644     20971520     0
[...]



PID
The Process ID (PID) namespace ensures that the processes running inside a container are
isolated from the external world. When you run a ps command inside a container, you only see
the processes running inside the container and not on the host machine because of this
namespace.




Net
The network namespace allows the container to have its own view of network interfaces, IP
addresses, routing tables, port numbers, and so on. How is a container able to communicate
with the external world? All containers you create get attached to a special virtual
network interface for communication.


Control groups (cgroups)
Cgroups are fundamental building blocks for making a container. A cgroup allocates and limits
resources such as CPU, memory, and network I/O that are used by containers. The container
engine automatically creates a cgroup filesystem of each type and sets values for each
container when the container is run.
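
For example, on a cgroup v2 system, Podman translates resource flags into cgroup settings
that you can read back from inside the container (a sketch; the container name is
illustrative):

$ podman run -d --name capped --memory=512m docker.io/library/nginx
$ podman exec capped cat /sys/fs/cgroup/memory.max
536870912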


Seccomp
Seccomp basically stands for secure computing. It is a Linux feature used to restrict the set of
system calls that an application is allowed to make. The default seccomp profile for Docker,
for example, disables around 44 syscalls (over 300 are available).

The idea here is to provide containers access to only those resources which the container
might need. For example, if you don't need the container to change the clock time on your
host machine, you probably have no use for the clock_adjtime and clock_settime syscalls, and
it makes sense to block them out. Similarly, you don't want the containers to load or unload
kernel modules, so there is no need for them to make the create_module or delete_module
syscalls.
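
A sketch of a small custom profile along those lines, in the JSON format Docker and Podman
accept (the file name is illustrative):

{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["clock_adjtime", "clock_settime", "create_module", "delete_module"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}

$ podman run --rm --security-opt seccomp=deny-clock.json docker.io/library/alpine date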


SELinux
SELinux stands for security-enhanced Linux. If you are running a Red Hat distribution on your
hosts, then SELinux is enabled by default. SELinux lets you limit an application to have access
only to its own files and prevent any other processes from accessing them. So, if an
application is compromised, it would limit the number of files that it can affect or control. It
does this by setting up contexts for files and processes and by defining policies that would
enforce what a process can see and make changes to.

SELinux policies for containers are defined by the container-selinux package. By
default, containers are run with the container_t label and are allowed to read (r) and execute
(x) under the /usr directory and read most content from the /etc directory. The label
container_var_lib_t is common for files relating to containers.
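
On an SELinux-enabled host, you can see these labels on running container processes (a
sketch; the category pair after s0 is randomized per container):

$ podman run -d --name selinux-demo docker.io/library/alpine sleep 100
$ ps -eZ | grep container_t
system_u:system_r:container_t:s0:c228,c518 12345 ? 00:00:00 sleep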


Wrap up
Containers are a critical part of today's IT infrastructure and a pretty interesting technology,
too. Even if your role doesn't involve containerization directly, understanding a few
fundamental container concepts and approaches gives you an appreciation for how they can
help your organization. The fact that containers are built on open source Linux technologies
makes them even better!




What are container runtimes?

By Nived Velayudhan

In my examination of container images, I discussed container fundamentals, but now it's time
to delve deeper into container runtimes so you can understand how container environments
are built. The information in this article is in part extracted from the official documentation of
the Open Container Initiative (OCI), the open standard for containers, so this information is
relevant regardless of your container engine.


Container runtimes
So what really happens in the backend when you run a command like podman run or docker
run? Here is a step-by-step overview for you:

    1. The image is pulled from an image registry if it is not available locally
    2. The image is extracted onto a copy-on-write filesystem, and all the container layers
       overlay each other to create a merged filesystem
    3. A container mount point is prepared
    4. Metadata is set from the container image, including settings like overriding CMD and
       ENTRYPOINT from user inputs, setting up seccomp rules, etc., to ensure the container
       runs as expected
    5. The kernel is alerted to assign some sort of isolation, such as process, networking, and
       filesystem, to this container (namespaces)
    6. The kernel is also alerted to assign some resource limits like CPU or memory limits to
       this container (cgroups)
    7. A system call (syscall) is passed to the kernel to start the container
    8. SELinux/AppArmor is set up

Container runtimes take care of all of the above. When we think about container runtimes, the
things that come to mind are probably runc, lxc, containerd, rkt, cri-o, and so on. Well, you are
not wrong. These are container engines and container runtimes, and each is built for different
situations.

Container runtimes focus more on running containers, setting up namespace and cgroups for
containers, and are also called lower-level container runtimes. Higher-level container runtimes
or container engines focus on formats, unpacking, management, and image-sharing. They
also provide APIs for developers.
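
You can see which lower-level runtime your engine delegates to. With Podman, for example (a
sketch; the runtime name and paths depend on your installation):

$ podman info | grep -A 3 ociRuntime
  ociRuntime:
    name: crun
    package: [...]
    path: /usr/bin/crun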


Open Container Initiative (OCI)
The Open Container Initiative (OCI) is a Linux Foundation project. Its purpose is to design
certain open standards or a structure around how to work with container runtimes and
container image formats. It was established in June 2015 by Docker, CoreOS, and other
industry leaders.

It does this using two specifications:

1. Image Specification (image-spec)
The goal of this specification is to enable the creation of interoperable tools for building,
transporting, and preparing a container image to run.

The high-level components of the spec include:

    • Image Manifest — a document describing the elements that make up a container image
    • Image Index — an annotated index of image manifests
    • Image Layout — a filesystem layout representing the contents of an image
    • Filesystem Layer — a changeset that describes a container’s filesystem
    • Image Configuration — a document determining layer ordering and configuration of the
       image suitable for translation into a runtime bundle
    • Conversion — a document explaining how this translation should occur
    • Descriptor — a reference that describes the type, metadata, and content address of
       referenced content

2. Runtime specification (runtime-spec)
This specification aims to define the configuration, execution environment, and lifecycle of a
container. The config.json file provides the container configuration for all supported platforms
and details the fields that enable the creation of a container. The execution environment is
detailed along with the common actions defined for a container's lifecycle to ensure that
applications running inside a container have a consistent environment between runtimes.

The Linux container specification uses various kernel features, including namespaces,
cgroups, capabilities, LSM, and filesystem jails to fulfill the spec.


Now you know
Container runtimes are managed by the OCI specifications to provide consistency and
interoperability. Many people use containers without the need to understand how they work,
but understanding containers is a valuable advantage when you need to troubleshoot or
optimize how you use them.



