Containers and PIDs

The idea is simple. We have an application with many dependencies and decide to move it into a container to make it easier to maintain. So we write a Dockerfile that installs the dependencies and the application. When we run the resulting image, the application starts as an isolated process.

While experimenting with OpenWRT I found that procd, when started inside a container, never receives netlink messages. How can that be?

Namespaces

Namespaces are a Linux kernel feature that containers use for isolation. A namespace is an encapsulated area in which the same names and numbers can be used again, independently of other namespaces. There are namespaces for process IDs, networks, mount points, and more. In this article we are interested in the PID namespace, which gives the processes inside it their own set of process IDs.

For more information about Linux namespaces see the following article:
https://lwn.net/Articles/531114/
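
To make the idea of a namespace more tangible, here is a minimal sketch that reads the identifier of the PID namespace the calling process lives in. It assumes a Linux system with /proc mounted; the helper name pid_namespace_id is made up for this illustration and is not part of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (name invented for this example).
 * /proc/self/ns/pid is a symbolic link whose target names the PID
 * namespace of the caller in the form "pid:[<inode number>]".
 * Copies that target into buf and returns 0, or returns -1 on failure.
 */
static int pid_namespace_id(char *buf, size_t size)
{
    ssize_t n;

    assert(buf != NULL && size > 0);

    /* readlink() does not add a terminating NUL, so reserve one byte. */
    n = readlink("/proc/self/ns/pid", buf, size - 1);
    if (n < 0)
        return -1;

    buf[n] = '\0';
    return 0;
}
```

Calling this helper once on the host and once inside a container that has its own PID namespace should yield two different identifiers, whereas two processes that share a PID namespace report the same one.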

The problem

We take the following code snippet:

struct sockaddr_nl nls = {};
int nlbufsize = 512 * 1024;

rule_file = strdup(rules);
nls.nl_family = AF_NETLINK;
nls.nl_pid = getpid();
nls.nl_groups = -1;

if ((hotplug_fd.fd = socket(PF_NETLINK, SOCK_DGRAM | SOCK_CLOEXEC, NETLINK_KOBJECT_UEVENT)) == -1) {
    ERROR("Failed to open hotplug socket: %m\n");
    exit(1);
}
if (bind(hotplug_fd.fd, (void *)&nls, sizeof(struct sockaddr_nl))) {
    ERROR("Failed to bind hotplug socket: %m\n");
    exit(1);
}

if (setsockopt(hotplug_fd.fd, SOL_SOCKET, SO_RCVBUFFORCE, &nlbufsize, sizeof(nlbufsize)))
    ERROR("Failed to resize receive buffer: %m\n");

If our program runs inside a container and calls getpid(), it might get, for example, PID 12, while the host system sees the same process as, say, PID 1234. The Linux kernel isn't aware that the value in nl_pid was taken from inside a PID namespace and will try to send the netlink message to PID 12. Therefore, our application never receives these messages. Unfortunately, there is no way to get the "real" PID from inside a container. This makes sense, because we don't really want an application to be aware that it runs inside a container.

The solution to this issue depends on the use case. In this example it is easy: we set nl_pid to 0 before calling bind(). The kernel then assigns an address to the socket automatically. As netlink(7) points out, nl_pid identifies a netlink socket, not a process, so it does not have to match the PID of the program.

nls.nl_pid = 0;
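
The fix can be sketched as a small standalone helper. This is an illustration under assumptions, not the procd code: the function name open_netlink_autobind and its parameters are made up for this example. It binds a netlink socket with nl_pid set to 0 and then uses getsockname() to read back the address the kernel picked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and signature invented for this example).
 * Opens a netlink socket for the given protocol, binds it with
 * nl_pid = 0 so that the kernel chooses the socket address, and stores
 * the chosen address in *port_id.
 * Returns the socket descriptor, or -1 on failure.
 */
static int open_netlink_autobind(int protocol, unsigned int groups,
                                 unsigned int *port_id)
{
    struct sockaddr_nl nls;
    socklen_t len = sizeof(nls);
    int fd;

    assert(port_id != NULL);

    memset(&nls, 0, sizeof(nls));
    nls.nl_family = AF_NETLINK;
    nls.nl_pid = 0;          /* 0 = let the kernel assign the address */
    nls.nl_groups = groups;  /* multicast groups to subscribe to */

    fd = socket(AF_NETLINK, SOCK_DGRAM | SOCK_CLOEXEC, protocol);
    if (fd == -1)
        return -1;

    if (bind(fd, (struct sockaddr *)&nls, sizeof(nls)) == -1) {
        close(fd);
        return -1;
    }

    /* Ask the kernel which address the socket actually received. */
    if (getsockname(fd, (struct sockaddr *)&nls, &len) == -1) {
        close(fd);
        return -1;
    }

    *port_id = nls.nl_pid;
    return fd;
}
```

To mirror the snippet above, the helper would be called with NETLINK_KOBJECT_UEVENT as the protocol and the same group mask. Because the kernel picks the address, the program never has to pass its own PID at all, which keeps the code independent of the PID namespace it runs in.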

Conclusion

When we work with containers, we need to be aware that an application inside a container runs in its own namespaces and therefore has a different view of the system than, for example, the kernel, another container, or the host system. We need to avoid hard-coding absolute values for anything that depends on a namespace.

An example of such a fix is the following patch: https://git.openwrt.org/?p=project/procd.git;a=commit;h=0aee1c30c046219b8cbd17492f1cb60c68d7e4f3