Evan
Sarmiento
evms@cs.bu.edu
2001
Evan Sarmiento
The Jail Subsystem
On most UNIX systems, root has omnipotent power. This promotes
insecurity. If an attacker were to gain root on a system, he would
have every function at his fingertips. In FreeBSD there are
sysctls which dilute the power of root, in order to minimize the
damage caused by an attacker. One of these mechanisms is
called secure levels. Another, present from FreeBSD 4.0
onward, is a utility called
&man.jail.8;. Jail chroots an
environment and sets certain restrictions on processes which are
forked from within. For example, a jailed process cannot affect
processes outside of the jail, utilize certain system calls, or
inflict any damage on the main computer.
Jail is becoming the new security
model. People are running potentially vulnerable servers such as
Apache, BIND, and sendmail within jails, so that if an attacker
gains root within the Jail, it is only
an annoyance, and not a devastation. This article focuses on the
internals (source code) of Jail.
It will also suggest improvements upon the jail code base which
are already being worked on. If you are looking for a how-to on
setting up a Jail, I suggest you look
at my other article in Sys Admin Magazine, May 2001, entitled
"Securing FreeBSD using Jail."
Architecture
Jail consists of two realms: the
user-space program, jail, and the code implemented within the
kernel: the jail() system call and associated
restrictions. I will be discussing the user-space program and
then how jail is implemented within the kernel.
Userland code
The source for the user-land jail is located in
/usr/src/usr.sbin/jail, consisting of
one file, jail.c. The program takes these
arguments: the path of the jail, hostname, ip address, and the
command to be executed.
Data Structures
In jail.c, the first thing I would
note is the declaration of an important structure
struct jail j; which was included from
/usr/include/sys/jail.h.
The definition of the jail structure is:
/usr/include/sys/jail.h:
struct jail {
u_int32_t version;
char *path;
char *hostname;
u_int32_t ip_number;
};
As you can see, there is an entry for each of the
arguments passed to the jail program, and indeed, they are
set during its execution.
/usr/src/usr.sbin/jail/jail.c:
j.version = 0;
j.path = argv[1];
j.hostname = argv[2];
Networking
One of the arguments passed to the Jail program is an IP
address with which the jail can be accessed over the
network. Jail translates the IP address given into host
byte order and then stores it in j (the jail structure).
/usr/src/usr.sbin/jail/jail.c:
struct in_addr in;
...
i = inet_aton(argv[3], &in);
...
j.ip_number = ntohl(in.s_addr);
The inet_aton(3)
function "interprets the specified character string as an
Internet address, placing the address into the structure
provided." inet_aton leaves the address in the in structure
in network byte order; the ip_number member in the jail
structure is set only after ntohl() has translated it into
host byte order.
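The conversion described above can be tried outside of jail.c. The following is a minimal sketch, not code from the jail program: parse_jail_ip is a hypothetical helper name, and it only assumes the standard inet_aton() and ntohl() functions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of jail.c): parse a dotted-quad string
 * and store the address in host byte order in *out.
 * Returns 1 on success and 0 if the string is not a valid address,
 * following the return convention of inet_aton().
 */
static int
parse_jail_ip(const char *text, uint32_t *out)
{
	struct in_addr in;

	assert(text != NULL && out != NULL);
	if (inet_aton(text, &in) == 0)
		return (0);
	/* in.s_addr is in network byte order; ntohl() converts it. */
	*out = ntohl(in.s_addr);
	return (1);
}
```

Calling parse_jail_ip("192.168.0.10", &ip) stores 0xC0A8000A in ip on both little-endian and big-endian machines, because the value is in host byte order.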
Jailing The Process
Finally, the userland program jails the process and
executes the command specified. Jail now becomes an
imprisoned process itself and then executes the command
given using &man.execv.3;, which replaces the jail program's
image rather than forking a child.
/usr/src/usr.sbin/jail/jail.c:
i = jail(&j);
...
i = execv(argv[4], argv + 4);
As you can see, the jail function is being called, and
its argument is the jail structure which has been filled
with the arguments given to the program. Finally, the
program you specify is executed. I will now discuss how Jail
is implemented within the kernel.
Kernel Space
We will now be looking at the file
/usr/src/sys/kern/kern_jail.c. This is
the file where the jail system call, appropriate sysctls, and
networking functions are defined.
sysctls
In kern_jail.c, the following
sysctls are defined:
/usr/src/sys/kern/kern_jail.c:
int jail_set_hostname_allowed = 1;
SYSCTL_INT(_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW,
    &jail_set_hostname_allowed, 0,
    "Processes in jail can set their hostnames");
int jail_socket_unixiproute_only = 1;
SYSCTL_INT(_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW,
    &jail_socket_unixiproute_only, 0,
    "Processes in jail are limited to creating UNIX/IPv4/route sockets only");
int jail_sysvipc_allowed = 0;
SYSCTL_INT(_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW,
    &jail_sysvipc_allowed, 0,
    "Processes in jail can use System V IPC primitives");
Each of these sysctls can be accessed by the user
through the sysctl program. Throughout the kernel, these
specific sysctls are recognized by their name. For example,
the name of the first sysctl is
jail.set_hostname_allowed.
&man.jail.2; system call
Like all system calls, the &man.jail.2; system call takes
two arguments, struct proc *p and
struct jail_args
*uap. p is a pointer to a proc
structure which describes the calling process. In this
context, uap is a pointer to a structure which specifies the
arguments given to &man.jail.2; from the userland program
jail.c. When I described the userland
program before, you saw that the &man.jail.2; system call was
given a jail structure as its own argument.
/usr/src/sys/kern/kern_jail.c:
int
jail(p, uap)
struct proc *p;
struct jail_args /* {
syscallarg(struct jail *) jail;
} */ *uap;
Therefore, uap->jail would access the
jail structure which was passed to the system call. Next,
the system call copies the jail structure into kernel space
using the copyin()
function. copyin() takes three arguments:
the address of the data which is to be copied into kernel
space (uap->jail), where to store it
(j), and the size of the storage. The jail
structure uap->jail is copied into kernel
space and stored in another jail structure,
j.
/usr/src/sys/kern/kern_jail.c:
error = copyin(uap->jail, &j, sizeof j);
There is another important structure defined in
jail.h. It is the prison structure
(pr). The prison structure is used
exclusively within kernel space. The &man.jail.2; system call
copies everything from the jail structure onto the prison
structure. Here is the definition of the prison structure.
/usr/include/sys/jail.h:
struct prison {
int pr_ref;
char pr_host[MAXHOSTNAMELEN];
u_int32_t pr_ip;
void *pr_linux;
};
The jail() system call then allocates memory for a
pointer to a prison structure and copies data between the two
structures.
/usr/src/sys/kern/kern_jail.c:
MALLOC(pr, struct prison *, sizeof *pr , M_PRISON, M_WAITOK);
bzero((caddr_t)pr, sizeof *pr);
error = copyinstr(j.hostname, &pr->pr_host, sizeof pr->pr_host, 0);
if (error)
goto bail;
Finally, the jail system call chroots the path
specified. The chroot function is given two arguments. The
first is p, which represents the calling process; the second
is a pointer to a chroot_args structure. The chroot_args
structure contains the path which is to be chrooted. As
you can see, the path specified in the jail structure is
copied to the chroot_args structure and used.
/usr/src/sys/kern/kern_jail.c:
ca.path = j.path;
error = chroot(p, &ca);
These next three lines in the source are very important,
as they specify how the kernel recognizes a process as
jailed. Each process on a Unix system is described by its
own proc structure. You can see the whole proc structure in
/usr/include/sys/proc.h. For example,
the p argument in any system call is actually a pointer to
that process' proc structure, as stated before. The proc
structure contains members which describe the owner's
identity (p_cred), the process resource
limits (p_limit), and so on. In the
definition of the process structure, there is a pointer to a
prison structure (p_prison).
/usr/include/sys/proc.h:
struct proc {
...
struct prison *p_prison;
...
};
In kern_jail.c, the function then
stores pr, the pointer to the prison structure filled with
the information from the original jail structure, in
p->p_prison. It then does a
bitwise OR of p->p_flag with the constant
P_JAILED, meaning that the calling
process is now recognized as jailed. The parent process of
each process forked within the jail is the program jail
itself, as it calls the &man.jail.2; system call. When the
program is executed through execve, it inherits the
properties of its parent's proc structure; therefore it has
P_JAILED set in p->p_flag, and
p->p_prison is filled.
/usr/src/sys/kern/kern_jail.c:
p->p_prison = pr;
p->p_flag |= P_JAILED;
When a process is forked from a parent process, the
&man.fork.2; system call deals differently with imprisoned
processes. In the fork system call, there are two pointers
to a proc structure p1
and p2. p1 points to
the parent's proc structure and p2 points
to the child's unfilled proc
structure. After copying all relevant data between the
structures, &man.fork.2; checks whether
p2->p_prison is set. If it is, it
increments pr_ref by one and sets the
P_JAILED bit in the child's
p_flag.
/usr/src/sys/kern/kern_fork.c:
if (p2->p_prison) {
p2->p_prison->pr_ref++;
p2->p_flag |= P_JAILED;
}
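The bookkeeping described above can be modelled in user space. The sketch below is illustrative only: the two structures are cut down to the members discussed here, model_jail_attach and model_fork are hypothetical names, the numeric value of P_JAILED is assumed, and starting the reference count at 1 when the prison is attached is an assumption of this model rather than something shown in the excerpts.

```c
#include <assert.h>
#include <stddef.h>

#define P_JAILED 0x1000000	/* bit value assumed for this sketch */

/* Cut-down stand-ins for the kernel structures. */
struct prison {
	int pr_ref;		/* number of processes in the jail */
};

struct proc {
	int p_flag;
	struct prison *p_prison;
};

/* Model of the end of jail(): attach the prison and mark the caller. */
static void
model_jail_attach(struct proc *p, struct prison *pr)
{
	assert(p != NULL && pr != NULL);
	pr->pr_ref = 1;		/* assumption: the caller is the first reference */
	p->p_prison = pr;
	p->p_flag |= P_JAILED;
}

/* Model of the jail-related part of fork(): p1 is the parent, p2 the child. */
static void
model_fork(const struct proc *p1, struct proc *p2)
{
	assert(p1 != NULL && p2 != NULL);
	*p2 = *p1;		/* the child starts as a copy of the parent */
	if (p2->p_prison) {
		p2->p_prison->pr_ref++;
		p2->p_flag |= P_JAILED;
	}
}
```

After one model_jail_attach() followed by one model_fork(), the shared prison has a reference count of 2 and both processes carry P_JAILED, while a fork of an unjailed process leaves p_prison NULL and the flag clear.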
Restrictions
Throughout the kernel there are access restrictions relating
to jailed processes. Usually, these restrictions only check
whether the process is jailed and, if so, return an error. For
example:
if (p->p_prison)
return EPERM;
SysV IPC
System V IPC is based on messages. Processes can send each
other these messages which tell them how to act. The functions
which deal with messages are: msgsys,
msgctl, msgget,
msgsnd and msgrcv.
Earlier, I mentioned that there were certain sysctls you could
turn on or off in order to affect the behavior of Jail. One of
these sysctls was jail_sysvipc_allowed. On
most systems, this sysctl is set to 0. If it were set to 1, it
would defeat the whole purpose of having a jail; privileged
users from within the jail would be able to affect processes
outside of the environment. The difference between a message
and a signal is that a signal consists only of the signal
number, while a message can carry data.
/usr/src/sys/kern/sysv_msg.c:
&man.msgget.3;: msgget returns (and possibly
creates) a message descriptor that designates a message queue
for use in other system calls.
&man.msgctl.3;: Using this function, a process
can query the status of a message
descriptor.
&man.msgsnd.3;: msgsnd sends a message to a
process.
&man.msgrcv.3;: a process receives messages using
this function
In each of these system calls, there is this
conditional:
/usr/src/sys/kern/sysv_msg.c:
if (!jail_sysvipc_allowed && p->p_prison != NULL)
return (ENOSYS);
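To see how the sysctl and the prison pointer interact in this conditional, here is a small user-space model. It is a sketch under stated assumptions: the structures are reduced stand-ins, model_sysvipc_gate is a hypothetical name, and the global variable only imitates the kernel variable behind the sysctl.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct prison {
	int pr_ref;
};

struct proc {
	struct prison *p_prison;	/* NULL when the process is not jailed */
};

/* Imitates the kernel variable exported through the sysctl. */
static int jail_sysvipc_allowed = 0;

/* Returns 0 if the call may proceed, ENOSYS if it is refused. */
static int
model_sysvipc_gate(const struct proc *p)
{
	assert(p != NULL);
	if (!jail_sysvipc_allowed && p->p_prison != NULL)
		return (ENOSYS);
	return (0);
}
```

With the variable at 0, only jailed processes receive ENOSYS; setting it to 1 lets jailed processes through as well, which is why enabling it weakens the jail.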
Semaphore system calls allow processes to synchronize
execution by doing a set of operations atomically on a set of
semaphores. Basically semaphores provide another way for
processes to lock resources. However, a process waiting on a
semaphore that is in use will sleep until the resources
are relinquished. The following semaphore system calls are
blocked inside a jail: semsys,
semget, semctl and
semop.
/usr/src/sys/kern/sysv_sem.c:
&man.semctl.2;(id, num, cmd, arg):
Semctl does the specified cmd on the semaphore queue
indicated by id.
&man.semget.2;(key, nsems, flag):
Semget creates an array of semaphores, corresponding to
key.
Key and flag take on the same meaning as they
do in msgget.
&man.semop.2;(id, ops, num):
Semop does the set of semaphore operations in the array of
structures ops, to the set of semaphores identified by
id.
System V IPC allows for processes to share
memory. Processes can communicate directly with each other by
sharing parts of their virtual address space and then reading
and writing data stored in the shared memory. These system
calls are blocked within a jailed environment: shmdt,
shmat, oshmctl, shmctl, shmget, and
shmsys.
/usr/src/sys/kern/sysv_shm.c:
&man.shmctl.2;(id, cmd, buf):
shmctl does various control operations on the shared memory
region identified by id.
&man.shmget.2;(key, size,
flag): shmget accesses or creates a shared memory
region of size bytes.
&man.shmat.2;(id, addr, flag):
shmat attaches a shared memory region identified by id to the
address space of a process.
&man.shmdt.2;(addr): shmdt
detaches the shared memory region previously attached at
addr.
Sockets
Jail treats the &man.socket.2; system call and related
lower-level socket functions in a special manner. In order to
determine whether a certain socket is allowed to be created,
it first checks to see if the sysctl
jail.socket_unixiproute_only is set. If
set, sockets are only allowed to be created if the family
specified is either PF_LOCAL,
PF_INET or
PF_ROUTE. Otherwise, it returns an
error.
/usr/src/sys/kern/uipc_socket.c:
int socreate(dom, aso, type, proto, p)
...
register struct protosw *prp;
...
{
if (p->p_prison && jail_socket_unixiproute_only &&
    prp->pr_domain->dom_family != PF_LOCAL &&
    prp->pr_domain->dom_family != PF_INET &&
    prp->pr_domain->dom_family != PF_ROUTE)
	return (EPROTONOSUPPORT);
...
}
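The family check can be exercised on its own with the following sketch. It is a user-space model, not kernel code: model_socreate_check is a hypothetical name, the structures are reduced stand-ins, and the protocol family is passed in directly instead of being looked up through a protosw entry.

```c
#include <sys/socket.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures. */
struct prison {
	int pr_ref;
};

struct proc {
	struct prison *p_prison;	/* NULL when the process is not jailed */
};

/* Imitates the kernel variable exported through the sysctl. */
static int jail_socket_unixiproute_only = 1;

/* Returns 0 if the socket may be created, EPROTONOSUPPORT otherwise. */
static int
model_socreate_check(const struct proc *p, int family)
{
	assert(p != NULL);
	if (p->p_prison && jail_socket_unixiproute_only &&
	    family != PF_LOCAL && family != PF_INET && family != PF_ROUTE)
		return (EPROTONOSUPPORT);
	return (0);
}
```

A jailed process asking for PF_INET6 is refused while the variable is 1; an unjailed process, or any process once the variable is 0, is not restricted by this check.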
Berkeley Packet Filter
The Berkeley Packet Filter provides a raw interface to
data link layers in a protocol independent fashion. The
function bpfopen() opens an Ethernet
device. There is a conditional which disallows any jailed
processes from accessing this function.
/usr/src/sys/net/bpf.c:
static int bpfopen(dev, flags, fmt, p)
...
{
if (p->p_prison)
return (EPERM);
...
}
Protocols
There are certain protocols which are very common, such as
TCP, UDP, IP and ICMP. IP and ICMP are on the same level: the
network layer 2. Certain precautions are taken to prevent a
jailed process from binding a protocol to an address that
does not belong to its jail; they apply only if the nam
parameter is set. nam is a pointer to a sockaddr structure,
which describes the address on which to bind the service. A
more exact definition is that sockaddr "may be used as a
template for referring to the identifying tag and length of
each address"[2]. In the function
in_pcbbind, sin is a
pointer to a sockaddr_in structure, which contains the port,
address, length and domain family of the socket which is to be
bound. Basically, this prevents any process inside a jail from
specifying an address that does not belong to the jail.
/usr/src/sys/netinet/in_pcb.c:
int in_pcbbind(inp, nam, p)
...
struct sockaddr *nam;
struct proc *p;
{
...
struct sockaddr_in *sin;
...
if (nam) {
sin = (struct sockaddr_in *)nam;
...
if (sin->sin_addr.s_addr != INADDR_ANY)
if (prison_ip(p, 0, &sin->sin_addr.s_addr))
return (EINVAL);
....
}
...
}
You might be wondering what the function
prison_ip() does. prison_ip() is given three
arguments: the current process (represented by
p), a flag, and a pointer to an IP address. It
returns 1 if the process is jailed and the IP address does
not belong to its jail, and 0 otherwise. As you can see from
the code, if the address does not belong to the jail, the
protocol is not allowed to bind to it.
/usr/src/sys/kern/kern_jail.c:
int prison_ip(struct proc *p, int flag, u_int32_t *ip) {
u_int32_t tmp;
if (!p->p_prison)
return (0);
if (flag)
tmp = *ip;
else tmp = ntohl (*ip);
if (tmp == INADDR_ANY) {
if (flag)
*ip = p->p_prison->pr_ip;
else *ip = htonl(p->p_prison->pr_ip);
return (0);
}
if (p->p_prison->pr_ip != tmp)
return (1);
return (0);
}
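The return values of this function are easy to misread, so the following user-space model may help. It follows the listing above line for line, but it is only a sketch: model_prison_ip is a hypothetical name and the structures are reduced to the members the function touches.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the kernel structures. */
struct prison {
	uint32_t pr_ip;			/* jail address, host byte order */
};

struct proc {
	struct prison *p_prison;	/* NULL when the process is not jailed */
};

/*
 * flag != 0: *ip is in host byte order; flag == 0: network byte order.
 * Returns 1 only for a jailed process using a foreign address.
 */
static int
model_prison_ip(const struct proc *p, int flag, uint32_t *ip)
{
	uint32_t tmp;

	assert(p != NULL && ip != NULL);
	if (!p->p_prison)
		return (0);
	tmp = flag ? *ip : ntohl(*ip);
	if (tmp == INADDR_ANY) {
		/* A wildcard address is rewritten to the jail's own address. */
		*ip = flag ? p->p_prison->pr_ip : htonl(p->p_prison->pr_ip);
		return (0);
	}
	if (p->p_prison->pr_ip != tmp)
		return (1);
	return (0);
}
```

In this model an unjailed process always gets 0, a jailed process gets 0 for its own address, and INADDR_ANY is silently replaced by the jail's address instead of being rejected.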
Jailed users are not allowed to bind services to an ip
which does not belong to the jail. The restriction is also
written within the function in_pcbbind:
/usr/src/sys/netinet/in_pcb.c:
if (nam) {
...
lport = sin->sin_port;
... if (lport) {
...
if (p && p->p_prison)
prison = 1;
if (prison &&
prison_ip(p, 0, &sin->sin_addr.s_addr))
return (EADDRNOTAVAIL);
Filesystem
Even root users within the jail are not allowed to set any
file flags, such as immutable, append, and no unlink flags, if
the securelevel is greater than 0.
/usr/src/sys/ufs/ufs/ufs_vnops.c:
int ufs_setattr(ap)
...
{
if ((cred->cr_uid == 0) && (p->p_prison == NULL)) {
if ((ip->i_flags
& (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) &&
securelevel > 0)
return (EPERM);
}