Matching Linux Processes Through ptrace
28/Jun/2017
When I recently presented my work on Varan, more than one person asked me
a variant of the following question:
I’ve been using ptrace
to force the exact same execution on two processes by
ensuring both read the same data from their system calls. Still, at some point,
they start behaving differently (they diverge). Why?
ptrace
is a feature of the Linux kernel. Quoting its manual page:
The ptrace() system call provides a means by which one process (the “tracer”)
may observe and control the execution of another process (the “tracee”), and
examine and change the tracee’s memory and registers. It is primarily used to
implement breakpoint debugging and system call tracing.
When the tracee issues a system call, the tracer gets notified with the number
of the system call and its arguments. That alone is quite useful. For
instance, strace
uses ptrace to dump the sequence of system calls that a
process issues.
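To make this concrete, here is a minimal sketch of an x86-64 tracer in that spirit (my own illustration, not strace's actual code): it forks a child, lets it exec /bin/true, and prints the number of every system call the child issues. Each system call produces two stops (entry and exit), so each number is printed twice.
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t child = fork();
  if (child == 0) {
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);      /* ask to be traced by the parent */
    execl("/bin/true", "true", (char *)NULL);   /* the tracee program */
    return 1;
  }
  int status;
  waitpid(child, &status, 0);                   /* child stops when it reaches execve */
  while (!WIFEXITED(status)) {
    ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* resume until the next syscall stop */
    waitpid(child, &status, 0);
    if (WIFSTOPPED(status)) {
      struct user_regs_struct regs;
      ptrace(PTRACE_GETREGS, child, NULL, &regs);
      printf("syscall %llu\n", (unsigned long long)regs.orig_rax);  /* syscall number on x86-64 */
    }
  }
  return 0;
}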
Besides observing the tracee, ptrace
also allows the tracer to access and modify the tracee's memory and
registers. As a result, the tracer can emulate the tracee's system calls
without actually executing them.
For instance, let us consider that we have two processes, P1 and P2, running
the same program. We can use ptrace
to attach to both processes and copy the
results of all the system calls from P1 to P2. Given that the kernel
mediates all inputs that a program ever receives, this way we ensure that the
two processes get the same inputs.
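As a rough sketch of that idea (assuming x86-64, where a system call returns its result in rax), a tracer sitting at P2's syscall-exit stop could overwrite the return value with the one recorded from P1. The function name inject_syscall_result is hypothetical; where the recorded value comes from is left out.
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* At a syscall-exit stop of `tracee`, replace the system call's return value
   (rax on x86-64) with `recorded`, e.g. the result observed earlier in P1. */
void inject_syscall_result(pid_t tracee, long long recorded) {
  struct user_regs_struct regs;
  ptrace(PTRACE_GETREGS, tracee, NULL, &regs);
  regs.rax = (unsigned long long)recorded;
  ptrace(PTRACE_SETREGS, tracee, NULL, &regs);
}
Results returned through memory, such as the struct timeval filled in by gettimeofday, would additionally have to be copied into the tracee's address space, for instance with PTRACE_POKEDATA.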
Let us consider now a simple example. The following function generates a random
number by XOR-ing the current time and the process PID.
#include <sys/time.h>   /* gettimeofday */
#include <unistd.h>     /* getpid */

#define MORE_RANDOMNESS(X) ((X) = 0)

unsigned long random() {
  unsigned long ret;
  MORE_RANDOMNESS(ret);
  struct timeval t;
  gettimeofday(&t, NULL);            /* system call #1: current time */
  ret ^= t.tv_sec ^ t.tv_usec;
  int pid = getpid();                /* system call #2: process ID */
  ret ^= (unsigned long)pid;
  return ret;
}
This function issues two system calls: gettimeofday and getpid. Using ptrace
as described above, we can ensure that both processes obtain the same time and
the same PID. Therefore, variables t, pid, and ret should match between
processes P1 and P2. Right?
Wrong.
In this post, I explain the main reasons why, based on my experience.
Virtual System Calls
If you run the program that I show above under strace
on an x86 machine, you
will notice the absence of system call gettimeofday
from the trace. Why?
The gettimeofday system call on x86 is a virtual system call. The Linux kernel
uses virtual system calls to accelerate common system calls that can be served
outside of the kernel. The idea is that such “system calls” execute as a simple
function call, rather than an expensive context switch to kernel mode. The
Linux kernel exposes a virtual dynamic shared object (vDSO) as an ELF file
mapped in every process and discoverable through the auxiliary
vector. Then, at runtime, the C standard
library (libc) scans the auxiliary vector, finds the vDSO, and calls those
functions instead of issuing system calls. Of course, as programs do not issue
system calls directly and use libc instead, this is all completely transparent
to the program using such virtual system calls.
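For instance, a process can locate its own vDSO through the auxiliary vector. A small sketch using glibc's getauxval (available since glibc 2.16):
#include <stdio.h>
#include <sys/auxv.h>

int main(void) {
  /* AT_SYSINFO_EHDR holds the address of the vDSO's ELF header;
     the same mapping shows up as [vdso] in /proc/self/maps. */
  unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
  printf("vDSO mapped at %#lx\n", vdso);
  return 0;
}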
The speed difference between virtual system calls and regular system calls is
quite noticeable. This
program
issues 1,000,000 gettimeofday
system calls, first through libc (and thus using
virtual system calls) and then manually through hand-written assembly. Here’s
the output on my machine:
> gcc -O0 example.c
> time ./a.out    # system call
./a.out  0.04s user 0.08s system 97% cpu 0.123 total
> time ./a.out 0  # virtual system call
./a.out 0  0.03s user 0.00s system 91% cpu 0.033 total
That’s a 3x difference!
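The linked program is not reproduced here, but a sketch of the same comparison could use syscall(2) instead of hand-written assembly to force real kernel entries. Matching the invocations above, it issues real system calls when run with no argument and goes through libc (and thus the vDSO) when given an argument:
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv) {
  struct timeval t;
  for (int i = 0; i < 1000000; i++) {
    if (argc > 1)
      gettimeofday(&t, NULL);               /* libc wrapper, served by the vDSO */
    else
      syscall(SYS_gettimeofday, &t, NULL);  /* bypasses the vDSO: real system call */
  }
  return 0;
}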
Getting back to why ptrace fails, in this case there is no system call for
ptrace to intercept. As a result, P1 and P2 get different values for t,
thus generating different values for ret.
glibc getpid
glibc is the GNU implementation of the C
standard library (libc), and the de facto libc used on GNU/Linux. Until
recently, glibc optimized the getpid system call by caching the PID during
initialization (before the program’s main
function is called). Then, function
getpid
simply reads the cached PID without issuing any system call.
The program I show above uses getpid
to generate the random number. In our
scenario, if we wait until P1 and P2 reach their respective main
function
to start matching their system calls through ptrace
, we miss the caching of
the PID. Therefore, each process gets a different value for variable pid.
There’s yet another surprise in how glibc caches the PID. Instead of issuing
the expected getpid system call, glibc uses the rather unlikely system call
set_tid_address for that. So, when using strace to detect the point in the
program execution where the PID is read and cached, look out for the
set_tid_address system call.
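A small sketch of the difference: with a caching glibc, the getpid() wrapper may return the cached PID without entering the kernel, while going through syscall(2) always issues the real system call and is therefore visible to strace or a ptrace-based tracer.
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  printf("libc getpid(): %d\n", getpid());              /* possibly served from the cache */
  printf("SYS_getpid:    %ld\n", syscall(SYS_getpid));  /* always an actual system call */
  return 0;
}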
This approach of caching the PID has attracted some criticism. Linus Torvalds
himself uses colorful language to describe how he feels about this
optimization. Recently,
glibc removed this
optimization,
which is no longer present as of version 2.25 (released on 5/Feb/2017).
Non-determinism
The rationale behind synchronizing two processes at the level of system calls
assumes that the programs that the processes execute are deterministic. This is
a reasonable assumption, given that programs typically gather any non-determinism
they use from the operating system through system calls (e.g., by reading the file
/dev/random, or by getting the date through the gettimeofday system call). As a
result, the programs themselves are deterministic modulo the inputs from the
operating system.
However, and unfortunately, programs can exhibit non-determinism without
interacting with the operating system. This is a basic limitation not only of
ptrace
, but of any other method that matches system calls between processes.
Let’s look at a few examples of such non-determinism that I found in practice.
rdtsc Instruction
The x86 instruction rdtsc
reads the value of the timestamp counter, which is a
64-bit register that increases monotonically at every clock cycle since the
processor was reset. It was introduced with the Pentium processor and it was a
great way to get accurate performance timing for programs. However, it is no
longer reliable for that purpose, due to out-of-order execution, several
logical cores on the same physical CPU, context switching between threads and
processes, etc.
Still, some existing code uses this instruction to generate low-quality random
data and monotonically increasing timestamps very fast. For instance, going
back to our example, suppose that macro MORE_RANDOMNESS
is defined as:
#define MORE_RANDOMNESS(X) ((X) = rdtsc())  /* rdtsc() reads the time-stamp counter; see the sketch below */
In this case, the value of variable ret
will never be the same between
processes P1 and P2.
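The article does not define rdtsc(); one plausible definition (an assumption on my part) is a thin wrapper around the compiler intrinsic from <x86intrin.h>, or equivalent inline assembly:
#include <x86intrin.h>

/* Read the 64-bit time-stamp counter via the RDTSC instruction. */
static unsigned long rdtsc(void) {
  return (unsigned long)__rdtsc();
}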
The code example we’ve been following, and this section on instruction rdtsc
,
were inspired by function
mktemp
in glibc.
Function mktemp
takes a template filename ending in XXXXXX and returns a unique filename
(e.g., mktemp("/tmp/fileXXXXXX") might yield "/tmp/file123456"). The
example is basically how mktemp
is implemented for
x86.
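As a usage sketch (mkstemp is the safer modern alternative, but mktemp itself is the point here):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  char tmpl[] = "/tmp/fileXXXXXX";    /* must be writable and end in six Xs */
  if (*mktemp(tmpl) != '\0')          /* on failure the template becomes "" */
    printf("unique name: %s\n", tmpl);
  return 0;
}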
Undefined Behavior
Let us now consider that macro MORE_RANDOMNESS
is defined as follows:
#define MORE_RANDOMNESS(X) ((X))
This results in a C program with undefined behavior, as variable ret
is read
without being initialized. Of course, a program written in C that exhibits
undefined behavior has no meaning and any behavior (of the whole program) is
acceptable.
So, a program that exhibits any undefined behavior is incorrect and we should
limit our discussion to correct C programs, right? Well, not quite. Some programs have
intentional undefined behavior. For instance, OpenSSL initializes a random
data generator with an uninitialized byte
array,
which provides a small amount of extra entropy as the runtime value of the
uninitialized array is hard to predict. This is by design and can be disabled
by setting flag -DPURIFY
when building
OpenSSL.
Using a -DPURIFY
build in production did lead to a vulnerability in
Debian worthy of being
mentioned in an xkcd comic. Still, the most
recent version of OpenSSL (1.1.0) uses -DPURIFY
by
default.
Getting back to our example, as with rdtsc
, this means that variable ret
has
a different value between P1 and P2, depending on the “garbage” contents of
the uninitialized stack variable. However, note that running the same program
compiled by the same compiler with the same optimization level on P1 and P2
may lead to the same “garbage” being left on the stack, which hides this
divergence. Still, changing any of those factors (a different compiler
or a different optimization level) can make processes P1 and P2 diverge.
Multi-threading
Processes that use several threads are exposed to another form of
non-determinism: thread scheduling. For instance, we can imagine
a program in which two threads race to grab a lock, and then each thread opens a
file. Which thread wins the race depends on the scheduler, and may be
different between P1 and P2. This issue can be solved with deterministic
scheduling or by imposing a
total order between system calls, as Varan
does.
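A sketch of that scenario: two threads contend for the same lock and each then opens a file, so the order of open system calls that a tracer observes depends on which thread the scheduler runs first. The file paths are arbitrary placeholders.
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *path) {
  pthread_mutex_lock(&lock);                      /* the race: which thread gets here first? */
  int fd = open((const char *)path, O_RDONLY);    /* order of open() differs between runs */
  pthread_mutex_unlock(&lock);
  if (fd >= 0)
    close(fd);
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, worker, (void *)"/etc/hostname");
  pthread_create(&t2, NULL, worker, (void *)"/etc/hosts");
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  return 0;   /* compile with: gcc -pthread race.c */
}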