When I presented my recent work on Varan recently, more than one person asked me a variant of the following question:

I’ve been using ptrace to force the exact same execution on two processes by ensuring both read the same data from their system calls. Still, at some point, they start behaving differently (they diverge). Why?

ptrace is a feature of the Linux kernel. Quoting it’s manual page:

The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

When the tracee issues a system call, the tracer gets notified with the number of the system call and its arguments. That alone is quite useful. For instance, strace uses ptrace to dump the sequence of system calls that a process issues.

Besides observing the tracee, ptrace also allows the tracer to attach the tracees memory. As a result, the tracer can emulate the tracee’s system calls without actually executing them.

For instance, let us consider that we have two processes, P1 and P2, running the same program. We can use ptrace attach both processes and copy the results of all the system calls from P1 to P2. Given that the kernel mediates all inputs that a program ever receives, this way we ensure that the two processes get the same inputs.

Let us consider now a simple example. The following function generates a random number by XOR-ing the current time, and the process PID.


#define MORE_RANDOMNESS(X) ((X) = 0)

unsigned long random() {
    unsigned long ret;

    MORE_RANDOMNESS(ret);

    struct timeval t;
    gettimeofday(&t);
    ret ^= t.tv_sec ^ t.tv_usec;

    int pid = getpid();
    ret ^= (unsigned long)pid;

    return ret;
}

This function issues two system calls: gettimeofday and getpid. Using ptrace as described above, we can ensure that both processes obtain the same time and the same PID. Therefore, variables t, pid, and ret should match between processes P1 and P2. Right?

Wrong

In this post, I explain the main reasons why, according to my experience.

Virtual System Calls

If you run the program that I show above under strace on an x86 machine, you will notice the absence of system call gettimeofday from the trace. Why?

System call gettimeofday in x86 is a virtual system call. The Linux kernel uses virtual system calls to accelerate common system calls that may execute outside of the kernel. The idea is that such “system calls” execute as a simple function call, rather than an expensive context switch to kernel mode. The Linux kernel exposes a virtual dynamic shared object (vDSO) as an ELF file mapped in every process and discoverable through the auxiliary vector. Then, at runtime, the C standard library (libc) scans the auxiliary vector, finds the vDSO, and calls those functions instead of issuing system calls. Of course, as programs do not issue system calls directly and use libc instead, this is all completely transparent to the program using such virtual system calls.

The speed difference between virtual system calls and regular system calls is quite noticeable. This program issues 1.000.000 gettimeofday system calls, first through libc (and thus using virtual system calls) and then manually through hand-written assembly. Here’s the output on my machine:


> gcc -O0 example.c
> time ./a.out   # system call
./a.out    0.04s user 0.08s system 97% cpu 0.123 total
> time ./a.out 0 #virtual system call
./a.out a  0.03s user 0.00s system 91% cpu 0.033 total

That’s a 3x difference!

Getting back to why ptrace fails, in this case there is no system call for ptrace to intercept. As a result, P1 and P2 get different values for t, thus generating different values for ret.

glibc get_pid

glibc is the GNU implementation of the C standard library (libc), and the de-facto libc used in GNU/Linux. Until recently, glibc optimized system call getpid by caching the PID when initializing (before the program’s main function is called). Then, function getpid simply reads the cached PID without issuing any system call.

The program I show above uses getpid to generate the random number. In our scenario, if we wait until P1 and P2 reach their respective main function to start matching their system calls through ptrace, we miss the caching of the PID. Therefore, each process gets a different value for variable pid.

There’s yet another surprise in how glibc caches the PID. Instead if issuing the expectable system call getpid, glibc uses the rather unlikely system call set_tid_address for that. So, when using strace to detect the point in the program execution where the PID is read and cached, look out for system call set_tid_address.

This approach of caching the PID has gathered some criticism. Linus Torvalds himself uses colorful language to describe how he feels about this optimization. Recently, glibc removed this optimization, which is not present since, and including, version 2.25 (released on 5/Feb/2017).

Non-determinism

The rationale behind synchronizing two processes at the level of system calls assumes that the programs that the processes execute are deterministic. This is a reasonable assumption, given how programs typically gather any non-determinism they use vie the operating system through system calls (e.g., by reading file /dev/random, or by getting the date through system call gettimeofday). As a result, the programs themselves are deterministic modulus the inputs from the operating system.

However, and unfortunately, programs can exbibit non-determinism without interacting with the operating system. This is a basic limitation not only of ptrace, but of any other method that matches system calls between processes.

Let’s look at a few examples of such non-determinism that I found in practice.

rdstc Instruction

The x86 instruction rdtsc reads the value of the timestamp counter, which is a 64-bit register that increases monotonically at every clock cycle since the processor was reset. It was introduced with the Pentium processor and it was a great way to get accurate performance timing for programs. However, it is not usable for that purpose nowadays anymore, due to out-of-order execution, several logical cores on the same physical CPU, context switching between threads and processes, etc.

Still, some existing code uses this instruction to generate low-quality random data very and monotonically increasing timestamps very fast. For instance, going back to our example, suppose that macro MORE_RANDOMNESS is defined as:

#define MORE_RANDOMNESS(X) ((X = rdtsc()))

In this case, the value of variable ret will never be the same between processes P1 and P2.

The code example we’ve been following, and this section on instruction rdtsc, were inspired by function mktemp in glibc. Function mktemp takes a template filename ending in XXXXX and returns an unique filename (e.g., mktemp(/tmp/fileXXXXXX) = /tmp/file123456). The example is basically how mktemp is implemented for x86.

Undefined Behavior

Let us now consider that macro MORE_RANDOMNESS is defined as follows:

#define MORE_RANDOMNESS(X) ((X))

This results in a C program with undefined behavior, as variable ret is read without being initialized. Of course, a program written in C that displays undefined behavior has no meaning and any behavior (of the whole program) is acceptable.

So, a program that displays any undefined behavior is incorrect and we should limit our discussion to correct C programs, right? Well. Some programs have intentional undefined behavior. For instance, OpenSSL initializes a random data generator with an uninitialized byte array, which provides a small amount of extra entropy as the runtime value of the uninitialized array is hard to predict. This is by design and can be disabled by setting flag -DPURIFY when building OpenSSL. Using a -DPURIFY build in production did lead to a vulnerability in Debian worthy of being mentioned in an xkcd comic. Still, the most recent version of OpenSSL (1.1.0) uses -DPURIFY by default.

Getting back to our example, as with rdtsc, this means that variable ret has a different value between P1 and P2, depending on the “garbage” contents of the uninitialized stack buffer. However, note that running the same program compiled by the same compiler with the same optimization level on P1 and P2 may lead to the same “garbage” being left on the stack, which hides this divergence. Still, changing any of those variables (using different compilers or different optimization levels) makes processes P1 and P2 diverge.

Multi-threading

Processes that use several threads have access to another form of non-determinism in the shape of thread scheduling. For instance, we can imagine a program in which two threads race to grab a lock, and then each thread opens a file. Which thread wins the race depends on the scheduler, and may be different between P1 and P2. This issue can be solved with deterministic scheduling or by imposing a total order between system calls, as Varan does.