Luís Pina

Archive | RSS

Matching Linux Processes Through ptrace


When I presented my recent work on Varan recently, more than one person asked me a variant of the following question:

I’ve been using ptrace to force the exact same execution on two processes by ensuring both read the same data from their system calls. Still, at some point, they start behaving differently (they diverge). Why?

ptrace is a feature of the Linux kernel. Quoting it’s manual page:

The ptrace() system call provides a means by which one process (the “tracer”) may observe and control the execution of another process (the “tracee”), and examine and change the tracee’s memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

When the tracee issues a system call, the tracer gets notified with the number of the system call and its arguments. That alone is quite useful. For instance, strace uses ptrace to dump the sequence of system calls that a process issues.

Besides observing the tracee, ptrace also allows the tracer to attach the tracees memory. As a result, the tracer can emulate the tracee’s system calls without actually executing them.

For instance, let us consider that we have two processes, P1 and P2, running the same program. We can use ptrace attach both processes and copy the results of all the system calls from P1 to P2. Given that the kernel mediates all inputs that a program ever receives, this way we ensure that the two processes get the same inputs.

Let us consider now a simple example. The following function generates a random number by XOR-ing the current time, and the process PID.

#define MORE_RANDOMNESS(X) ((X) = 0)

unsigned long random() {
    unsigned long ret;


    struct timeval t;
    ret ^= t.tv_sec ^ t.tv_usec;

    int pid = getpid();
    ret ^= (unsigned long)pid;

    return ret;

This function issues two system calls: gettimeofday and getpid. Using ptrace as described above, we can ensure that both processes obtain the same time and the same PID. Therefore, variables t, pid, and ret should match between processes P1 and P2. Right?


In this post, I explain the main reasons why, according to my experience.

Virtual System Calls

If you run the program that I show above under strace on an x86 machine, you will notice the absence of system call gettimeofday from the trace. Why?

System call gettimeofday in x86 is a virtual system call. The Linux kernel uses virtual system calls to accelerate common system calls that may execute outside of the kernel. The idea is that such “system calls” execute as a simple function call, rather than an expensive context switch to kernel mode. The Linux kernel exposes a virtual dynamic shared object (vDSO) as an ELF file mapped in every process and discoverable through the auxiliary vector. Then, at runtime, the C standard library (libc) scans the auxiliary vector, finds the vDSO, and calls those functions instead of issuing system calls. Of course, as programs do not issue system calls directly and use libc instead, this is all completely transparent to the program using such virtual system calls.

The speed difference between virtual system calls and regular system calls is quite noticeable. This program issues 1.000.000 gettimeofday system calls, first through libc (and thus using virtual system calls) and then manually through hand-written assembly. Here’s the output on my machine:

> gcc -O0 example.c
> time ./a.out   # system call
./a.out    0.04s user 0.08s system 97% cpu 0.123 total
> time ./a.out 0 #virtual system call
./a.out a  0.03s user 0.00s system 91% cpu 0.033 total

That’s a 3x difference!

Getting back to why ptrace fails, in this case there is no system call for ptrace to intercept. As a result, P1 and P2 get different values for t, thus generating different values for ret.

glibc get_pid

glibc is the GNU implementation of the C standard library (libc), and the de-facto libc used in GNU/Linux. Until recently, glibc optimized system call getpid by caching the PID when initializing (before the program’s main function is called). Then, function getpid simply reads the cached PID without issuing any system call.

The program I show above uses getpid to generate the random number. In our scenario, if we wait until P1 and P2 reach their respective main function to start matching their system calls through ptrace, we miss the caching of the PID. Therefore, each process gets a different value for variable pid.

There’s yet another surprise in how glibc caches the PID. Instead if issuing the expectable system call getpid, glibc uses the rather unlikely system call set_tid_address for that. So, when using strace to detect the point in the program execution where the PID is read and cached, look out for system call set_tid_address.

This approach of caching the PID has gathered some criticism. Linus Torvalds himself uses colorful language to describe how he feels about this optimization. Recently, glibc removed this optimization, which is not present since, and including, version 2.25 (released on 5/Feb/2017).


The rationale behind synchronizing two processes at the level of system calls assumes that the programs that the processes execute are deterministic. This is a reasonable assumption, given how programs typically gather any non-determinism they use vie the operating system through system calls (e.g., by reading file /dev/random, or by getting the date through system call gettimeofday). As a result, the programs themselves are deterministic modulus the inputs from the operating system.

However, and unfortunately, programs can exbibit non-determinism without interacting with the operating system. This is a basic limitation not only of ptrace, but of any other method that matches system calls between processes.

Let’s look at a few examples of such non-determinism that I found in practice.

rdstc Instruction

The x86 instruction rdtsc reads the value of the timestamp counter, which is a 64-bit register that increases monotonically at every clock cycle since the processor was reset. It was introduced with the Pentium processor and it was a great way to get accurate performance timing for programs. However, it is not usable for that purpose nowadays anymore, due to out-of-order execution, several logical cores on the same physical CPU, context switching between threads and processes, etc.

Still, some existing code uses this instruction to generate low-quality random data very and monotonically increasing timestamps very fast. For instance, going back to our example, suppose that macro MORE_RANDOMNESS is defined as:

#define MORE_RANDOMNESS(X) ((X = rdtsc()))

In this case, the value of variable ret will never be the same between processes P1 and P2.

The code example we’ve been following, and this section on instruction rdtsc, were inspired by function mktemp in glibc. Function mktemp takes a template filename ending in XXXXX and returns an unique filename (e.g., mktemp(/tmp/fileXXXXXX) = /tmp/file123456). The example is basically how mktemp is implemented for x86.

Undefined Behavior

Let us now consider that macro MORE_RANDOMNESS is defined as follows:

#define MORE_RANDOMNESS(X) ((X))

This results in a C program with undefined behavior, as variable ret is read without being initialized. Of course, a program written in C that displays undefined behavior has no meaning and any behavior (of the whole program) is acceptable.

So, a program that displays any undefined behavior is incorrect and we should limit our discussion to correct C programs, right? Well. Some programs have intentional undefined behavior. For instance, OpenSSL initializes a random data generator with an uninitialized byte array, which provides a small amount of extra entropy as the runtime value of the uninitialized array is hard to predict. This is by design and can be disabled by setting flag -DPURIFY when building OpenSSL. Using a -DPURIFY build in production did lead to a vulnerability in Debian worthy of being mentioned in an xkcd comic. Still, the most recent version of OpenSSL (1.1.0) uses -DPURIFY by default.

Getting back to our example, as with rdtsc, this means that variable ret has a different value between P1 and P2, depending on the “garbage” contents of the uninitialized stack buffer. However, note that running the same program compiled by the same compiler with the same optimization level on P1 and P2 may lead to the same “garbage” being left on the stack, which hides this divergence. Still, changing any of those variables (using different compilers or different optimization levels) makes processes P1 and P2 diverge.


Processes that use several threads have access to another form of non-determinism in the shape of thread scheduling. For instance, we can imagine a program in which two threads race to grab a lock, and then each thread opens a file. Which thread wins the race depends on the scheduler, and may be different between P1 and P2. This issue can be solved with deterministic scheduling or by imposing a total order between system calls, as Varan does.

Usages of sun.misc.Unsafe within Rubah


Rubah is a Dynamic Software Updating system for Java that works on the stock Oracle HotSpot JVM, does not add any measurable overhead when running a program, and performs dynamic software updates efficiently.

In this post, I explain how Rubah uses the low-level unsafe operations available in class sun.misc.Unsafe (which I shall refer to as the unsafe API) to implement the optimizations that make it so efficient. I start by explaining what the unsafe API is and how to use it, then I describe the object memory layout of the Oracle HotSpot JVM and how to explore it with the unsafe API, then I discuss how Rubah does that to improve its performance, and finally I propose how to make some of the unsafe operations safer in a way that is compatible with Rubah.

What is sun.misc.Unsafe?

The class sun.misc.Unsafe is a proprietary API that enables a Java program to escape the control of the JVM and perform potentially unsafe operations, like direct memory manipulation. Here is a list of interesting methods that this API has (more documentation available here):

public class Unsafe {
	// use reflection instead, this method throws an exception if the calling
	// class is not on the bootstrap classpath
	public static Unsafe getUnsafe();

	// static field/Object field/array manipulation utilities
	public Object staticFieldBase    (Field f);
	public long   staticFieldOffset  (Field f);
	public long   objectFieldOffset  (Field f);
	public int    arrayBaseOffset    (Class c);
	public int    arrayIndexScale    (Class c);

	// reference-type field manipulation
	public Object getObject (Object o, long offset);
	public void   putObject (Object o, long offset, Object x);

	// int-type field manipulation (repeated for every other primitive type)
	public int    getInt    (Object o, long offset);
	public void   putInt    (Object o, long offset, int x);

	// compare-and-swap object/int/long
	public boolean compareAndSwapObject (Object o, long offset, Object expected, Object x);
	public boolean compareAndSwapInt    (Object o, long offset, int expected,    int x);
	public boolean compareAndSwapLong   (Object o, long offset, long expected,   long x);

	// allocate instance without running constructors
	public Object allocateInstance (Class cls);

	// try entering a monitor, fail with false instead of blocking if not able to
	public boolean tryMonitorEnter (Object o);


For instance, the following code sets an entire integer array to 1 using the unsafe API to avoid any bounds check:

// For this to work, add the class to the bootstrap classloader by passing
// the option -Xbootclasspath/a:. to the java command
Unsafe u    = Unsafe.getUnsafe();
int   size  = 1 << 16;
int[] array = new int[size];
int   base  = u.arrayBaseOffset(Integer.class);
int   scale = u.arrayIndexScale(Integer.class);

for (int i = 0 ; i < size ; i++)
  u.putInt(array, (base + i * scale), 1);

// Nothing prevents me writting out of bounds:
// u.putInt(array, (base + size * scale), 1);
// Or reading:
// u.getInt(array, (base + size * scale));

While this is faster than regular Java array manipulation, it may lead to out-of-bounds accesses that can be exploited maliciously.

The unsafe API is used extensively throughout the java.util.concurrent package to perform atomic compare-and-swap operations and volatile read/write on arrays. Currently, that is the only way to do those operations. However, the Oracle JVM development team is looking into ways to make this API safe and part of the Java API.

Most of the low-level memory operations can also be made through JNI native code. However, using the unsafe API is more efficient because it avoids the cost of switching context from Java to JNI and back. Besides, the JIT compiles some calls to the unsafe API directly to JVM intrinsics1.

HotSpot JVM object memory layout

There is a code pattern when using the unsafe API to manipulate data in object fields or arrays. The code above shows an example of that pattern: First, it gets a base pointer with method arrayBaseOffset, then, it gets the array scale factor with method arrayIndexScale, and ,finally, it computes the formula base + i * scale to manipulate position i on the array with methods getInt or putInt.

A similar pattern applies to manipulating objects, the following example shows:

// For this to work, add the class to the bootstrap classloader by passing
// the option -Xbootclasspath/a:. to the java command
Unsafe u       = Unsafe.getUnsafe();
LinkedList obj = new LinkedList();
Class c        = LinkedList.class;
Field f        = c.getDeclaredField("first");
long  offset   = u.objectFieldOffset(f);

u.getObject(obj, offset);
u.putObject(obj, offset, null);

// Nothing prevents me from writting a wrong type:
// u.putObject(obj, offset, c);

For manipulating object fields, we do not get a base pointer and a scale factor. We get, instead, the offset of each individual field with method objectFieldOffset. We can then manipulate object fields using methods getObject and putObject. Note that these are the same methods as in the previous example: If field first had type int, we would use methods getInt and putInt.

Both these code patterns map directly to how the JVM lays objects in memory. The following figure shows how objects look in memory (memory addresses of variables/fields from the code above are writen in a C-like style with a preceding &):

Given the similarity between the HotSpot and the OpenJDK, I shall use the OpenJDK names in the description of the object memory layout, with links to the relevant header files in the OpenJDK source. Every object starts with a fixed sized header that has two fields, as defined by the header file src/share/vm/oops/oop.hpp:

Arrays also have a header that starts with the same two fields, followed by an extra field, as defined by header file src/share/vm/oops/arrayOop.hpp:

Manipulating the object model

Rubah uses the unsafe API to manipulate the metadata on the object header. This is extremely unsafe3 and getting it right was itself an important implementation challenge. This is also brittle because the unsafe API is not part of the standard Java API. Future versions of the HotSpot JVM are free to change how objects are layed out in memory.

In the following, I list all the different ways Rubah manipulates the header using the unsafe API:

// For this to work, add the class to the bootstrap classloader by passing
// the option -Xbootclasspath/a:. to the java command
Unsafe u     = Unsafe.getUnsafe();
Object o     = new Object();

// Valid for a 64 bit JVM with compressed oops
// Might not work for different architectures
long offset  = 1L;

int newHash  = 42;

int identityHashCode = System.identityHashCode(o);
int unsafeHashCode   = u.getInt(o, offset);

assert (identityHashCode == unsafeHashCode != newHash); 
u.putInt(o, offset, newHash);

int identityHashCode = System.identityHashCode(o);
int unsafeHashCode   = u.getInt(o, offset);

assert (identityHashCode == unsafeHashCode == newHash);

class A { /* empty */ }
class B { /* empty */ }

// For this to work, add the class to the bootstrap classloader by passing
// the option -Xbootclasspath/a:. to the java command
Unsafe u     = Unsafe.getUnsafe();
Object a     = new A();
Object b     = new B();

// Valid for a 64 bit JVM with compressed oops
// Might not work for different architectures
long offset  = 8L;

// Do not keep this around for long, it changes over time
int klass    = u.getInt(a, offset);

assert (a instanceof A);
assert (b instanceof B);

u.putInt(b, offset, klass);

assert (a instanceof A);
// If this code gets JITed, the following line may terminate the JVM with SIGSEGV
assert (b instanceof A);

How to make sun.misc.Unsafe safer

Useful as it may be, sun.misc.Unsafe is a proprietary API that will disappear or be modified in future releases of the HotSpot JVM. Rubah relies on features of the unsafe API that might disappear. So, in this section, I propose some alternatives to make those features safe so that they can be made part of the standard Java API.

  1. From wikipedia:

    In compiler theory, an intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it substitutes a sequence of automatically generated instructions for the original function call, similar to an inline function. Unlike an inline function though, the compiler has an intimate knowledge of the intrinsic function and can therefore better integrate it and optimize it for the situation. This is also called builtin function in many languages.

    For instance, calling System.identityHashCode is compiled to an intrinsic operation that just reads the hash code from the object header in a few native instructions, rather than to a method call. 

  2. Locking in the JVM is complex, using both biased and thin locks, which I do not discuss in this post for the sake of simplicity. From the JVM documentation:

    -XX:+UseBiasedLocking Enables a technique for improving the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

    Thin locks are explained in the paper Thin locks: featherweight Synchronization for Java 

  3. To work with a different JVM, Rubah needs to be ported to use the memory layout of objects on that JVM. For instance, Rubah can be ported to the Jikes RVM , which has a similar object memory layout, but all the sun.misc.Unsafe operations have to be rewritten to use instead VM Magic.

    We argue that Rubah is portable between JVMs because it does not require any modification to the JVM itself. Other DSU systems for Java require a custom GC algorithm and are much harder to port to different JVMs. 

  4. The only way to perform compare-and-swap in a Java program without using the unsafe API is to use class java.util.concurrent.AtomicReference ,which is internally implemented using the unsafe API

  5. The steady-state overhead of the type-erasure AND hash-code field together was around 8%, and not 10% as the reader might expect.  2

Presenter software for PDF slideshows



I use Latex to write all my technical documents. I use it also to make slides that I show at my presentations. So, I end up with a PDF that contains the slides.

The problem

To show a PDF slideshow, we require some support from the PDF reader software. At the very minimum, it needs to support fullscreen, which the vast majority does. Some go a little further. For instance,Okular shows the current slide number and a visual representation of how many slides are still left.

But that’s it. If we take a look at other presentation software, such as as Libre Office’s Impress, we see that it takes advantage of a second monitor to show a presenter’s console that features: The current and next slide, the time elapsed and remaining, the notes that we wrote for each slide, and some other information.

Using two monitors to present your slides is a very common usage scenario (laptop connected to a overhead projector) and the presenter screen is really, really useful. I can’t stress that enough.

The (incomplete) solutions

Let’s take a look at what software is there available specifically for presenting PDF slides, splitted into three classes:

The first class of programs bring some “eye-candy” to your presentations. I have used impressive, and it works really well. It pre-loads all the slides, so changing slides is really fast. It also animates the slide transition. A fast crossfading between slides looks great, more elaborate transitions just looks like showing off. I have not tried any of the others, but I guess they all should do more or less the same.

Pdf-presenter-console and pdf-presenter support a presenter console. However, none has all the features that I value. One supports notes but does not support time information, the other supports time information but does not support notes.

My solution

I could not find any PDF presenter program with all the features that I would like to have. So I made my own!

I present you the brand new open-pdf-presenter. Version 0.1 was release yesterday! You can find on the wiki instructions about building and running it.

Here are some screenshots:


Currently, it displays a presenter console with the current slide, the next slide, the elapsed/remaining time, and how many slides are left. In the near future, it will support notes on a separate file and some other nice features from existing software:

It is open-source, so feel free to inspect and modify my code. It uses git as the revision control system. If you make an useful and working patch, just use git to format that patch and email it to me.

I welcome all feedback (bug reports, feature requests, patches, trolling, rantsdeath threats) so feel free to leave your own. I’ll keep you posted when I make another release.