All About System Timing

Intro

In short, I got tired of re-explaining how to properly time a piece of software. It is in some cases a very non-trivial task. When done poorly, it can lead to very skewed results.

Timing Basics

When trying to figure out if modifications to a program are effective, a user might want to know how much time a program took to execute before and after the modifications. There are dozens if not more methods. Here I'll focus on the ones that involve in system monitoring, that is either using the command line or programmatically. If using an embedded system or an FPGA one could use external signals to query a clock outside of the actual system being measured. When dealing with software system however things are often a bit more complicated.

The first and most obvious way to time something on Linux is using the "time" command. Its simple, easy to use and gives obvious end to end timings of a program. However its not that accurate, and since it only gives end to end timings it doesn't really give a way to time a single function or code section precisely. For this we need something else.

Unfortunately there are several options to choose from at this point, most are operating system or architecture dependent. We'll cover the Linux system calls first, then the OS X system calls followed by architecture specific options. Last we'll go over different ways to use these commands.

Linux provides the clock_gettime function which provides as one of its options a "real time" clock. This function is simple, its easy to use and it provides theoretical nanosecond accuracy with most implementations. As an example of how to call this function, here's the following example:

All that is required to time a segment of code is to call clock_gettime before and after execution of the segment. The only real problem with the clock_gettime function is that it takes a lot of time (relatively speaking). This function takes about 250ns on average to return. What this means is that if the thing you are attempting to time something anywhere close to 250ns in execution time then you will be measuring mostly the variation in time to call clock_gettime and not your code. Don't get me wrong, this system call has many advantages: 1) its relatively portable on Linux machines 2) it is constant across multiple cores. However, if you want less latency for subsequent calls to the timer there are several other methods (covered in the next post). First though, there's a slightly different call to get a "nanosecond" timer on Darwin (OS X) which I'll post an example for next. The mach_absolute_time function returns a time unit that is theoretically CPU / platform specific (see Apple reference here: ref). To get a usable time reference that is comparable to other times we have to get a system conversion factor. The functions that would normally do this don't appear to be present in a header file so we have to do it manually (although if you know different I'd love to hear from you). The next example shows how to call mach_absolute_time and get a usable time in nanoseconds from it.

This method suffers from the same latency issue as clock_gettime, but again it is portable to systems running Darwin (OS X) and it will return the same consistent results across multiple cores. To get times with less latency you'll have to use something called a time stamp counter. Most modern processors provide some mechanism to do this with varying resolutions. In the next post I'll cover how to call the x86 time stamp counter properly. In future posts I'll also talk about ways to provide a consistent time across multiple processor cores and how to reduce timer latency via multi-threading.

More accurate timing with time stamp counters → first up, x86

So you want timings with less latency than clock_gettime or mach_absolute_time can provide. Well, there's a solution. It involves some basic assembly code and a few other higher level systems programming issues that I'll cover. This section will only go over x86 time stamp counters since no kernel modules are needed. In future sections I'll cover some ARM platforms, however many will require kernel modules to be built. Surprisingly, x86 is a bit simpler in this respect and I can cover the pitfalls of a Time Stamp Counter (TSC) a bit easier without covering how to build a kernel module.

First of all, what is a TSC? Simply put its a cycle counter that is able to be read by querying specific registers within the processor. Sadly that's about as specific as I can get in the general case. The TSC is implemented slightly differently on differing x86 processors. In most modern processors, the cycles you're actually reading are from the bus, not the actual processor cycle ticks. A second consideration on modern processors is conversion to and from real time. It is often useful to present results in something easier to digest such as seconds. A clock tick is a unit of time, yes, but it is not constant. Things like frequency scaling and hibernation mode make this conversion a bit difficult. To understand why, here's an example: take a measurement of the TSC at point A, and TSC at point B and consider what happens if the frequency scales from 2.5GHz to 3.0 GHz and then down to 1.3GHz between those two points. The cycle count hasn't changed, so no problem there. What happens when we want to convert to time perceived by the user (obviously time hasn't shifted for them at the same rate as the processor) so a fixed number of seconds has passed. At 3.0 GHz many more cycles occur per second than at 1.3 GHz so when converting we need to know how many cycles per second occur for each of these measurements and convert. But there's no real way to do this without turning off frequency scaling. You can attempt to correct for this by sampling for the frequency repeatedly, but you'll end up chasing your tail if you are trying to take nanosecond scale measurements since the time to find the frequency is far greater than tens of nanoseconds. So, bottom line is this: turn off frequency scaling (the way to do that varies by system so perhaps I'll cover it in a future section). In the next section I'll elaborate on some methods to get a constant time reference across multiple compute cores.

The next consideration when using a time stamp counter is that you are in fact taking a value from a single processor. What happens if the thread or process executing the time stamp instruction migrates to a different core. Even if the counters are theoretically synchronized they won't be exact, so you could have drift of several cycles which would make the readings inaccurate. To fix this problem the thread executing the time stamp instruction must be "pinned" to a single core. Linux provides the sched_setaffinity function. Before reading the time though after calling this function you'll need to wait for the operating system to move your thread or process to the requested core. The best way to do this is call a sched_yield. There are multiple ways to do this, but I'll use the pthread_yield instruction in the example below where the "pinning" process is shown in full.

Now that we've pinned the processor we can get down to what instructions to call for reading the time stamp counter. On x86_64 there are two differing instructions rdtsc and rdtscp. The first is available on almost all systems, but it is not a serializing instruction (that is, this instruction could get re-arranged or issued before or after the code you are attempting to measure). The second ensures that this machine instruction is issued in order without any additional work. AMD and Intel differ in how to handle the ordering issue, with a mfence vs. an lfence instruction. These differences also have performance implications for each processor so its best to use the specific instruction that is recommended for your processor. One thing that is common on the web that is NOT recommended is to issue the cpuid instruction. It will serialize in the same manner as the fence instructions but it is slow and could skew your measurements. Another common mistake is performing the follow on operations to rdtsc outside of the assembly code. If not executed perfectly, GCC will move these values to memory and back to register whereas it could be done entirely in register with no additional moves. In the example below, we have three differing versions of the x86 rdtsc instruction. The first is the rdtscp instruction which should be used if available. The next two are Intel and AMD specific versions. The last is if serialization for some reason is not aquired.

Technically the clobbered registers list is a bit long, but its faster to type for the example by covering all the conditional compilation directives with a single line covering all of them. This code of course must also be inserted within a function to be used. Some or all of this code was taken from my GitHub repository SystemClock.

Constant Time Source via Threading

Under construction.

ARM Time Stamp Counter

Under construction.