7 Apr 2011

Ahead-of-time Computing

If a problem is not parallelizable, but the computation is too extensive for the hardware to complete it on time in a single thread, then probably the only way to implement a soft real-time system is to calculate the answer before it is needed.

The time spent on computation, data transfer, etc., introduces a latency between an input and the output.

In general, whenever there is a latency between the input data and the output data and parallelizing the computation does not decrease the latency enough, the next thing to try, probably in conjunction with the parallelization, is ahead-of-time computing.

The idea is that one calculates a whole bunch of answers in parallel and later, when the actual input data is known, picks the answer, the output, that has been derived from the input closest to the real input. The rest of the answers are discarded, thrown away.
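
For concreteness, here is one rough sketch of the pick-the-closest-answer scheme, written in C with POSIX threads. The function expensive_f, the candidate inputs and the nearest-input selection rule are all made up for the illustration; a real system would plug in its own.

#include <math.h>
#include <pthread.h>
#include <stdio.h>

/* Stand-in for the real, expensive computation. */
static double expensive_f(double x) {
        double acc = 0.0;
        for (long i = 0; i < 1000000; i++)
                acc += sin(x + i * 1e-6);
        return acc;
}

typedef struct {
        double input;  /* a candidate input, guessed ahead of time */
        double output; /* the answer computed from that candidate  */
} guess_t;

static void *worker(void *arg) {
        guess_t *g = arg;
        g->output = expensive_f(g->input);
        return NULL;
}

int main(void) {
        /* Candidate inputs chosen before the real input is known. */
        guess_t guesses[] = {{1.0, 0.0}, {2.0, 0.0}, {3.0, 0.0}, {4.0, 0.0}};
        enum { N = sizeof guesses / sizeof guesses[0] };
        pthread_t tids[N];
        int i, best = 0;
        double real_input;

        /* Calculate the whole bunch of answers in parallel. */
        for (i = 0; i < N; i++)
                pthread_create(&tids[i], NULL, worker, &guesses[i]);
        for (i = 0; i < N; i++)
                pthread_join(tids[i], NULL);

        /* Only now does the actual input become known. */
        real_input = 2.3;

        /* Pick the answer derived from the closest candidate input;
           the rest of the answers are discarded. */
        for (i = 1; i < N; i++)
                if (fabs(guesses[i].input - real_input) <
                    fabs(guesses[best].input - real_input))
                        best = i;

        printf("using answer %f (from candidate input %f)\n",
               guesses[best].output, guesses[best].input);
        return 0;
}

In a real soft real-time setting the joins would of course be replaced by taking whatever answers happen to be ready when the deadline arrives.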

As of April 2011 I haven't tried it yet, but maybe ahead-of-time computing could be used for speeding applications up even on single-CPU systems. On a single-CPU system one would perform the ahead-of-time computation in the background, in a low-priority thread. Tasks that can be run in the background, for example some of the game artificial intelligence and just-in-time compilation, seem to be good candidates for ahead-of-time computing, because they are not that time critical.

One might even offload some of the background computation to other computers on a local area network. If most of the computation results get thrown away anyway and there is not even a guarantee that the bunch of ahead-of-time calculated answers contains a usable answer, then it does not matter if some of the answers in the bunch are not completed on time. That entails that even old-ish and slow-ish computers can contribute to the speed of an application that runs on a modern computer, provided that their computational-power-to-electric-power ratio is acceptable.

11 Feb 2011

How to Bootstrap a Threading API

This article is not intended to be a full-blown tutorial. It is meant to be a relatively concise set of notes, a quick reference, with some rehearsal to get the reader's mind onto the right "track".

Just as any Boolean logic operation (read: digital circuit) can be assembled from Boolean inversion and conjunction, all of a threading API, except the pausing and awakening of sleeping threads, can be assembled from two primitives: a thread-blocking sleep function and a mechanism for forking threads.

Once the sleep and the fork exist, the rest is all a matter of subjective taste, except the pausing and awakening of sleeping threads. The remaining part of this article is just one, somewhat inefficient and largely improvable, version of "the rest".
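
To fix an idea of what the two primitives might look like in practice, here is a sketch backed by POSIX calls; the names sleep_ms and fork_thread are placeholders of my own, not part of any particular API.

#include <pthread.h>
#include <unistd.h>

/* The assumed thread-blocking sleep. */
static void sleep_ms(long ms) {
        usleep(ms * 1000);
}

/* The assumed mechanism for forking threads: start fn(arg)
   in a new thread and return its handle. */
static pthread_t fork_thread(void *(*fn)(void *), void *arg) {
        pthread_t tid;
        pthread_create(&tid, NULL, fn, arg);
        return tid;
}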

To create a threading API that does not support the pausing and awakening of sleeping threads I add 2 things:

1) critical sections
2) thread collision detection

The critical sections are used for maintaining consistency of the values of a set of variables.

For the sake of rehearsal, or getting "in the mood": suppose the set of variables consists of an organism type and its coating, let's say organism==giraffe, coating==fur. If the set gets updated to a parrot, i.e. first organism:=parrot and only after that coating:=feathers, then before the value of the coating is updated there is a moment where organism==parrot, but coating==fur. It is better that no thread reads the inconsistent values of the pair (organism, coating) before the whole pair has been updated, and that is where the critical sections become useful.
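
As a minimal sketch of the giraffe-to-parrot update, here using a POSIX mutex as a stand-in for the hand-rolled critical sections that the rest of the article builds:

#include <pthread.h>

/* The shared pair whose two fields must stay consistent. */
static const char *organism = "giraffe";
static const char *coating = "fur";
static pthread_mutex_t pair_lock = PTHREAD_MUTEX_INITIALIZER;

/* The writer updates both fields inside one critical section, so
   no reader can ever observe organism==parrot with coating==fur. */
static void set_parrot(void) {
        pthread_mutex_lock(&pair_lock);
        organism = "parrot";
        coating = "feathers";
        pthread_mutex_unlock(&pair_lock);
}

/* The reader takes the same lock, so it sees the pair either
   entirely before or entirely after an update, never mid-update. */
static void read_pair(const char **o, const char **c) {
        pthread_mutex_lock(&pair_lock);
        *o = organism;
        *c = coating;
        pthread_mutex_unlock(&pair_lock);
}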

Thread collision detection is for handling edge situations where no thread has yet locked the critical section, but more than one thread wants to enter it.

To leverage the capabilities of multi-CPU hardware as much as possible, one should avoid putting threads into a waiting, idle, state. To minimize the waiting time of threads, one wants to maximize the number of threads that can concurrently access the shared set of variables, i.e. one wants to allow as many threads as possible to concurrently enter the critical sections. One way to do that is to use 2 types of critical sections:

A) critical section for only reading
B) critical section for reading and writing

The idea is that it's OK for more than one thread to concurrently read the same set of shared variables, as long as there's no thread writing to the set of shared variables at that very same moment. On the other hand, no thread should be reading the set of shared variables while some other thread is writing to them. Neither should there be more than one thread concurrently writing to the set of shared variables.

At the implementation level this means that each set of shared variables has 2 locks: one set and released by the read-only critical sections and the other set and released by the read-write critical sections. Both of the critical section implementations read both of the locks.
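
One possible shape for the pair of locks is sketched below; the field and function names are my own. Note that the check-then-set steps here are not safe on their own: more than one thread may pass the check simultaneously, which is exactly the edge situation that the thread collision detection below is meant to handle.

/* The 2 locks of one set of shared variables. */
typedef struct {
        int readers; /* count, set and released by read-only sections  */
        int writer;  /* flag, set and released by read-write sections  */
} rw_locks_t;

/* A reader may enter while no writer holds its lock. */
static int try_enter_read(rw_locks_t *l) {
        if (l->writer)
                return 0; /* a writer is active: the caller must wait */
        l->readers++;
        return 1;
}

static void leave_read(rw_locks_t *l) {
        l->readers--;
}

/* A writer may enter only while there are neither readers nor
   another writer; both implementations read both of the locks. */
static int try_enter_write(rw_locks_t *l) {
        if (l->writer || l->readers > 0)
                return 0;
        l->writer = 1;
        return 1;
}

static void leave_write(rw_locks_t *l) {
        l->writer = 0;
}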

One possible way to implement the thread collision detection is like this:

b_done=false
while(!b_done){
        i_lock++ // announce this thread's attempt to enter
        if(i_lock==1){ // no collision: this thread is attempting alone
                enter critical section // i.e. set the section's own lock
                i_lock--
                b_done=true
        } else { // collision: more than one thread is attempting
                i_lock--
                sleep some semi-random time
        } // else
} // while

The sleeping times of the different blocked threads have to differ, and preferably change at every iteration, because otherwise the threads might collide with each other at every iteration, resulting in an infinite loop. One way to ensure the difference in sleeping times is to use semi-random numbers. This entails that a programming language can have quite functional threading support without having a statistically proper rand function.
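
For reference, here is a compilable C version of the loop. The bare i_lock++ of the pseudocode is not an atomic operation on real hardware, so this sketch substitutes C11 atomic operations for it; the section's own lock, the backoff bounds and the use of rand are illustrative choices of mine.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

static atomic_int i_lock = 0;       /* collision detection counter */
static atomic_int section_lock = 0; /* the critical section's own lock */

static void enter_critical_section(void) {
        bool b_done = false;
        while (!b_done) {
                /* Announce the attempt; atomic_fetch_add returns the old
                   value, so 0 means this thread is attempting alone. */
                if (atomic_fetch_add(&i_lock, 1) == 0 &&
                    atomic_load(&section_lock) == 0) {
                        atomic_store(&section_lock, 1); /* enter */
                        atomic_fetch_sub(&i_lock, 1);
                        b_done = true;
                } else {
                        atomic_fetch_sub(&i_lock, 1); /* withdraw */
                        /* Sleep some semi-random time so that the
                           colliding threads drift apart. */
                        usleep(1000 + rand() % 5000);
                }
        }
}

static void leave_critical_section(void) {
        atomic_store(&section_lock, 0);
}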