Apr 13, 2012

Concurrency as a natural paradigm


Since I work with Erlang/OTP and Go and keep pointing out their powerful handling of concurrency, I'm often asked what's special about it and how it can be used. Additionally, many people simply mix up concurrency and parallelism. So to make it clearer I have to start by leaving the computer behind. Don't panic, I'll return later.

I'm wondering what you're doing right now. Yes, one thing is reading this text. But please spend a few seconds looking around. What do you see, what is your environment? Maybe you're sitting at home or at work in your office. Or you're possibly sitting in a café reading this text on your smartphone. Are you alone, and did the rest of the world freeze when you started to read this text? No, definitely not. There are people around you, some you're interacting with and some you don't care about. So let's assume you are at work. Especially in our software business we usually work in teams. A number of people, sometimes smaller, sometimes larger, work on a project with different roles and different tasks. Some of those tasks can be handled by one developer alone, but many need discussion between colleagues, have to fit into the rest of the software and have to be realized in a useful order.

The work in those environments is done concurrently. From an artifacts perspective it may look serialized, from requirements to deployment. But the people responsible for all those artifacts of a project work in an independent but structured way, just as it has been done in house building or in factories for many hundreds of years. And that's not purely human behavior, we already know it from nature. Just take a look at bees and ants or all the other animals forming prides or colonies. In fact we have all grown up with it, but so far we haven't been able to handle it well in our software.

Now times have changed. We not only have languages like Go and Erlang/OTP (and many more), we also have computers and soon smartphones with a large number of cores. I'm writing this text on a quad-core with 8 hyper-threads, and that's only the beginning. Additionally our software has to handle more and more stuff independently but in a structured way. Think about server applications handling a large number of user sessions where each session may send several overlapping requests. Inside the server we're communicating with databases (maybe several, due to a business-driven mix of traditional relational and NoSQL backends), directory servers or further external systems in the case of a SOA. On top of that we may have to take care of a global state as well as time-based or other events.

Today we handle those tasks request by request in a serialized manner. And while we're waiting for the answer of a relatively slow I/O-based task, like reading from a database, our request execution sits idle. Concurrent software follows a different idea. While analyzing your problem and designing your solution you have to ask yourself where your dependencies are, what can be handled independently and where access has to be synchronized. So you break up your code into independent components which you then run and compose with the help of your chosen language. In Go those are goroutines communicating via channels, in Erlang/OTP processes sending messages to each other's mailboxes. Goroutines and processes are far more lightweight than operating system threads, and a system can run hundreds of thousands of them.
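To make "goroutines communicating via channels" a bit more tangible, here is a minimal Go sketch. The function name sumSquares and the squaring work are my own placeholders and not part of any real system: every value gets its own goroutine, the results travel back over one channel, and the caller simply adds them up.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares is a hypothetical example: it starts one goroutine per value,
// lets each one send its square over a shared channel and sums the results.
func sumSquares(values []int) int {
	results := make(chan int)
	var wg sync.WaitGroup

	for _, v := range values {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			results <- n * n
		}(v)
	}

	// Close the channel once every goroutine has delivered its result,
	// so the receiving loop below knows when to stop.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4}))
}
```

The same function also works with a slice of a hundred thousand values, which gives a feeling for how cheap a single goroutine is.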

Let's show it with a little scenario. You have to read data from a source, transform it into an internal representation for further processing, enrich it with data from an additional external source, modify data in two further external targets, and finally transform the enriched data into the target format and write it to a target in order. Not really an uncommon job. Let's now do a first naive implementation (pseudo language):

// One serialized job execution.
fun Job(source, addSource, addTargetA, addTargetB, target) {
    raw = source.Read()
    internal = Modify(raw)
    enriched = addSource.Enrich(internal)
    modificationA = ExtractModificationA(enriched)
    modificationB = ExtractModificationB(enriched)
    addTargetA.Write(modificationA)
    addTargetB.Write(modificationB)
    result = Transform(enriched)
    target.Write(result)
}

Nice job, step by step. With a lot of waiting due to the I/O and maybe longer processing steps. So let's speed it up by spawning it several times.

// Starting our job a number of times.
fun Starter(source, addSource, addTargetA, addTargetB, target, count) {
    for i = 1 to count {
        spawn Job(source, addSource, addTargetA, addTargetB, target)
    }
}

Wow, now with full speed, Starter(…, 10) brings ten times the power! But eh, stop, wait a moment. They all read in parallel from our source. Does it take care of synchronization? And if so, other jobs have to wait while one is reading. And the same trouble with the additional source, damn. And why do we have to wait for those two modifications and their writes? We don't get any result back from them, so we could go straight on to the transformation and the writing to the target. And here once again we have the synchronization problem. Additionally, how do we handle the in-order writing and how do we know that all is done? Our starter returns immediately after spawning the jobs. (sigh)
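Two of these worries, the shared source and the starter returning too early, can be sketched in Go as follows. This is only an illustration under my own assumptions: the source type, its read method and the starter function are hypothetical stand-ins. A mutex guards the source, and a sync.WaitGroup lets the starter wait for all jobs. The in-order writing is deliberately not solved here.

```go
package main

import (
	"fmt"
	"sync"
)

// source is a hypothetical shared input guarded by a mutex.
type source struct {
	mu    sync.Mutex
	items []string
}

// read hands out the next item, or false once the source is exhausted.
func (s *source) read() (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.items) == 0 {
		return "", false
	}
	item := s.items[0]
	s.items = s.items[1:]
	return item, true
}

// starter spawns count jobs and, unlike the pseudo Starter, waits until
// all of them are done. It returns the number of processed items.
func starter(src *source, count int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for i := 0; i < count; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, ok := src.read(); !ok {
					return
				}
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	wg.Wait()
	return processed
}

func main() {
	src := &source{items: []string{"a", "b", "c", "d", "e"}}
	fmt.Println(starter(src, 10))
}
```

Every read now queues up behind the mutex, which is exactly the waiting complained about above. The brute-force approach gets correct, but not elegant.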

So let's try a better approach. Not with brute force, but with more intelligence. Here we design our methods and functions so that they can read from channels or write to channels to pass data between them.

// Pipelined job execution.
fun Job(source, addSource, addTargetA, addTargetB, target) {
    channel raw
    channel enriched
    channel multicast
    channel transform
    channel transformed
    channel modificationA
    channel modifiedA
    channel modificationB
    channel modifiedB
    channel done


    spawn source.Read(raw)
    spawn addSource.Enrich(raw, multicast)
    spawn fun () {
        for e in multicast {
            send e to transform
            send e to modificationA
            send e to modificationB
        }
    }()
    spawn Transform(transform, transformed)
    spawn target.Write(transformed, done)
    spawn ExtractModificationA(modificationA, modifiedA)
    spawn addTargetA.Write(modifiedA)
    spawn ExtractModificationB(modificationB, modifiedB)
    spawn addTargetB.Write(modifiedB)


    receive done
}

Uuuh! Looks a bit more difficult in the beginning? That's mostly due to my funny pseudo language. (smile) You just have to visualize it. The process source.Read() continuously reads from the source and writes the data into the raw channel. There's no need for synchronization here. And while the next chunk of data is being read, the process addSource.Enrich() reads from raw and writes its result to multicast. This channel and the following anonymous function, running as its own process, are needed to copy every item from multicast to transform, modificationA and modificationB. This allows the transformation and both modifications to be handled in parallel, once again passing the data from process to process using channels. I assume in my little pseudo runtime that the channels are buffered and that spawned processes continue their work if their spawning process ends. You see, I'm mixing Go and Erlang/OTP here. (smile) So we come to the final trick. While both modification paths are fire and forget, the writer at the end of the transformation path has a channel named done. It signals the end of processing. After the last data has been read, enriched, transformed and written, the process target.Write() sends a signal via done and our job function can return.
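For those who prefer real code over my pseudo language, the pipeline could look roughly like this in Go. Take it as a sketch under assumptions: runJob and all stage bodies are invented placeholders (enriching appends a "+", transforming upper-cases, writing collects into slices). It also deviates from the pseudo runtime in two points: the channels are unbuffered, and the job waits for both side branches, because a Go program exits when main returns without waiting for other goroutines.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// runJob wires the stages together with channels. All stage bodies are
// hypothetical stand-ins for the real sources and targets.
func runJob(input []string) (written, modsA, modsB []string) {
	raw := make(chan string)
	multicast := make(chan string)
	transform := make(chan string)
	transformed := make(chan string)
	modificationA := make(chan string)
	modificationB := make(chan string)
	done := make(chan struct{})

	// source.Read: emit every input item, then close the channel.
	go func() {
		defer close(raw)
		for _, item := range input {
			raw <- item
		}
	}()

	// addSource.Enrich: placeholder enrichment.
	go func() {
		defer close(multicast)
		for item := range raw {
			multicast <- item + "+"
		}
	}()

	// Fan-out: copy every enriched item to all three branches.
	go func() {
		defer close(transform)
		defer close(modificationA)
		defer close(modificationB)
		for e := range multicast {
			transform <- e
			modificationA <- e
			modificationB <- e
		}
	}()

	// Transform: placeholder conversion into the target format.
	go func() {
		defer close(transformed)
		for e := range transform {
			transformed <- strings.ToUpper(e)
		}
	}()

	// target.Write: a single writer keeps the order, then signals done.
	go func() {
		defer close(done)
		for e := range transformed {
			written = append(written, e)
		}
	}()

	// Both modification branches, collected here instead of written.
	var side sync.WaitGroup
	side.Add(2)
	go func() {
		defer side.Done()
		for e := range modificationA {
			modsA = append(modsA, "A:"+e)
		}
	}()
	go func() {
		defer side.Done()
		for e := range modificationB {
			modsB = append(modsB, "B:"+e)
		}
	}()

	<-done
	side.Wait()
	return written, modsA, modsB
}

func main() {
	written, modsA, modsB := runJob([]string{"a", "b"})
	fmt.Println(written, modsA, modsB)
}
```

Because each stage is one goroutine reading from one channel, the items arrive at the writer in the order they were read, without any explicit locking.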

As we can see, we gain a wonderful set of little processes doing their work almost independently and interacting via communication. There are still possible problems, e.g. addSource.Enrich() may block due to an internal error or an I/O problem. So you may need some kind of error signaling. Or at least time-outs, like Erlang/OTP has in its receive statement and Go can easily create with a <-time.After() in a select statement. Just like finding a proper design, you have to take care of those kinds of errors. There's still no free lunch. But this topic, safety in concurrent applications, may be one for another blog entry.
