Boosting performance and concurrency in Python

Python provides a base socket server with no concurrency support by default. It can be used to build any server, including HTTPServer or WSGI application servers such as wsgiref. You can plug in concurrency support using ThreadingMixIn or ForkingMixIn. This lets our pure-Python server handle multiple requests by forking another process or starting a new thread, while the main thread in the main process keeps accepting requests.
In this post I'm going to introduce my own PooledProcessMixIn and its advantages over other solutions.
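To make the mix-in idea concrete, here is a minimal sketch using the Python 3 module names (socketserver and http.server). The handler HelloHandler, the class ThreadedHTTPServer and the helper fetch_once are hypothetical names made up for this illustration. The handler simply replies with the name of the thread that served it, which shows that the request did not run on the accepting thread.

```python
import http.client
import socketserver
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the name of the serving thread."""

    def do_GET(self):
        body = threading.current_thread().name.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


class ThreadedHTTPServer(socketserver.ThreadingMixIn, HTTPServer):
    # The mix-in starts a brand new thread for every accepted request.
    daemon_threads = True


def fetch_once():
    """Start the server on a free port, send one GET, return (status, body)."""
    server = ThreadedHTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    acceptor = threading.Thread(
        target=server.serve_forever, name="acceptor", daemon=True
    )
    acceptor.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        resp = conn.getresponse()
        result = (resp.status, resp.read().decode("ascii"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_once() returns the HTTP status and the serving thread's name, which is never "acceptor". Swapping ThreadingMixIn for ForkingMixIn would give one process per request instead.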

The concept of Pool

I've taken a look at the code of those mix-ins and found a serious performance issue: they allocate a new process or a new thread each time a request reaches the server. Besides delaying the response while waiting for the allocation, it's an open-ended approach (those threads or processes are never reused). The pool approach is to allocate a number of threads or fork a number of processes at server initialization time, then delegate requests to them.
There are some pure-Python WSGI servers that support thread pools, like python-paste, but that only works for WSGI, not for any arbitrary socket server.
I've asked for a pool mix-in, and one reply suggested wrapping concurrent.futures to provide that feature.
The ultimate pool I'm looking for should fork a number of processes to form a pool of processes, each with its own pool of threads. The key reason to fork processes is to work around Python's GIL and make use of different CPU cores (threads within one process cannot run Python code on several cores at once, because only one of them holds the GIL at a time), not to mention that forking is very cheap on UNIX-like operating systems.
To handle a huge number of concurrent requests we need a bigger pool, but forking more processes consumes more memory. That's why we start threads in each process: for example, instead of forking 256 processes we can fork 4 processes with 64 threads each.
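The layout described above (a few processes, each holding many threads) can be sketched roughly as follows. This is only an illustration of the shape of the pool, not the PooledProcessMixIn code. The names start_pool, worker_process and worker_thread are hypothetical, the "fork" start method is assumed (so it is Unix-only), and the threads just bump a shared counter where a real worker would wait for requests.

```python
import multiprocessing
import threading

PROCESSES = 4             # one per CPU core, for example
THREADS_PER_PROCESS = 64  # 4 * 64 = 256 workers in total


def worker_thread(started):
    # A real worker would loop here waiting for requests to serve.
    # For the sketch it only records that it came up.
    with started.get_lock():
        started.value += 1


def worker_process(threads, started):
    # Each forked process owns its own small pool of threads.
    pool = [
        threading.Thread(target=worker_thread, args=(started,))
        for _ in range(threads)
    ]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()


def start_pool(processes=PROCESSES, threads=THREADS_PER_PROCESS):
    """Fork `processes` children with `threads` threads each.

    Returns (number of processes, number of threads that started).
    """
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like OS
    started = ctx.Value("i", 0)  # counter living in shared memory
    children = [
        ctx.Process(target=worker_process, args=(threads, started))
        for _ in range(processes)
    ]
    for child in children:
        child.start()
    for child in children:
        child.join()
    return len(children), started.value
```

With the defaults, start_pool() brings up 256 workers while paying the memory cost of only 4 processes.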

Generic Pools and queues drawbacks

Generic pools such as multiprocessing.Pool and concurrent.futures let us run functions asynchronously on the pool and collect the result of each function, either through a callback (as in apply_async) or through a returned future object. In our case we don't care about the result; we just want to process the request.
Besides the generic pools, many well-known servers like Paste use a queue to deliver the request to a thread in the pool.
Queues are a high-level IPC tool; the generic multiprocessing pool involves 3 queues (one for tasks, one for parameters and one for results). Pushing objects into a queue implies serializing the parameters into an interchangeable form, which is called pickling. In our case the parameters are the request socket object and a tuple holding the client address (IP address and port). In C a socket is just an integer, but the request socket here is a higher-level object with many methods, and I guess handing it over involves extra serialization work. In other words, a fraction of time is wasted converting the socket into an interchangeable format and sending it through the queue to the other process.
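For reference, the two result-passing styles mentioned above look roughly like this. It is a small sketch with a hypothetical square function. multiprocessing.pool.ThreadPool is used in place of multiprocessing.Pool only to keep the example light; it exposes the same apply_async interface.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


def with_future():
    # concurrent.futures style: submit() hands back a future object
    # and the caller pulls the result out of it.
    with ThreadPoolExecutor(max_workers=2) as executor:
        future = executor.submit(square, 7)
        return future.result()


def with_callback():
    # multiprocessing style: apply_async() can push the result into
    # a callback once the function has finished.
    results = []
    with ThreadPool(2) as pool:
        pending = pool.apply_async(square, (7,), callback=results.append)
        pending.wait()
    return results
```

Both styles exist to carry a return value back to the caller, which is machinery a request handler never needs.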
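A quick experiment shows how differently the two parameters behave. This sketch assumes CPython 3 and uses a hypothetical helper name and a documentation-range IP address. The address tuple pickles without trouble, while plain pickle refuses the socket object outright, which is why multiprocessing needs its own special handling to hand a socket to another process.

```python
import pickle
import socket


def pickle_request_parts():
    """Try to pickle both request parameters.

    Returns (round-tripped client address, whether the socket pickled).
    """
    client_address = ("192.0.2.10", 54321)  # made-up example address
    address_bytes = pickle.dumps(client_address)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        pickle.dumps(sock)
        socket_picklable = True
    except TypeError:
        # CPython 3 sockets refuse plain pickling.
        socket_picklable = False
    finally:
        sock.close()

    return pickle.loads(address_bytes), socket_picklable
```

Either way, moving an accepted connection between processes costs more than letting the worker call accept() itself, which is the idea behind the next section.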

My alternative implementation with primitive IPC

Using a single semaphore and an event we can build a lightweight, highly concurrent server pool in pure Python.
A semaphore is a very basic inter-process communication tool that works like a counter of available resources. Let's say its value is 1, which means that one resource is available for a consumer. When a consumer acquires it, the value is decremented; when a producer wants to signal the availability of a resource, the value is incremented. Acquiring the semaphore blocks until its value is greater than zero. For example, if the producer process signals that we got two requests, it releases the semaphore twice. On the other side, the consumer processes are sleeping, waiting for the semaphore to become available; in our example only two of them get notified and the rest keep waiting.
We fork the processes just after we bind to the port, which means that the server socket is available in the child processes too. Instead of accepting the request in the main process, we signal its availability by releasing the semaphore. On the other side, one of the processes waiting on the semaphore accepts the connection and gets a new socket to receive the request and send the response. Before doing anything else, it signals through an event that it has accepted the connection.
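The "two releases wake exactly two consumers" behaviour can be checked with a few lines. This sketch uses threading.Semaphore for simplicity; multiprocessing.Semaphore offers the same acquire and release calls across processes. The function name and the timeout, which only exists so the sleeping consumers eventually give up and the example terminates, are assumptions of the sketch.

```python
import threading


def wake_consumers(consumers=4, requests=2, timeout=0.5):
    """Start `consumers` waiters, signal `requests` times, count wake-ups."""
    available = threading.Semaphore(0)  # nothing available yet
    woken = []
    lock = threading.Lock()

    def consumer(name):
        # Blocks until the counter is greater than zero (or we time out).
        if available.acquire(timeout=timeout):
            with lock:
                woken.append(name)

    threads = [
        threading.Thread(target=consumer, args=(i,)) for i in range(consumers)
    ]
    for thread in threads:
        thread.start()

    # The producer signals that two requests are waiting.
    for _ in range(requests):
        available.release()

    for thread in threads:
        thread.join()
    return len(woken)
```

With four consumers and two releases, only two consumers get through; the other two keep waiting until their timeout expires.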
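The handshake described above can be sketched as follows. This is not the PooledProcessMixIn source, only a minimal illustration under a few assumptions: the "fork" start method (Unix-only), select() as the way the main process notices a pending connection, and hypothetical names (worker, dispatch_one, serve_and_query). Each worker answers with its own pid so we can see who served the request.

```python
import multiprocessing
import os
import select
import socket


def worker(server_sock, pending, accepted):
    # Runs in a forked child; the listening socket was inherited.
    while True:
        pending.acquire()                 # sleep until a request is signalled
        conn, _addr = server_sock.accept()
        accepted.set()                    # tell the main process we took it
        with conn:
            conn.sendall(b"handled by pid %d\n" % os.getpid())


def dispatch_one(server_sock, pending, accepted, timeout=5):
    # Main process: notice a pending connection but do not accept it.
    select.select([server_sock], [], [], timeout)
    accepted.clear()
    pending.release()                     # wake exactly one worker
    if not accepted.wait(timeout):
        raise RuntimeError("no worker accepted the connection")


def serve_and_query(workers=2, requests=3):
    """Fork a small pool, send a few requests, return the replies."""
    ctx = multiprocessing.get_context("fork")
    pending = ctx.Semaphore(0)
    accepted = ctx.Event()

    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))
    server_sock.listen(16)
    address = server_sock.getsockname()

    # Fork just after binding, so every child shares the server socket.
    pool = [
        ctx.Process(target=worker, args=(server_sock, pending, accepted),
                    daemon=True)
        for _ in range(workers)
    ]
    for proc in pool:
        proc.start()

    replies = []
    try:
        for _ in range(requests):
            with socket.create_connection(address, timeout=5) as client:
                dispatch_one(server_sock, pending, accepted)
                chunks = []
                while True:
                    data = client.recv(1024)
                    if not data:
                        break
                    chunks.append(data)
                replies.append(b"".join(chunks))
    finally:
        for proc in pool:
            proc.terminate()
            proc.join()
        server_sock.close()
    return replies
```

Nothing is pickled and no queue is involved: the only things crossing the process boundary are one semaphore release and one event.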

Getting it

I've submitted it to PyPI, so you can install it with pip or easy_install.
If you find a bug or have an idea, please report it through GitHub issues.

Planned Future Development

I'm planning to make the pool size manageable so that it grows under pressure. Each thread would report its status (ready, working, or stuck) in shared memory.

Alternative approach

Let's say we have 4 processes trying to send data to 4 clients, and two of those clients are on a very slow internet connection. The two processes serving them will spend most of their time sleeping, waiting for network I/O.

There is a totally different approach to concurrency, like the one provided by libevent or libev (used by bjoern). Instead of multiple threads or processes we have an event-based, single-process, single-threaded server. A good setup is to run several bjoern instances on several ports and use nginx to load-balance across them.
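To give a feel for the event-based style, here is a small sketch that uses the standard library's asyncio as a stand-in for libev or bjoern (it is not their API). A single thread in a single process serves several clients, switching between them whenever one is waiting on the network. The echo protocol and the names handle and demo are made up for the example.

```python
import asyncio


async def handle(reader, writer):
    # Each connection is a coroutine; awaiting I/O yields to the others.
    data = await reader.readline()
    writer.write(b"echo: " + data)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def demo(clients=4):
    """Serve `clients` concurrent connections from one thread."""
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    async def client(i):
        reader, writer = await asyncio.open_connection("127.0.0.1", port)
        writer.write(b"client %d\n" % i)
        await writer.drain()
        reply = await reader.readline()
        writer.close()
        await writer.wait_closed()
        return reply

    try:
        # All clients run concurrently on the same event loop.
        return await asyncio.gather(*(client(i) for i in range(clients)))
    finally:
        server.close()
```

Running asyncio.run(demo()) returns one reply per client. A slow client costs nothing but a suspended coroutine, not a sleeping process.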

Maybe I should take a look at pyev to make an even better server.

