Performance and scalability

Pythomnic3k applications performance and scalability

Pythomnic3k is written in Python 3.0, and the question of performance is one of the first to arise whenever Python is mentioned.

Pythomnic3k application performance ballpark figure

Under 100 requests per second when doing any processing worth mentioning. A distributed Pythomnic3k application scales well horizontally. This sort of scalability is easy, fast and cheap, but it is neither free nor infinite: losses on communication between components prevent linear growth. YMMV.

Performance issues in Python

Raw performance measured in something per second is indeed not the strong side of Python, but not because "it is an interpreter", as is frequently (and wrongly) presumed. As far as being an interpreter goes, Python is no more an interpreter than, say, Java. Performance problems in Python in fact stem from two things:

  1. The downside of the extreme language flexibility: everything is reevaluated at runtime. Simply put, in Python everything is virtual, and every access to anything requires a lookup at runtime.
  2. The global interpreter lock (GIL), which any running thread has to acquire on many occasions. Such a lock inhibits effective parallel thread execution even though native OS threads may be used.

The first problem you can't fight: it is also the language's key feature, and if you dispose of it, you lose Python as a language. Buying a faster CPU is, as usual, the best way to deal with it, and it apparently results in linear vertical scaling.

The second problem is worse. The GIL is a subject of constant debate, but it is still there in Python 3.0 and it is going to stay. Its effect is essentially that multiple threads have to resynchronize very frequently. For simplicity, one may think of Python threads as being scheduled by the language runtime, with only one thread at a time having the CPU. The exception to this rule is that multiple threads can be waiting for I/O or network completion, or anywhere else "outside the language". So a Python thread runs for a while, and then either it is interrupted by the language runtime and forced to yield to another Python thread, or it hits an I/O operation and the OS scheduler puts it to sleep and wakes another thread.
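The difference between threads competing for the interpreter and threads waiting "outside the language" can be observed with a small experiment. The following is a minimal sketch, assuming a standard CPython build with the GIL; the function names `cpu_task`, `io_task` and `timed_in_threads` are made up for illustration and are not part of Pythomnic3k.

```python
import threading
import time


def cpu_task(n=200_000):
    """Pure Python computation: holds the GIL while it runs."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def io_task(delay=0.2):
    """Stand-in for blocking I/O: time.sleep releases the GIL while waiting."""
    time.sleep(delay)


def timed_in_threads(target, count):
    """Run `target` in `count` threads and return the wall-clock time taken."""
    threads = [threading.Thread(target=target) for _ in range(count)]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - started


if __name__ == "__main__":
    # CPU-bound threads take turns holding the GIL, so the total time is
    # roughly the sum of the individual run times.
    print("4 CPU-bound threads: %.3f s" % timed_in_threads(cpu_task, 4))
    # Blocked threads wait concurrently, so the total time stays close to
    # a single delay rather than four of them.
    print("4 blocking threads:  %.3f s" % timed_in_threads(io_task, 4))
```

The exact timings depend on the machine, so none are quoted here; the point is only that the blocking threads overlap while the computing threads do not.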

The performance of a multithreaded Python application is still acceptable if the number of threads is not too high, a few dozen at most. The typical approach of allocating a thread for every request or session does not work in Python. As the number of threads grows, the overhead becomes unacceptably high.

One may discard the notion of threads altogether by using a different approach to parallel execution, but this complicates matters for the developer, who will have to organize the application code in curious ways. Pythomnic3k is designed to be simple, which among other things means that implementing request processing should also be simple for the developer. Pythomnic3k maintains the illusion of a single thread executing the request processing application code and therefore needs the language-supported threads.

Multithreading in Pythomnic3k

Pythomnic3k is a multithreaded application with a simple and strict threading policy. In essence, there are two kinds of threads in Pythomnic3k:

  1. Interface threads. Each interface has its own separate thread, which spends most of its time in a blocking call to a library or the OS kernel, accepting incoming requests or polling periodically.
  2. Worker threads. These threads are allocated at runtime from fixed size thread pools as necessary. As the load grows, more threads get allocated. Once a pool reaches its maximum size, no more threads are allocated even if the load keeps growing, and the work gets queued up instead. If the load drops, the allocated threads are released after a short while, returning the application to the idle state.

Note the following consequences:

  • An idle Pythomnic3k cage consumes very few resources, because all the thread pools are empty and the interface threads are blocked in the kernel.
  • The number of concurrent requests that a cage can be processing at any single time is limited by the maximum thread pool size (thread_count in config_interfaces.py).
  • Any number of concurrent requests can be accepted, but as their number grows above the thread pool size, they are queued up pending processing, and cage response time increases accordingly.
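The effect of a bounded pool on concurrency can be illustrated with the standard library. This is only a sketch under stated assumptions: `concurrent.futures.ThreadPoolExecutor` stands in for Pythomnic3k's own thread pools (it is not what the framework uses, and unlike the pools described above it does not release idle threads), and `THREAD_COUNT`, `simulate` and `handle` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THREAD_COUNT = 3  # hypothetical stand-in for thread_count


def simulate(requests, thread_count=THREAD_COUNT, work_seconds=0.05):
    """Submit `requests` jobs to a bounded pool and record peak concurrency."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def handle(request_id):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(work_seconds)  # pretend to process the request
        with lock:
            running -= 1
        return request_id

    # All requests are accepted at once; those beyond the pool size wait
    # in the executor's internal queue until a worker thread is free.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        results = list(pool.map(handle, range(requests)))
    return results, peak


if __name__ == "__main__":
    results, peak = simulate(10)
    print("processed:", len(results), "peak concurrency:", peak)
```

With ten requests and a pool of three, every request is eventually processed, but never more than three at a time, so the later requests wait in the queue and their response time grows.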

Pythomnic3k application scalability

Just as you would not expect a monolithic multithreaded Python application to perform well under extreme load, you should not expect it of any single Pythomnic3k cage. Pythomnic3k cages do not scale vertically on an individual basis: no matter how fast a CPU you have, it will not be used effectively.

A Pythomnic3k application is all about horizontal scaling. Applications in Pythomnic3k consist of multiple cages, and even of multiple instances of the same cage. To push the performance to the limit and improve hardware utilization, you partition the application into multiple cages. Each cage is a separate OS process, so its threads are scheduled independently and don't suffer the excessive overhead.



Interestingly, this approach also favors application redundancy and hence robustness and fault-tolerance.
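The idea of spreading work over separate OS processes, each with its own interpreter and its own GIL, can be sketched with the standard library alone. This is not how cages are actually started or how they communicate; `CAGE_SCRIPT` and `run_cages` are hypothetical names, and the tiny script merely plays the role of a cage doing some computation.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a cage: a tiny script that does some CPU-bound
# work and reports its own process id along with the result.
CAGE_SCRIPT = "import os; print(os.getpid(), sum(i * i for i in range(200000)))"


def run_cages(instances):
    """Start `instances` independent Python processes and collect their output."""
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", CAGE_SCRIPT],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(instances)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # wait for the process and read its output
        pid, value = out.split()
        results.append((int(pid), int(value)))
    return results


if __name__ == "__main__":
    print("parent process:", os.getpid())
    for pid, value in run_cages(3):
        print("process", pid, "computed", value)
```

Each child runs in its own process, so the OS can schedule the children on different CPU cores, and the failure of one does not take down the others.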

Performance related configuration

The most crucial parameter is thread_count in the cage's config_interfaces.py. It specifies the default maximum number of threads in any thread pool and is 10 by default. With that default value, the parameter has the following effects:

  1. No more than 10 requests can be processed concurrently by a cage. More incoming requests will be queued. This is because all request processing is performed by a single pool of threads, called the "interfaces thread pool". It does not matter which interface introduced the request; they all use the same pool.
  2. Incoming RPC calls arrive through the "rpc" interface, therefore they contend with the requests arriving from "regular" interfaces, and there can't be more than 10 running at a time.
  3. No more than 10 outgoing RPC calls can be simultaneously initiated by a cage. More outgoing requests will be queued. This is because all outgoing RPC calls are performed by another thread pool.
  4. Similarly, no more than 10 simultaneous outgoing requests to any resource can be initiated by a cage. More outgoing requests will be queued.

Note that threads are allocated only on demand and released when they are no longer used, so you can generally set this parameter higher without losing anything.