Process Pauses
Easy to ignore, but they will cost you a lot
This write-up is inspired by Chapter 8, “The Trouble with Distributed Systems”, of Martin Kleppmann's Designing Data-Intensive Applications.
In this article, we will discuss the impact of “process pauses” in a distributed system. Apart from unreliable networks and unreliable clocks, unbounded and unplanned pauses can make things go haywire.
Designing a distributed system is far more complicated than designing a single-machine data engine. On a single machine, many tools are available to help us deal with multi-threading. Because threads share memory, we need tools that prevent race conditions and make code thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues and so on. I still remember being asked to implement a lock-free linked list in my first job interview. I was clueless, lol.
I am sure you have already figured out the issue in distributed systems: there is no shared memory, so none of these tools are available!
We also have to understand that the meaning of “real-time” changes with the domain. In embedded systems, real-time means a hard guarantee that the system responds before a deadline. On the web, it usually just means that data is pushed or processed with low latency, without any hard guarantee.
Since there are so many moving parts in a distributed system, unforeseen and unbounded pauses in processing are inevitable. In any mission-critical system, this issue needs a lot of attention. If you are a distributed systems engineer at SpaceX, you have to make sure that what you design responds in time!
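To make the single-machine side concrete, here is a minimal sketch of a mutex protecting a shared counter. The names `SafeCounter` and `count_with_threads` are made up for illustration; the only real API used is Python's standard `threading` module.

```python
import threading


class SafeCounter:
    """A counter whose increments are protected by a mutex."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The read-modify-write below is a critical section:
        # only one thread at a time is allowed to execute it.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def count_with_threads(num_threads: int, increments_per_thread: int) -> int:
    """Increment one shared counter from several threads and return the total."""
    counter = SafeCounter()

    def worker() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

This works only because every thread can reach the same lock object in the same memory. Two nodes on a network have nothing equivalent to share.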
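One simple way to see such pauses is to let a process measure its own lateness. The sketch below sleeps for a fixed interval and records every wake-up that arrives much later than expected. The function name `detect_pauses` and its parameters are my own; the clock and sleep functions are injectable only so the behaviour can be tested without real waiting.

```python
import time
from typing import Callable, List


def detect_pauses(
    interval: float,
    threshold: float,
    iterations: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Return the overshoot (in seconds) of every wake-up that was late by
    more than `threshold`."""
    pauses: List[float] = []
    last = clock()
    for _ in range(iterations):
        sleep(interval)
        now = clock()
        # Time that passed beyond the sleep we asked for. A large value
        # means the process was not running for a while (GC, VM pause,
        # SIGSTOP, heavy scheduling delay, ...).
        overshoot = (now - last) - interval
        if overshoot > threshold:
            pauses.append(overshoot)
        last = now
    return pauses
```

A monotonic clock is used on purpose: the wall clock can jump when it is adjusted, which would look like a pause that never happened. Note that this only tells you about a pause after it is over, so it is a monitoring aid and not a fix.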
People tend to over-focus on the software they design and overlook the environment in which that software will operate.
A real-time guarantee can be provided only if the whole stack supports it: the operating system, the programming language and its runtime, the garbage collector, the virtualization layer (think VM migration) and so on.
Garbage Collection Tuning
What if the garbage collector stops the world to collect all those unreferenced objects? While that happens, every application thread on the node is paused. How do you handle this?
One idea is to treat a GC pause like a brief planned outage of the node and let other nodes take over. A warning message from a node saying “I am going to do GC now” lets the rest of the system stop sending it requests until it is done, which makes the distributed system more reliable.
Another idea is to use the garbage collector only for short-lived objects and to periodically do a planned restart of each node, so that a full GC of long-lived objects is never required.
These ideas can reduce the impact of process pauses, but they do not eliminate it.
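Here is a rough sketch of the “I am going to do GC now” idea. `Node` and `Router` are hypothetical classes written for this article, and CPython's `gc.collect()` stands in for whatever stop-the-world collector your runtime has. A real system would also need to drain in-flight requests and send the announcement over the network.

```python
import gc
from typing import List, Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.accepting_requests = True

    def handle(self, request: str) -> str:
        return f"{self.name} handled {request}"

    def planned_gc(self) -> int:
        """Announce the pause, collect, then rejoin.
        Returns the number of unreachable objects gc.collect() found."""
        self.accepting_requests = False  # "I am going to do GC now"
        try:
            return gc.collect()
        finally:
            self.accepting_requests = True  # back in rotation


class Router:
    """Round-robin router that skips nodes which announced a pause."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self._next = 0

    def route(self, request: str) -> Optional[str]:
        # Try each node at most once, starting from the round-robin cursor.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.accepting_requests:
                return node.handle(request)
        return None  # every node is currently paused
```

The design choice here is that the node itself decides when to pause, so the pause becomes a scheduled event that the router can plan around.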
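The restart half of this idea needs a little coordination so that you never restart too many nodes at once. Below is a minimal sketch of such a rolling-restart picker. `NodeStatus`, `next_node_to_restart` and the uptime threshold are all assumptions made for illustration, not part of any real orchestration tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    name: str
    uptime_seconds: float
    restarting: bool = False


def next_node_to_restart(
    nodes: List[NodeStatus],
    max_uptime_seconds: float,
    max_concurrent_restarts: int = 1,
) -> Optional[NodeStatus]:
    """Pick the node that should be restarted next, or None if no restart
    should start right now."""
    in_progress = sum(1 for n in nodes if n.restarting)
    if in_progress >= max_concurrent_restarts:
        # Keep capacity: wait for the running restart(s) to finish first.
        return None
    due = [
        n for n in nodes
        if not n.restarting and n.uptime_seconds >= max_uptime_seconds
    ]
    if not due:
        return None
    # Restart the node that has been up the longest, since it has had
    # the most time to accumulate long-lived objects.
    return max(due, key=lambda n: n.uptime_seconds)
```

You would call something like this from a periodic job and restart at most the node it returns.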
VM Migration
In a multi-tenant cloud setup, a VM can be live-migrated from one host to another, and it is paused while that happens. If this is not accounted for, the rest of the distributed system may see nodes that are unresponsive or that take an unbounded time to respond.
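From the outside, a migrating VM just looks like a node that went quiet. A common way to account for that is a heartbeat timeout. The sketch below is one possible shape for it: `HeartbeatMonitor` is a hypothetical class, and the timeout value is something you would have to tune for your own system.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Marks a node as suspected when no heartbeat arrived within `timeout`."""

    def __init__(
        self,
        timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node: str) -> None:
        # Called whenever a heartbeat message from `node` is received.
        self._last_seen[node] = self._clock()

    def suspected_nodes(self) -> List[str]:
        """Nodes that have been silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

A suspected node is not necessarily dead. It may simply be paused and come back a moment later, so whatever you do on suspicion should be safe even if the node resumes.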
SIGSTOP
There can be various reasons for a system admin to suspend your job on a node, for example by sending it SIGSTOP. The process is not killed; it is frozen until it receives SIGCONT. Detecting and dealing with a process pause like this might require some serious effort.
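If you want to see this kind of pause for yourself, the sketch below freezes a child process with SIGSTOP, confirms that the operating system reports it as stopped, and then resumes it with SIGCONT. It assumes a POSIX system (Linux or macOS), and the function name `stop_and_resume_child` is made up for this article.

```python
import os
import signal
import subprocess
import sys


def stop_and_resume_child() -> bool:
    """Start a child, freeze it with SIGSTOP, check that it really stopped,
    then resume it. Returns True if the child was reported as stopped."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        # SIGSTOP cannot be caught or ignored: the child gets no chance
        # to finish what it was doing or to warn anyone.
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid also report children that have stopped,
        # not only children that have exited.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)
        # The child continues exactly where it left off, unaware of
        # how much time has passed.
        os.kill(child.pid, signal.SIGCONT)
        return was_stopped
    finally:
        # Clean up the demo child. SIGKILL works even on a stopped process.
        child.kill()
        child.wait()
```

The interesting part is what the child does not see: from its point of view nothing happened, which is exactly why code that assumes “only a few milliseconds passed since my last check” is dangerous.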
Final Thoughts
All in all, I would say that designing a system that guarantees real-time performance is extremely costly and hard to achieve. In most cases, your software will be considered fit for purpose if it is highly performant rather than truly real-time.
Please feel free to reach out to me if you need help with data science or data engineering at your organisation. I would be more than happy to help.
I hope you enjoyed this read. If so, please share this article with your friends, and let's build a solid community. I will be back next week with another well-researched article delivered straight to your inbox. See you!