Hi @knipknap, hi community,
Over the last few days, after switching some old environments from Python 2.7 with Exscript 2.6.3/2.6.22 to Python 3.11 with Exscript 2.6.28, we have encountered some very strange and unpredictable deadlocks while using the `exscript` script to run Exscript templates. At first we thought these issues were related to our new environment, then to bugs in the locks handled by Exscript, but in the end we found out that this is a fundamental issue in Python! More about this later...
Exscript is designed internally to rely on threading for some of its major functionality, like the MainLoop in the WorkQueue, Jobs and some pipeline-related stuff. See https://github.com/search?q=repo%3Aknipknap%2Fexscript%20threading.Thread&type=code. Additionally, around 12 years ago, support for multiprocessing was introduced (e.g., e43f6b7, 0baf858). From that point on, Exscript by default behaved differently for the two major use cases:
1. When using the `exscript` script to just execute Exscript templates, Exscript was started in `multiprocessing` mode. This means essentially:
   a. one main process
   b. many internal threads
   c. one multiprocessing.Process for each job (=host)
2. When using the Exscript API in Python for more complex scenarios, everything was pinned to `threading` mode by default, which was good, because then we get essentially:
   a. one main process
   b. many internal threads
   c. one threading.Thread for each job (=host)
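As a rough illustration of the difference between the two modes, here is a minimal standard-library sketch. It is not Exscript's actual implementation: `make_worker` and `run_jobs` are hypothetical helpers, and the only assumption is that each job (=host) gets exactly one worker object of the type named by the mode.

```python
import multiprocessing
import threading


def make_worker(mode, target, args=()):
    """Return an unstarted worker for a single job (hypothetical helper)."""
    if mode == "threading":
        # Use case 2: the job runs as a thread inside the main process.
        return threading.Thread(target=target, args=args)
    if mode == "multiprocessing":
        # Use case 1: the job runs in its own child process.
        return multiprocessing.Process(target=target, args=args)
    raise ValueError(f"unknown mode: {mode!r}")


def run_jobs(mode, hosts, job):
    """Start one worker per host and wait for all of them to finish."""
    workers = [make_worker(mode, job, (host,)) for host in hosts]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

In `threading` mode all jobs share the memory of the main process. In `multiprocessing` mode each job gets a separate process, created by whatever start method is the platform default.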
So, now you might wonder why all this is a big deal. ;-) Well, the problem is that, in light of the latest changes in Python 3.12, we found out that the combination of fork-based multiprocessing and threading is not safe/stable on any POSIX system and is also not considered safe by the CPython developers! To avoid repeating everything I found out about this fundamental issue in Python, I would like to share the articles/posts that I found; they should explain the problem:
- https://pythonspeed.com/articles/faster-multiprocessing-pickle/
- https://pythonspeed.com/articles/python-multiprocessing/
- https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
- https://docs.python.org/3/library/os.html#os.fork
- https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2
- multiprocessing's default posix start method of `'fork'` is broken: change to `'forkserver' || 'spawn'` (python/cpython#84559)
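To make the hazard described in these links concrete, here is a small standalone sketch (POSIX only, not Exscript code). The assumption is the classic failure pattern: a helper thread holds a lock at the moment the main thread calls `os.fork()`. The child receives a copy of the locked lock, but not the thread that would release it. The function name `child_can_acquire_after_fork` is made up for this example, and a timeout is used so that the child reports the problem instead of hanging forever.

```python
import os
import threading
import warnings

lock = threading.Lock()


def hold_lock(ready, release):
    """Helper thread: take the lock and keep it until told to release."""
    with lock:
        ready.set()
        release.wait()


def child_can_acquire_after_fork(timeout=1.0):
    """Fork while another thread holds `lock`; report whether the child can take it."""
    ready = threading.Event()
    release = threading.Event()
    holder = threading.Thread(target=hold_lock, args=(ready, release))
    holder.start()
    ready.wait()  # from here on the helper thread owns the lock
    try:
        with warnings.catch_warnings():
            # Python 3.12+ warns about fork() in a multi-threaded process.
            warnings.simplefilter("ignore", DeprecationWarning)
            pid = os.fork()
        if pid == 0:
            # Child: only the forking thread exists here, so nobody will
            # ever release the copied lock. Without the timeout this would
            # block forever, which is the deadlock.
            acquired = lock.acquire(timeout=timeout)
            os._exit(0 if acquired else 1)
        _, status = os.waitpid(pid, 0)
        return os.waitstatus_to_exitcode(status) == 0
    finally:
        release.set()
        holder.join()
```

On Linux with CPython, `child_can_acquire_after_fork()` is expected to return `False`. Real code usually takes such locks without a timeout (logging handlers, queues, library internals), so the child simply hangs.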
Quick solution:
Now, coming back to our issue: we were able to eliminate the deadlocks by changing the default mode in the `exscript` script to threading.
See line 182 in 9d5b035:
`mode = 'multiprocessing',`
If most of you have not encountered such deadlocks in use case 1 until now, then you may simply have been lucky, as we were for the past 10 years running our older environments! But that does not change anything about the fundamental issue here.
Long-term solution:
The change above affects only the `exscript` script and should be good enough as a quick fix. In the long term, @knipknap @bigmars86, you should think about a fundamental fix, if that is possible at all, or about dropping multiprocessing support in Exscript. What I have found out so far regarding possible solutions:
- The default start method for new processes in Python on Linux is `fork`, relying on `os.fork()`, which creates a copy of the parent process with almost(!) all state/data (only the forking thread survives in the child); see the articles above explaining everything. `fork` will be dropped as the default in Python 3.14, making `forkserver` the new default on Linux. But `spawn` and `forkserver` create clean child processes and fully rely on e.g. pickling to move data down to these children, which is a problem in case Exscript fundamentally relies on full process copies.
- I tried to do this in the `exscript` script with the current code base, following https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_start_method, with both alternative methods, and failed. Exscript is then unable to pickle some local-scope objects, among possibly other issues, as I stopped digging here... Example:
Traceback (most recent call last):
File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
self.child.start(to_self)
File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 94, in start
base.start(self)
File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
reduction.dump(process_obj, buf)
File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_make_process_class.<locals>.process_cls'
# After "fixing" this by creating static public custom classes for Process and Thread in Exscript/workqueue/job.py,
# I encountered the next one... stopped digging here. ;-)
Traceback (most recent call last):
File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
self.child.start(to_self)
File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 152, in start
super(Process, self).start()
File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
reduction.dump(process_obj, buf)
File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_prepare_connection.<locals>._wrapped'
- For me this is beyond my skills; I am happy to have understood this stuff up to this point, to be honest. :D So I assume that fixing this, basically making Exscript compatible with the `spawn` or `forkserver` start method for multiprocessing, will take some bigger effort, unless you already know where to get your hands dirty.
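For anyone who wants to reproduce the kind of pickling failure shown in the tracebacks without Exscript, here is a minimal sketch. All names (`make_local_job_class`, `ModuleLevelJob`, `is_picklable`) are hypothetical and only mimic the pattern: a class created inside a function cannot be pickled by reference, while an equivalent module-level class can. `spawn` and `forkserver` need to pickle the process object, so they can only work with the second kind.

```python
import pickle


def make_local_job_class():
    """Build a class inside a function, similar in spirit to a class factory."""
    class LocalJob:
        pass
    return LocalJob


class ModuleLevelJob:
    """Same idea, but defined at module level so pickle can find it by name."""
    def __init__(self, host):
        self.host = host


def is_picklable(obj):
    """Return True if pickle.dumps() accepts the object."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exception type for local objects differs between Python versions.
        return False
```

Checking `is_picklable(make_local_job_class()())` against `is_picklable(ModuleLevelJob("host1"))` shows the difference. The same restriction applies to nested functions and closures, which is presumably what the second traceback is about.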
That's it from my side. Hope this helps in any way.
Cheers, Martin