
Fundamental incompatibility of Threading and Multiprocessing (Python) #230

@martinpakosch


Hi @knipknap, hi community,

Over the last few days - after switching some old environments from Python 2.7 with Exscript 2.6.3/2.6.22 to Python 3.11 with Exscript 2.6.28 - we have encountered some very strange and unpredictable deadlocks while using the exscript script to run some Exscript templates. First we thought these issues were related to our new environment, then perhaps to some bugs in the locks handled by Exscript, but finally we found out that this is a fundamental issue in Python! More about this later...

Exscript is designed internally to rely on threading for some of its major functionality, like the MainLoop in the WorkQueue, jobs, and some pipeline-related code. See https://github.com/search?q=repo%3Aknipknap%2Fexscript%20threading.Thread&type=code. Additionally, around 12 years ago, support for multiprocessing was introduced (e.g., e43f6b7, 0baf858). From that point on, Exscript by default behaved differently for the two major use cases:

  1. When using the exscript script to simply execute Exscript templates, Exscript started in multiprocessing mode. This means essentially:
    a. one main process
    b. many internal threads
    c. one multiprocessing.Process for each job (=host)
  2. When using the Exscript API from Python for more complex scenarios, everything was pinned to threading mode by default, which was good, because then we essentially get:
    a. one main process
    b. many internal threads
    c. one threading.Thread for each job (=host)
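
For illustration, use case 2 boils down to the standard one-thread-per-job pattern; a minimal, self-contained sketch (the host names here are made up, and the real Exscript job machinery is of course far richer):

```python
import threading
import queue

results = queue.Queue()  # thread-safe; all threads share the parent's memory

def run_job(host):
    # Stand-in for connecting to a host and executing a template.
    results.put((host, "done"))

threads = [threading.Thread(target=run_job, args=(h,))
           for h in ("router1", "router2")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.queue))  # [('router1', 'done'), ('router2', 'done')]
```

Because threads share the parent's address space, no pickling of job state is needed, which is exactly why this mode sidesteps the problems described below.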

So now you might wonder why all this is a big deal. ;-) Well, the problem is that - as the latest changes in Python 3.12 made us realize - the combination of multiprocessing and threading is not safe/stable on any POSIX system, and is not considered stable by the CPython implementation either! Rather than repeating everything I found out about this fundamental issue in Python, I would like to share the articles/posts I found - they should explain the problem:

  1. https://pythonspeed.com/articles/faster-multiprocessing-pickle/
  2. https://pythonspeed.com/articles/python-multiprocessing/
  3. https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
  4. https://docs.python.org/3/library/os.html#os.fork
  5. https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2
  6. multiprocessing's default POSIX start method 'fork' is broken; change to 'forkserver' or 'spawn' (python/cpython#84559)
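
For reference, the available start methods and the safer per-context opt-in can be inspected directly with the standard library; a minimal sketch:

```python
import multiprocessing as mp

# Which start methods this platform offers, and which one is the default.
print(mp.get_all_start_methods())  # e.g. ['fork', 'spawn', 'forkserver'] on Linux
print(mp.get_start_method())       # 'fork' on Linux up to Python 3.13

# Code can opt into a safer method locally via a context object, without
# touching the process-wide default that set_start_method() would change:
ctx = mp.get_context("spawn")
print(ctx.get_start_method())      # 'spawn'
```

The context object (`ctx.Process`, `ctx.Queue`, ...) mirrors the module-level API, so a library could in principle switch methods without affecting its callers.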

Quick solution:
Now, coming back to our issue: we were able to eliminate the deadlocks by changing the default mode in the exscript script from multiprocessing to threading.
See:

mode = 'multiprocessing',   # old default used by the exscript script
mode = 'threading',         # our quick fix

If most of you have not encountered such deadlocks in use case 1 until now, then you may have been as lucky as we were for the past 10 years running our older environments! But that does not change anything about the fundamental issue here.

Long-term solution:
The change above affects only the exscript script and should be good enough as a quick fix. In the long term, @knipknap @bigmars86, you should think about a fundamental fix - if that is possible at all - or about dropping multiprocessing support from Exscript. What I have found out so far regarding possible solutions:

  1. The default start method for new processes in Python on Linux is fork, relying on os.fork(), which creates a copy of the parent process carrying over almost(!) all state/data - see the articles above explaining everything. This default is being dropped in Python 3.14 in favor of the safer methods. But spawn and forkserver create clean child processes and fully rely on pickling to move data down to these children - a problem if Exscript fundamentally relies on full process copies.
  2. I tried this in the exscript script with the current code base according to https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_start_method, with both alternative methods, and failed. Exscript is then unable to pickle some local-scope objects - among possibly other issues, as I stopped digging here... Example:
Traceback (most recent call last):
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
    self.child.start(to_self)
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 94, in start
    base.start(self)
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_make_process_class.<locals>.process_cls'

# After "fixing" this by creating static public custom classes for Process and Thread in Exscript/workqueue/job.py,
# I encountered the next one... stopped digging here. ;-)

Traceback (most recent call last):
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
    self.child.start(to_self)
  File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 152, in start
    super(Process, self).start()
  File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object '_prepare_connection.<locals>._wrapped'
  3. For me this is beyond my skills - I am happy to have understood this stuff up to this point, to be honest. :D So I assume that fixing this - basically making Exscript compatible with the spawn or forkserver method for multiprocessing - will take some bigger effort, unless you already know where to get your hands dirty.
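
Both tracebacks come down to the same limitation: spawn and forkserver must pickle the process object to hand it to the child, and pickle can only reference classes and functions by their module-level qualified name, never objects defined inside a function. A minimal, self-contained illustration (the names here are hypothetical stand-ins, not Exscript's actual code):

```python
import pickle

def _make_process_class():
    # Mimics a factory that defines a class in local scope, like the
    # process_cls created inside Exscript/workqueue/job.py.
    class LocalProcess:
        pass
    return LocalProcess

class ModuleLevelProcess:
    # Defined at module level, so pickle can find it by qualified name.
    pass

try:
    pickle.dumps(_make_process_class()())
except (AttributeError, pickle.PicklingError) as exc:
    # Fails with: Can't pickle local object
    # '_make_process_class.<locals>.LocalProcess'
    print("cannot pickle:", exc)

blob = pickle.dumps(ModuleLevelProcess())  # works: class is importable by name
```

This is presumably why moving the Process/Thread classes to module level got past the first traceback; the second one suggests the same pattern recurs elsewhere (e.g. a locally defined `_wrapped` callable), so the fix would need to be applied consistently across the code base.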

That's from my side. Hope this helps in any way.
Cheers, Martin
