RuntimeError with multiprocessing module when trying to recursively compare lists


I'm generating a list filled with sublists of randomly generated 0s and 1s, and then trying to efficiently compare each sublist with every other sublist to determine their similarity.

I know that my code works with a single process (i.e. without involving multiprocessing), but once I start involving multiprocessing.Pool() or multiprocessing.Process(), everything starts to break.

I want to compare how long a single process takes against multiple processes. I've tried this with threading, but a single process actually ended up taking less time, probably due to the Global Interpreter Lock.

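For reference, the threading attempt mentioned above would look roughly like the sketch below (the thread count and chunking are assumptions, and it reuses the names defined in the code that follows). Because SequenceMatcher.ratio() is CPU-bound pure-Python work, the GIL lets only one such thread execute bytecode at a time, which is consistent with the single process being faster.

import threading

def start_with_threads():
    # One thread per chunk (assumed split); all threads share the GIL,
    # so the CPU-bound difflib work is effectively serialized.
    threads = [
        threading.Thread(target=get_similarity_value, args=(random_lists, chunk))
        for chunk in random_lists_split
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()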

Here's my code:

import difflib
import secrets
import timeit
import multiprocessing
import numpy

# 500 random 0/1 sequences; split into 5 chunks so each worker gets one chunk.
random_lists = [[secrets.randbelow(2) for _ in range(500)] for _ in range(500)]
random_lists_split = numpy.array_split(numpy.array(random_lists), 5)


def get_similarity_value(lists_to_check, sublists_to_check) -> list:
    ratios = []
    matcher = difflib.SequenceMatcher()
    for sublist_major in sublists_to_check:
        try:
            sublist_major = sublist_major.tolist()  # numpy rows arrive as arrays; compare as plain lists
        except AttributeError:
            pass
        for sublist_minor in lists_to_check:
            pair = [lists_to_check.index(sublist_major), lists_to_check.index(sublist_minor)]
            scored_pairs = [ratio[1] for ratio in ratios]
            # Skip identical sublists and index pairs already scored, in either order.
            # or lists_to_check.index(sublist_major.tolist()) > lists_to_check.index(sublist_minor):
            if sublist_major == sublist_minor or pair in scored_pairs or pair[::-1] in scored_pairs:
                continue
            matcher.set_seqs(sublist_major, sublist_minor)
            ratios.append([matcher.ratio(), sorted(pair)])
    return ratios


def start():
    test = multiprocessing.Pool(4)
    # Pair the full list with each chunk so every worker scores its chunk against all lists.
    data = [(random_lists, random_lists_split[i]) for i in range(len(random_lists_split))]
    print(test.map(get_similarity_value, data))


statement = timeit.Timer(start)
print(statement.timeit(1))

statement2 = timeit.Timer(lambda: get_similarity_value(random_lists, random_lists))
print(statement2.timeit(1))

And here's the error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "timings.py", line 38, in <module>
    print(statement.timeit(1))
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\timeit.py", line 178, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "timings.py", line 32, in start
    test = multiprocessing.Pool(4)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\ProgramData\Anaconda3\envs\Computing Coursework\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
is not going to be frozen to produce an executable.''')
RuntimeError: 
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

N.B. I have tried using multiprocessing.freeze_support() but it results in the same error. The code also seems to be attempting to run indefinitely, as the error appears over and over again.

Thanks!

1 Solution

#1


The problem is that your top-level code (including the code that creates the child processes) is not protected from being run in the child processes.

As the docs explain, if you're not using the fork start method (and since you're on Windows, you're not):

Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).

In fact, it's nearly identical to the example that follows that warning. You're launching a whole pool of children instead of just one, but it's the same problem. Every child in the pool tries to launch a new pool, and, fortunately, multiprocessing figures out that something bad is going on and fails with a RuntimeError instead of exponentially spawning processes until Windows refuses to spawn anymore or its scheduler just falls down.

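To see the mechanism in isolation, here is a minimal sketch (the file name and the work function are hypothetical):

# spawn_demo.py (hypothetical file name)
import multiprocessing

def work(x):
    return x * 2

# An unguarded top-level multiprocessing.Pool(...) call would run again in
# every worker: under the spawn start method (the Windows default), each
# child re-imports this module, reaches the Pool() call, and tries to start
# its own children, which multiprocessing aborts with the RuntimeError above.

if __name__ == '__main__':
    # In the re-imported children, __name__ is '__mp_main__' (visible in the
    # traceback above), so this block only runs in the parent process.
    with multiprocessing.Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))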

As the docs say:

Instead one should protect the “entry point” of the program by using if __name__ == '__main__':

In your case, that means this part:

if __name__ == '__main__':
    statement = timeit.Timer(start)
    print(statement.timeit(1))

    statement2 = timeit.Timer(lambda: get_similarity_value(random_lists, random_lists))
    print(statement2.timeit(1))
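One further caveat, beyond the RuntimeError itself: Pool.map passes each element of data to the function as a single argument, so the two-parameter get_similarity_value would likely still fail with a TypeError once the guard is in place. Pool.starmap unpacks each tuple into separate arguments. A sketch of start() rewritten that way, assuming the rest of the code is unchanged:

def start():
    data = [(random_lists, chunk) for chunk in random_lists_split]
    # starmap unpacks each (lists, sublists) tuple into the two positional
    # arguments of get_similarity_value; plain map would pass the whole
    # tuple as one argument and raise a TypeError.
    with multiprocessing.Pool(4) as pool:
        print(pool.starmap(get_similarity_value, data))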