2023-02-18

Dask LocalCluster Fails to compute random.random above 300Mio data points

I wanted to create some random data for later benchmarking. The chunks need to be configured this way as I want to calculate the rfft later.

However, the sampling of the random data fails as soon as I am around (and above) 300 million data points. The code works fine in local mode. The code works fine when I store the samples directly into a zarr array. The size at which the code breaks is consistent across multiple shapes and chunk sizes. It also does not depend on initialising the cluster with different values.

Following is an minimal example producing the error, please be advised, that the code is working with an array of size=(60, 4_000_000). However, using the slightly bigger array, leads to error.

cluster = dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='30GB')
client = dd.Client(cluster)
# print(client)

RNG_da = da.random.RandomState(42)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()

client.close()
cluster.close()

The same error occurs using LocalCluster() without parameters:

cluster = dd.LocalCluster() 
client = dd.Client(cluster)

RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
client.close()
cluster.close()

However, not specifying or only using the Client works. So all of the versions below work:

RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
client = dd.Client(processes=False)
RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
with dask.config.set(scheduler='processes'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)
with dask.config.set(scheduler='threads'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)
with dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='15GiB') as cluster, dd.Client(cluster) as client:
        RNG_da = da.random.RandomState(1212)
        _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).persist()
        print(_.shape)

Can it have something to do, with calling the sampling in a multi-processes environment, since using client = Client(process=True) results in this [...]return self.socket.recv_into(buf, len(buf)) OSError: [Errno 22] Invalid argument error.


Here is the error trace, however, I interrupted the program, since it usually runs super long...:

<Client: 'tcp://127.0.0.1:53084' processes=1 threads=10, memory=27.94 GiB>
2023-02-11 18:54:44,007 - distributed.scheduler - ERROR - Couldn't gather keys {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": [‘tcp://127.0.0.1:53089'],

[...]

"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ['tcp://127.0.0.1:53089'], "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ['tcp://127.0.0.1:53089']} state: ['memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory'] workers: ['tcp://127.0.0.1:53089']
NoneType: None
2023-02-11 18:54:44,007 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:53089 -> None
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 973, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 1146, in write_to_fd
    return self.socket.send(data)  # type: ignore
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/worker.py", line 1768, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.\_\_class\_\_.\_\_name\_\_}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:53089 remote=tcp://127.0.0.1:53096>: ConnectionResetError: [Errno 54] Connection reset by peer
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)
NoneType: None
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)

[...]

NoneType: None
2023-02-11 18:54:44,011 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)
NoneType: None
2023-02-11 18:54:44,013 - distributed.client - WARNING - Couldn't gather 21 keys, rescheduling {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 5, 0)": (‘tcp://127.0.0.1:53089',),

[...]
"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ('tcp://127.0.0.1:53089',)}

^C


No comments:

Post a Comment