
Migrate cuml exploration notebook #153

Open
jacobtomlinson wants to merge 5 commits into main

Conversation

@jacobtomlinson (Member) commented Feb 15, 2023

Leverages the docref admonition added in #152. Reviewers may want to look just at 756b9cf, or view the notebook on its own in ReviewNB.

Closes rapidsai/cloud-ml-examples#207

The notebook seems to be reasonably useful. I updated the instructions from Dask Kubernetes Classic to the Dask Kubernetes Operator and added some GKE instructions to get folks going. I've also added more prose and tweaked headings to make the navigation look sensible.

However, the performance sweeps seem to fail with a StopIteration exception, and I'm a little out of my depth in debugging it.

Full traceback
Starting weak-scaling performance sweep for:
 model      : <class 'cuml.dask.ensemble.randomforestregressor.RandomForestRegressor'>
 data loader: <function <lambda> at 0x7fb02fac8f70>.
Configuration
==========================
Worker counts             : [8]
Fit/Predict samples       : 5
Data load samples         : 1
- Max data fraction       : 1.0
Model fit                 : X ~ y
- Response DType          : <class 'numpy.int32'>
Writing results to        : ./taxi_large_random_forest_regression.csv
- Method                  : append

Sampling <1> load times with 8 workers.

100%|██████████| 1/1 [00:00<00:00, 11522.81it/s]

Finished loading <1>, samples, to <8> workers with a mean time of 0.0000 sec.
Sweeping <class 'cuml.dask.ensemble.randomforestregressor.RandomForestRegressor'> 'fit' with <8> workers. Sampling <5> times.


  0%|          | 0/5 [00:06<?, ?it/s]

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/part_utils.py in _extract_partitions(dask_obj, client)
    161 
--> 162     raise gen.Return([(first(who_has[key]), part)
    163                       for key, part in key_to_part])

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/part_utils.py in <listcomp>(.0)
    161 
--> 162     raise gen.Return([(first(who_has[key]), part)
    163                       for key, part in key_to_part])

/opt/conda/envs/rapids/lib/python3.9/site-packages/toolz/itertoolz.py in first(seq)
    374     """
--> 375     return next(iter(seq))
    376 

StopIteration: 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_280/2978372879.py in <module>
      8 rf_csv_path = f"./{out_prefix}_random_forest_regression.csv"
      9 
---> 10 performance_sweep(client=client, model=RandomForestRegressor,
     11                 **sweep_kwargs,
     12                 out_path=rf_csv_path,

/tmp/ipykernel_280/1927266979.py in performance_sweep(client, model, data_loader, hardware_type, worker_counts, samples, load_samples, max_data_frac, predict_frac, scaling_type, xy_fit, fit_requires_compute, update_workers_in_kwargs, response_dtype, out_path, append_to_existing, model_name, fit_func_id, predict_func_id, scaling_denom, model_args, model_kwargs)
    186         m = model(*model_args, **model_kwargs)
    187         if (fit_func_id):
--> 188             fit_timings = sweep_fit_func(model=m, func_id=fit_func_id,
    189                                              require_compute=fit_requires_compute,
    190                                              X=X, y=y, xy_fit=xy_fit, count=samples)

/tmp/ipykernel_280/1927266979.py in sweep_fit_func(model, func_id, require_compute, X, y, xy_fit, count)
     49             fit_func = partial(_fit_func_attr, X)
     50 
---> 51     return collect_func_time_samples(func=fit_func, count=count)
     52 
     53 

/tmp/ipykernel_280/1927266979.py in collect_func_time_samples(func, count, verbose)
     30     for k in tqdm(range(count)):
     31         with SimpleTimer() as timer:
---> 32             func()
     33         timings.append(timer.elapsed)
     34 

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/ensemble/randomforestregressor.py in fit(self, X, y, convert_dtype, broadcast_data)
    251         """
    252         self.internal_model = None
--> 253         self._fit(model=self.rfs,
    254                   dataset=(X, y),
    255                   convert_dtype=convert_dtype,

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/ensemble/base.py in _fit(self, model, dataset, convert_dtype, broadcast_data)
     99 
    100     def _fit(self, model, dataset, convert_dtype, broadcast_data):
--> 101         data = DistributedDataHandler.create(dataset, client=self.client)
    102         self.active_workers = data.workers
    103         self.datatype = data.datatype

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/input_utils.py in create(cls, data, client)
    103         datatype, multiple = _get_datatype_from_inputs(data)
    104 
--> 105         gpu_futures = client.sync(_extract_partitions, data, client)
    106 
    107         workers = tuple(set(map(lambda x: x[0], gpu_futures)))

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337             return future
    338         else:
--> 339             return sync(
    340                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341             )

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    404     if error:
    405         typ, exc, tb = error
--> 406         raise exc.with_traceback(tb)
    407     else:
    408         return result

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in f()
    377                 future = asyncio.wait_for(future, callback_timeout)
    378             future = asyncio.ensure_future(future)
--> 379             result = yield future
    380         except Exception:
    381             error = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py in run(self)
    773                             exc_info = None
    774                     else:
--> 775                         yielded = self.gen.send(value)
    776 
    777                 except (StopIteration, Return) as e:

RuntimeError: generator raised StopIteration
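For context on the RuntimeError wrapper at the bottom: since PEP 479 (the default from Python 3.7 onward), a StopIteration that escapes a generator body is converted into a RuntimeError with the StopIteration attached as its direct cause, which matches the chained traceback above. Here is a minimal sketch reproducing the shape of the failure; the `who_has`/`key_to_part` names mirror cuml's `_extract_partitions`, and the empty `who_has` entry (i.e. the scheduler reporting no worker as holding a key) is my guess at the trigger, not something I've verified on this cluster:

```python
def first(seq):
    """Same behaviour as toolz.itertoolz.first, used by cuml's part_utils."""
    return next(iter(seq))


def extract_partitions(who_has, key_to_part):
    # Mirrors the failing list comprehension in _extract_partitions.
    # If who_has[key] is an empty sequence, first() raises StopIteration
    # inside this generator's frame...
    yield [(first(who_has[key]), part) for key, part in key_to_part]


gen = extract_partitions({"k": []}, [("k", "part-0")])
try:
    next(gen)
except RuntimeError as e:
    # ...which PEP 479 converts into "RuntimeError: generator raised
    # StopIteration", with the StopIteration as the direct cause.
    print(type(e).__name__, "caused by", type(e.__cause__).__name__)
```

So the traceback is really telling us that `first(who_has[key])` found no workers for some partition key; the RuntimeError is just Python 3's re-raising of that.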

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@jacobtomlinson jacobtomlinson marked this pull request as ready for review February 15, 2023 17:40
@jacobtomlinson (Member, Author)
This notebook is ready for review when someone has a minute.

@betatim (Member) commented Feb 27, 2023

  • Can we set n_workers = 3 as the default? I used 8 without much thought, and things seemed to get stuck because my cluster only had three nodes. I could have read the instructions to change n_workers, but didn't. I think three is the default(??) size of a GKE cluster, so we'd cater to those who don't read/follow instructions.
  • Nit: spell out "Visualisation" in the "Vis and Analysis" heading.
  • I'm running the notebook outside the cluster, and when running the estimate_df_rows function I get a bunch of horrible-looking warnings. Can we avoid them somehow, and/or add a note for the unsuspecting user to let them know that all is well despite the warnings? (It does seem all is well, as you get the answer at the end.)
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
    WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d34f70>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d36440>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d35e40>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d34100>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d35480>: Failed to establish a new connection: [Errno -2] Name or service not known'))
  • The cell after the "Taxi Data Configuration (Medium)" heading has a hard-coded number of workers. It should reference `n_workers`, I think.

Beyond the above comments, the notebook runs and the prose reads well.
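On the third bullet: one low-effort way to quiet those messages, assuming the logger names shown verbatim in the pasted warnings, would be to raise the level of those loggers before calling estimate_df_rows. This is a sketch, not something I've tested against the notebook:

```python
import logging

# Silence the noisy-but-harmless google.auth warnings by raising the
# threshold of the loggers named in the warning output. The logger
# names below are copied from the pasted log lines; adjust if your
# output names different loggers.
for name in ("google.auth.compute_engine._metadata", "google.auth._default"):
    logging.getLogger(name).setLevel(logging.ERROR)
```

A short markdown note next to the cell ("you may see metadata-server warnings when running outside GKE; they are harmless") would also cover users who skip the suppression.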


Successfully merging this pull request may close these issues.

Migrate dask/kubernetes/Dask_cuML_Exploration_Full.ipynb to deployment docs