
Migrate cuml exploration notebook #153

Open
jacobtomlinson wants to merge 5 commits into main

Conversation

@jacobtomlinson (Member) commented Feb 15, 2023

Leverages the docref admonition added in #152. Reviewers may want to look just at 756b9cf, or view the notebook on its own in ReviewNB.

Closes rapidsai/cloud-ml-examples#207

The notebook seems to be reasonably useful. I updated the instructions from Dask Kubernetes Classic to the Dask Kubernetes Operator and added some GKE instructions to get folks going. I've also added more prose and tweaked headings to make the navigation look sensible.

However, the performance sweeps seem to fail with a StopIteration exception, and I'm a little out of my depth in debugging it.

Full traceback
Starting weak-scaling performance sweep for:
 model      : <class 'cuml.dask.ensemble.randomforestregressor.RandomForestRegressor'>
 data loader: <function <lambda> at 0x7fb02fac8f70>.
Configuration
==========================
Worker counts             : [8]
Fit/Predict samples       : 5
Data load samples         : 1
- Max data fraction       : 1.0
Model fit                 : X ~ y
- Response DType          : <class 'numpy.int32'>
Writing results to        : ./taxi_large_random_forest_regression.csv
- Method                  : append

Sampling <1> load times with 8 workers.

100%|██████████| 1/1 [00:00<00:00, 11522.81it/s]

Finished loading <1>, samples, to <8> workers with a mean time of 0.0000 sec.
Sweeping <class 'cuml.dask.ensemble.randomforestregressor.RandomForestRegressor'> 'fit' with <8> workers. Sampling <5> times.


  0%|          | 0/5 [00:06<?, ?it/s]

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/part_utils.py in _extract_partitions(dask_obj, client)
    161 
--> 162     raise gen.Return([(first(who_has[key]), part)
    163                       for key, part in key_to_part])

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/part_utils.py in <listcomp>(.0)
    161 
--> 162     raise gen.Return([(first(who_has[key]), part)
    163                       for key, part in key_to_part])

/opt/conda/envs/rapids/lib/python3.9/site-packages/toolz/itertoolz.py in first(seq)
    374     """
--> 375     return next(iter(seq))
    376 

StopIteration: 

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_280/2978372879.py in <module>
      8 rf_csv_path = f"./{out_prefix}_random_forest_regression.csv"
      9 
---> 10 performance_sweep(client=client, model=RandomForestRegressor,
     11                 **sweep_kwargs,
     12                 out_path=rf_csv_path,

/tmp/ipykernel_280/1927266979.py in performance_sweep(client, model, data_loader, hardware_type, worker_counts, samples, load_samples, max_data_frac, predict_frac, scaling_type, xy_fit, fit_requires_compute, update_workers_in_kwargs, response_dtype, out_path, append_to_existing, model_name, fit_func_id, predict_func_id, scaling_denom, model_args, model_kwargs)
    186         m = model(*model_args, **model_kwargs)
    187         if (fit_func_id):
--> 188             fit_timings = sweep_fit_func(model=m, func_id=fit_func_id,
    189                                              require_compute=fit_requires_compute,
    190                                              X=X, y=y, xy_fit=xy_fit, count=samples)

/tmp/ipykernel_280/1927266979.py in sweep_fit_func(model, func_id, require_compute, X, y, xy_fit, count)
     49             fit_func = partial(_fit_func_attr, X)
     50 
---> 51     return collect_func_time_samples(func=fit_func, count=count)
     52 
     53 

/tmp/ipykernel_280/1927266979.py in collect_func_time_samples(func, count, verbose)
     30     for k in tqdm(range(count)):
     31         with SimpleTimer() as timer:
---> 32             func()
     33         timings.append(timer.elapsed)
     34 

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/ensemble/randomforestregressor.py in fit(self, X, y, convert_dtype, broadcast_data)
    251         """
    252         self.internal_model = None
--> 253         self._fit(model=self.rfs,
    254                   dataset=(X, y),
    255                   convert_dtype=convert_dtype,

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/ensemble/base.py in _fit(self, model, dataset, convert_dtype, broadcast_data)
     99 
    100     def _fit(self, model, dataset, convert_dtype, broadcast_data):
--> 101         data = DistributedDataHandler.create(dataset, client=self.client)
    102         self.active_workers = data.workers
    103         self.datatype = data.datatype

/opt/conda/envs/rapids/lib/python3.9/site-packages/cuml/dask/common/input_utils.py in create(cls, data, client)
    103         datatype, multiple = _get_datatype_from_inputs(data)
    104 
--> 105         gpu_futures = client.sync(_extract_partitions, data, client)
    106 
    107         workers = tuple(set(map(lambda x: x[0], gpu_futures)))

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337             return future
    338         else:
--> 339             return sync(
    340                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341             )

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    404     if error:
    405         typ, exc, tb = error
--> 406         raise exc.with_traceback(tb)
    407     else:
    408         return result

/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py in f()
    377                 future = asyncio.wait_for(future, callback_timeout)
    378             future = asyncio.ensure_future(future)
--> 379             result = yield future
    380         except Exception:
    381             error = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py in run(self)
    773                             exc_info = None
    774                     else:
--> 775                         yielded = self.gen.send(value)
    776 
    777                 except (StopIteration, Return) as e:

RuntimeError: generator raised StopIteration
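For context on the RuntimeError wrapper at the bottom: since PEP 479 (the default from Python 3.7 onward), a StopIteration that escapes a generator body is converted into a RuntimeError with the StopIteration attached as its direct cause, which matches the chained traceback above. Here is a minimal sketch reproducing the shape of the failure; the `who_has`/`key_to_part` names mirror cuml's `_extract_partitions`, and the empty `who_has` entry (i.e. the scheduler reporting no worker as holding a key) is my guess at the trigger, not something I've verified on this cluster:

```python
def first(seq):
    """Same behaviour as toolz.itertoolz.first, used by cuml's part_utils."""
    return next(iter(seq))


def extract_partitions(who_has, key_to_part):
    # Mirrors the failing list comprehension in _extract_partitions.
    # If who_has[key] is an empty sequence, first() raises StopIteration
    # inside this generator's frame...
    yield [(first(who_has[key]), part) for key, part in key_to_part]


gen = extract_partitions({"k": []}, [("k", "part-0")])
try:
    next(gen)
except RuntimeError as e:
    # ...which PEP 479 converts into "RuntimeError: generator raised
    # StopIteration", with the StopIteration as the direct cause.
    print(type(e).__name__, "caused by", type(e.__cause__).__name__)
```

So the traceback is really telling us that `first(who_has[key])` found no workers for some partition key; the RuntimeError is just Python 3's re-raising of that.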

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@jacobtomlinson jacobtomlinson marked this pull request as ready for review February 15, 2023 17:40
@jacobtomlinson (Member, Author)
This notebook is ready for review when someone has a minute.

@betatim (Member) commented Feb 27, 2023

  • Can we set n_workers = 3 as the default? I used 8 without much thought, and things seemed to get stuck because my cluster only had three nodes. I could have read the instructions to change n_workers, but didn't. I think three is the default(??) size of a GKE cluster, so we'd cater to those who don't read/follow instructions.
  • Nit: spell out "Visualisation" in the "Vis and Analysis" heading.
  • I'm running the notebook outside the cluster, and when running the estimate_df_rows function I get a bunch of horrible-looking warnings. Can we avoid them somehow, and/or add a note for the unsuspecting user to let them know that all is well despite the warnings? (It does seem all is well, as you get the answer at the end.)
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
    WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d34f70>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d36440>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d35e40>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d34100>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6000d35480>: Failed to establish a new connection: [Errno -2] Name or service not known'))
  • The cell after the "Taxi Data Configuration (Medium)" heading has a hard-coded number of workers. It should reference `n_workers`, I think.

Beyond the above comments, the notebook runs and the prose reads well.
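On the third bullet: one low-effort way to quiet those messages, assuming the logger names shown verbatim in the pasted warnings, would be to raise the level of those loggers before calling estimate_df_rows. This is a sketch, not something I've tested against the notebook:

```python
import logging

# Silence the noisy-but-harmless google.auth warnings by raising the
# threshold of the loggers named in the warning output. The logger
# names below are copied from the pasted log lines; adjust if your
# output names different loggers.
for name in ("google.auth.compute_engine._metadata", "google.auth._default"):
    logging.getLogger(name).setLevel(logging.ERROR)
```

A short markdown note next to the cell ("you may see metadata-server warnings when running outside GKE; they are harmless") would also cover users who skip the suppression.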


Successfully merging this pull request may close these issues.

Migrate dask/kubernetes/Dask_cuML_Exploration_Full.ipynb to deployment docs