Updated Dockerfile of MLflow Kubernetes examples #2472
Conversation
Update for cython fix
@crcrpar Could you review this PR if you have time?
@0x41head Thanks for the PR. The CI failure would be fixed by merging the master branch.
I tried to execute the commands in `examples/kubernetes/mlflow/README.md`, but the worker pod failed as follows. It seems that `examples/kubernetes/mlflow/pytorch_lightning_distributed.py` uses the legacy `val_percent_check` interface. Do you have any idea?
```
(venv) mamu@HideakinoMacBook-puro mlflow % kubectl get pod
NAME                  READY   STATUS      RESTARTS   AGE
mlflow-0              1/1     Running     0          41s
postgres-0            1/1     Running     0          41s
study-creator-92jgf   0/1     Completed   0          41s
worker-b2mh5          0/1     Error       2          41s
worker-nrnb2          0/1     Error       2          41s
(venv) mamu@HideakinoMacBook-puro mlflow % kubectl logs worker-b2mh5
pytorch_lightning_distributed.py:128: ExperimentalWarning: MLflowCallback is experimental (supported from v1.4.0). The interface can change in the future.
  callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
[W 2021-03-16 23:18:31,031] Trial 4 failed because of the following error: TypeError("__init__() got an unexpected keyword argument 'val_percent_check'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
Traceback (most recent call last):
  File "pytorch_lightning_distributed.py", line 128, in <module>
    callbacks=[MLflowCallback(tracking_uri="http://mlflow:5000/", metric_name="val_accuracy")],
  File "/usr/local/lib/python3.7/site-packages/optuna/study.py", line 394, in optimize
    show_progress_bar=show_progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 76, in _optimize
    progress_bar=progress_bar,
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 268, in _run_trial
    raise func_err
  File "/usr/local/lib/python3.7/site-packages/optuna/_optimize.py", line 217, in _run_trial
    value_or_values = func(trial)
  File "pytorch_lightning_distributed.py", line 106, in objective
    callbacks=[metrics_callback],
  File "/usr/local/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 39, in insert_env_defaults
    return fn(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'val_percent_check'
```
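The traceback above comes from passing a removed keyword argument: PyTorch Lightning renamed `val_percent_check` to `limit_val_batches` (along with `train_percent_check` to `limit_train_batches` and `test_percent_check` to `limit_test_batches`) around v0.8, so newer Lightning versions raise a `TypeError` on the old name. The sketch below is a hypothetical helper (not part of the example script) that illustrates migrating legacy `Trainer` kwargs to their modern names before constructing the `Trainer`:

```python
# Hypothetical helper: map legacy PyTorch Lightning Trainer kwargs to the
# names used since v0.8, so an old script can be ported without guessing.
LEGACY_TRAINER_KWARGS = {
    "val_percent_check": "limit_val_batches",
    "train_percent_check": "limit_train_batches",
    "test_percent_check": "limit_test_batches",
}


def migrate_trainer_kwargs(kwargs):
    """Return a copy of kwargs with legacy names replaced by current ones."""
    migrated = {}
    for key, value in kwargs.items():
        new_key = LEGACY_TRAINER_KWARGS.get(key, key)
        if new_key != key and new_key in kwargs:
            # Refuse ambiguous input where both spellings are present.
            raise TypeError(f"Both {key!r} and {new_key!r} were passed")
        migrated[new_key] = value
    return migrated


print(migrate_trainer_kwargs({"val_percent_check": 0.1, "max_epochs": 3}))
# → {'limit_val_batches': 0.1, 'max_epochs': 3}
```

In the example script itself, the direct fix is simply to replace `val_percent_check=...` with `limit_val_batches=...` in the `Trainer(...)` call.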
Update to latest.
@HideakiImamura, I am unable to reproduce your errors in my local environment. Neither
Sorry for the late reply. The
What if you run
Thank you for commenting. I looked into the matter and realized you were absolutely correct. However, after running the same set of commands on the current Optuna master branch, I got the same errors, which leads me to believe that these errors might not be caused by this PR.
In the current master branch, the worker pods fail due to the following errors. The cause seems to be different from the error above, so we need to fix both of them in separate PRs. I'll create another issue for the fix. (Thanks to this PR, I was able to notice the bug. Thanks!)
Sorry, I didn't compare the logs of the two branches. I will look into this error ASAP.
Sure @HideakiImamura 👍
@HideakiImamura, a quick change from
Thanks for the investigation. I think we should fix both bugs in separate PRs, and after that come back to this PR. The reason for
Merge with master
Merge with master
@HideakiImamura #2514 fixes the issues we faced during this PR.
Thanks for the update and sorry for the late reply. LGTM!
I'm not familiar with Kubernetes, but I confirmed that jobs were successfully launched on my local minikube with the Dockerfile. Thank you @0x41head for updating the dependency!
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master    #2472      +/-   ##
==========================================
- Coverage   91.40%   91.38%   -0.02%
==========================================
  Files         135      135
  Lines       11330    11330
==========================================
- Hits        10356    10354       -2
- Misses        974      976       +2
```
Motivation
#2049
Description of the changes
Bumped the PyTorch dependencies to the latest stable versions.