DATA_DIR not respected #7

Closed
Pseudomanifold opened this issue Jun 17, 2022 · 11 comments

Comments

@Pseudomanifold
Collaborator

Pseudomanifold commented Jun 17, 2022

Hi Bastian!
I tried installing without poetry and running your code.
The installation itself worked, but I cannot figure out how to set DATA_DIR: the code is looking for the data in the wrong directory.
Here is the output that I get:

(togl) mohit@user-Default-string:~/TOGL$ python topognn/train_model.py --model TopoGNN --dataset DD --batch_size 20 --lr 0.0007
Using backend: pytorch
/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:68: UserWarning: No correct seed found, seed set to 3526443079
  warnings.warn(*args, **kwargs)
Global seed set to 3526443079
Traceback (most recent call last):
  File "topognn/train_model.py", line 150, in <module>
    main(model_cls, dataset_cls, args)
  File "topognn/train_model.py", line 59, in main
    dataset.prepare_data()
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
    return fn(*args, **kwargs)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
    return fn(*args, **kwargs)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/topognn/data_utils.py", line 549, in prepare_data
    with open(os.path.join(DATA_DIR, 'Benchmark_idx', self.name+"_"+section+'.index'), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/topognn/../data/Benchmark_idx/DD_train.index'

Originally posted by @mohit-kumar-27 in #6 (comment)

@Pseudomanifold
Collaborator Author

The simplest fix I'd recommend is setting DATA_DIR yourself in TOGL/topognn/__init__.py. You can point it to any directory you want to use.

As a fix from our side, we could use an env variable or refer to another path. What do you think @edebrouwer, @ExpectationMax, @mi92?
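
The environment-variable option could look roughly like the sketch below. It is only an illustration: the variable name `TOGL_DATA_DIR` and the helper `resolve_data_dir` are hypothetical and not part of TOGL; the fallback mirrors the package-relative `../data` default visible in the traceback above.

```python
import os
from pathlib import Path

# Hypothetical variable name; TOGL itself does not define it.
ENV_VAR = "TOGL_DATA_DIR"


def resolve_data_dir(package_file, env=None):
    """Return the data directory, preferring an environment override."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    if override:
        # An explicit setting wins over the package-relative default.
        return Path(override).expanduser()
    # Fallback: <directory containing the package>/data, i.e. "<package>/../data".
    return Path(package_file).parent.parent / "data"
```

In `topognn/__init__.py` this could be used as `DATA_DIR = resolve_data_dir(__file__)`, so that running with `TOGL_DATA_DIR=/path/to/data` overrides the default without editing the package.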

@Pseudomanifold
Collaborator Author

@mohit-kumar-27 any updates on this? Does the proposed workaround solve your problem?

@mohit-kumar-27

Hello Bastian,
I have not checked yet; I am tied up with some urgent work. I will try running it again this weekend and update you, probably on Sunday or Monday.

@mohit-kumar-27

This is how I modified TOGL/topognn/__init__.py:

import os.path
from enum import Enum, auto
DATA_DIR = '/home/mohit/TOGL/data/'
# DATA_DIR = os.path.join(os.path.dirname(__file__), '..', 'data')

class Tasks(Enum):
    """Valid tasks."""

    GRAPH_CLASSIFICATION = auto()
    NODE_CLASSIFICATION = auto()
    NODE_CLASSIFICATION_WEIGHTED = auto()

The code still searches in the wrong directory and raises the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/topognn/../data/Benchmark_idx/DD_train.index'
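
The path in this error still points into `site-packages`, which suggests that Python imports an installed copy of `topognn` rather than the edited checkout under `~/TOGL`. A minimal sketch for checking which copy is picked up (the helper names `locate_package` and `is_installed_copy` are made up for this illustration):

```python
import importlib.util
from pathlib import Path


def locate_package(name):
    """Return the directory a package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def is_installed_copy(package_dir):
    """True if the path lies inside a site-packages or dist-packages directory."""
    parts = Path(package_dir).parts
    return any(part in ("site-packages", "dist-packages") for part in parts)


if __name__ == "__main__":
    location = locate_package("topognn")
    if location is None:
        print("topognn is not importable from this environment")
    else:
        kind = "installed copy" if is_installed_copy(location) else "local checkout"
        print(f"topognn is imported from {location} ({kind})")
```

If the installed copy is reported, edits to the checkout only take effect after the package is installed again.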

@Pseudomanifold
Collaborator Author

Pseudomanifold commented Jun 23, 2022 via email

@mohit-kumar-27

mohit-kumar-27 commented Jun 24, 2022

Hi Bastian,

I reinstalled the project and ran the code again. The DATA_DIR error is resolved, but now I get the following error:
raise CommError("Permission denied, ask the project owner to grant you access")
wandb.errors.CommError: Permission denied, ask the project owner to grant you access
wandb: ERROR Internal wandb error: file data was not synced

I created a new wandb account and entered the API key when the program asked for it; that is when this error appeared.

This is the full output:

wandb: Currently logged in as: mohitk2 (use wandb login --relogin to force relogin)
wandb: wandb version 0.12.19 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: ERROR Error while calling W&B API: project not found (<Response [404]>)
Thread SenderThread:
Traceback (most recent call last):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/lib/retry.py", line 102, in __call__
result = self._call_fn(*args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 133, in execute
six.reraise(*sys.exc_info())
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/six.py", line 719, in reraise
raise value
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 127, in execute
return self.client.execute(*args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
request.raise_for_status()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://api.wandb.ai/graphql

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/apis/normalize.py", line 24, in wrapper
return func(*args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 922, in upsert_run
response = self.gql(mutation, variable_values=variable_values, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/lib/retry.py", line 118, in __call__
if not check_retry_fn(e):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/util.py", line 727, in no_retry_auth
raise CommError("Permission denied, ask the project owner to grant you access")
wandb.errors.CommError: Permission denied, ask the project owner to grant you access

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 55, in run
self._run()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 105, in _run
self._process(record)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal.py", line 292, in _process
self._sm.send(record)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 181, in send
send_handler(record)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 604, in send_run
self._init_run(run, config_value_dict)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/sender.py", line 626, in _init_run
server_run, inserted = self._api.upsert_run(
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/apis/normalize.py", line 62, in wrapper
six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/six.py", line 718, in reraise
raise value.with_traceback(tb)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/apis/normalize.py", line 24, in wrapper
return func(*args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 922, in upsert_run
response = self.gql(mutation, variable_values=variable_values, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/lib/retry.py", line 118, in __call__
if not check_retry_fn(e):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/util.py", line 727, in no_retry_auth
raise CommError("Permission denied, ask the project owner to grant you access")
wandb.errors.CommError: Permission denied, ask the project owner to grant you access
wandb: ERROR Internal wandb error: file data was not synced
Problem at: /scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py 155 experiment
Traceback (most recent call last):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 761, in init
run = wi.init()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 520, in init
backend.cleanup()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 167, in cleanup
self.interface.join()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 836, in join
_ = self._communicate(record)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 545, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 550, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown
wandb: ERROR Abnormal program exit
Traceback (most recent call last):
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 761, in init
run = wi.init()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 520, in init
backend.cleanup()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/backend/backend.py", line 167, in cleanup
self.interface.join()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 836, in join
_ = self._communicate(record)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 545, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 550, in _communicate_async
raise Exception("The wandb backend process has shutdown")
Exception: The wandb backend process has shutdown

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "topognn/train_model.py", line 150, in <module>
main(model_cls, dataset_cls, args)
File "topognn/train_model.py", line 82, in main
dirpath=wandb_logger.experiment.dir,
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
return get_experiment() or DummyExperiment()
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 48, in wrapped_fn
return fn(*args, **kwargs)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
return fn(self)
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 155, in experiment
self._experiment = wandb.init(
File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 798, in init
six.raise_from(Exception("problem"), error_seen)
File "<string>", line 3, in raise_from
Exception: problem

@Pseudomanifold
Collaborator Author

You can start train_model.py with WANDB_MODE=disabled or WANDB_MODE=offline, i.e.:

$ WANDB_MODE=offline poetry run python train_model.py
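
The same setting can also be applied from Python, as long as it happens before wandb initialises a run. The sketch below is an illustration only: `set_wandb_mode` is a hypothetical helper, and the accepted values are limited to the two modes named above plus the regular `online` mode.

```python
import os

# The two modes named in this thread, plus the regular online mode.
VALID_MODES = {"online", "offline", "disabled"}


def set_wandb_mode(mode="offline"):
    """Set WANDB_MODE for this process; call before any wandb run is created."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported WANDB_MODE: {mode!r}")
    os.environ["WANDB_MODE"] = mode
    return mode
```

Calling `set_wandb_mode("disabled")` at the very top of a launcher script would have the same effect as prefixing the command with `WANDB_MODE=disabled`.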

@ExpectationMax @edebrouwer: should we solve this more generically and remove the team name from the WandB logger? Or potentially default to a tensorboard logger?
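
One way the "default to tensorboard" idea could be decided at start-up is sketched below. This is a hypothetical policy, not TOGL code: `pick_logger_backend` is an invented name, and it only looks at the `WANDB_MODE` and `WANDB_API_KEY` environment variables, so credentials stored by `wandb login` are not detected.

```python
import os


def pick_logger_backend(env=None):
    """Return "wandb" or "tensorboard" based on environment variables only."""
    env = os.environ if env is None else env
    mode = env.get("WANDB_MODE", "online")
    has_key = bool(env.get("WANDB_API_KEY"))
    if mode == "disabled":
        # W&B is explicitly switched off: log locally instead.
        return "tensorboard"
    if mode == "online" and not has_key:
        # Online logging without a key would prompt or fail: log locally.
        return "tensorboard"
    # Offline mode, or online mode with a key available.
    return "wandb"
```

The training script would then construct the matching Lightning logger for the returned name.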

@mohit-kumar-27

I ran the following from my terminal
(togl) mohit@user-Default-string:~/TOGL$ wandb offline

(togl) mohit@user-Default-string:~/TOGL$ python topognn/train_model.py --model TopoGNN --dataset DD --batch_size 20 --lr 0.0007

I get the following error:

Traceback (most recent call last):
  File "topognn/train_model.py", line 152, in <module>
    main(model_cls, dataset_cls, args)
  File "topognn/train_model.py", line 98, in main
    trainer.fit(model, datamodule=dataset)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 107, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 58, in _launch
    self.pid = util.spawnv_passfds(spawn.get_executable(),
  File "/scott/mohit/anaconda3/envs/togl/lib/python3.8/multiprocessing/util.py", line 452, in spawnv_passfds
    return _posixsubprocess.fork_exec(
ValueError: bad value(s) in fds_to_keep

wandb: Waiting for W&B process to finish, PID 11662
wandb: Program failed with code 1.

Could you suggest what needs to be done here?

@Pseudomanifold
Collaborator Author

Seems to be a problem with wandb; please try WANDB_MODE=disabled.

PS: Please read and follow these instructions for formatting your messages.

@mohit-kumar-27

I tried
(mohit_f) mohit@user-Default-string:~/TOGL$ WANDB_MODE=disabled python topognn/train_model.py --model TopoGNN --dataset DD --batch_size 20 --lr 0.0007
I still get the same error.

The issue seems to be with pytorch_lightning and multiprocessing.

@Pseudomanifold
Collaborator Author

Hmm, might be better to open a separate issue with pytorch-lightning. You could also check whether you can change the Trainer class (use a different strategy for training, as described in the documentation). See also PyTorch issue 538.
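
The traceback above passes through Lightning's `ddp_spawn` plugin, which launches worker processes via `torch.multiprocessing.spawn`. If that path is only taken because several GPUs are visible (an assumption; the thread does not confirm how the Trainer is configured), exposing a single GPU to the process is one way to test whether the spawn step is the culprit. A minimal sketch, with `expose_single_gpu` being a hypothetical helper:

```python
import os


def expose_single_gpu(index=0):
    """Make only one GPU visible; must run before torch initialises CUDA."""
    if index < 0:
        raise ValueError("GPU index must be non-negative")
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

This does not change the training strategy itself; it only narrows the devices the Trainer can see, so it will not help if the script requests a fixed number of GPUs greater than one.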

Closing this issue for now since the original problem has been resolved. Please feel free to open another issue for anything else related to TOGL.
