-
Notifications
You must be signed in to change notification settings - Fork 2
chore!(train): データセットを分割して一部で評価するようにする #110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
こちらは今どういう状態でしょうか 👀 |
コミットだけしてレビューリクエストを出し忘れる -> あ、ここリファクタできそう...って感じでした。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the training process by splitting the original training dataset into distinct training and test subsets for mid-training evaluation. Key changes include:
- Splitting the training dataset into train and test datasets using torch.utils.data.random_split.
- Adding a new test data loader and evaluator for performing test evaluations alongside the existing evaluation process.
- Updating configuration files (Python and YAML) to support the test_ratio parameter.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
train/src/train.py | Introduces dataset splitting and a separate test evaluator/dataloader. |
train/src/config.py | Adds the test_ratio configuration parameter. |
train/config/example.yml | Updates example configuration to include test_ratio. |
train/config/dummy.yml | Updates dummy configuration to include test_ratio. |
@@ -278,6 +285,12 @@ def train(): | |||
collate_fn=partial(collate_fn, device=device), | |||
drop_last=True, | |||
) | |||
test_dl = DataLoader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(このプルリクエストに関係ないのですが)
_dl
、なかなか問題ありそうな名前ですね 😇
ここは定義だから問題ないけど、使ってるとこでディープラーニングと勘違いしそう。
train/src/train.py
Outdated
def calculate_bleu( | ||
label: str, | ||
model: Model, | ||
evaluator: Evaluator, | ||
epoch: int, | ||
writer: SummaryWriter, | ||
) -> Tensor: | ||
eval_bleu = evaluator.evaluate(model) | ||
writer.add_scalar(f"BLEU/{label}", eval_bleu, epoch) | ||
print(f"Epoch {epoch} {label} BLEU: {eval_bleu}") | ||
return eval_bleu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
calculate_loss
は値を返さないのにこっちは値を返しているの、不揃いだなーと感じました!
calculate
は値を返すのが正しいと思います。
そしてadd_scalar
するのはちょっと関心が違いそう。
calculate_bleu
とwrite_bleu
に分けるとかですかねぇ。(write_bleu
という名前が良いのか若干しっくり来ないけど・・・)
まあ少なくとも今の関数名と関数の形はちょっと変そう!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write_scalarという関数にしましたが、かといってwriteにしてはprintもしてるし...(commit...?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log
が良いかなーと思いましたが、対数とかぶるんですよねー。。。
と思ってChatGPTに聞いてみたらreport
を進められました!!良さそう!!
https://chatgpt.com/share/681241d8-e34c-8008-a909-7911abde0150
こういう時わりと AI 君いいの出してくれる印象あるので、活用してみると幅広がるかもです!!
(Copilot君に聞くとかなり便利なはず)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
report_scalarにしました。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the training script to split the training dataset into training and test portions and evaluates the model on both during training. Key changes include:
- Refactoring dataset creation with a new prepare_datasets function that splits the dataset using a test_ratio.
- Adding a dedicated test DataLoader and corresponding evaluator to calculate test metrics.
- Updating configuration files to include the test_ratio parameter.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
train/src/train.py | Implements dataset splitting, adds test DataLoader/evaluator, and updates loss/metric logging. |
train/src/config.py | Adds the new test_ratio config field. |
train/config/example.yml | Documents the new test_ratio parameter. |
train/config/dummy.yml | Updates dummy config to include test_ratio. |
Comments suppressed due to low confidence (1)
train/src/train.py:312
- Verify that the RAdamScheduleFree optimizer supports an eval() mode as typical PyTorch optimizers do not. If not, removing or replacing this call might prevent potential runtime errors.
optimizer.eval()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!!
関数の名前だけ参考になれば!
Co-authored-by: Hiroshiba <[email protected]>
マージします。 |
内容
trainのデータセットを分割して一部で評価するようにします。
関連 Issue
スクリーンショット・動画など
(なし)
その他
(なし)