chore!(train): Split the dataset and evaluate on part of it #110

sevenc-nanashi · 2025-04-24T15:04:25Z

Description

Split the train dataset and use part of it for evaluation.

Screenshots, videos, etc.

(None)

Other

(None)

Hiroshiba · 2025-04-28T11:21:23Z

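What state is this in right now? 👀
Judging from the commit history, is it OK to go ahead and review it...?
(Asking because it would be a waste if we both ended up waiting on each other! 🙏)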

sevenc-nanashi · 2025-04-28T11:22:54Z

I committed but forgot to send the review request, and then it turned into "oh, this part could be refactored...".
It should be fine to review now.

Copilot

Pull Request Overview

This PR refactors the training process by splitting the original training dataset into distinct training and test subsets for mid-training evaluation. Key changes include:

Splitting the training dataset into train and test datasets using torch.utils.data.random_split.
Adding a new test data loader and evaluator for performing test evaluations alongside the existing evaluation process.
Updating configuration files (Python and YAML) to support the test_ratio parameter.
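The split described above can be illustrated without PyTorch. The following is a minimal sketch, assuming test_ratio is a fraction strictly between 0 and 1. split_dataset is a hypothetical helper that uses a seeded standard-library shuffle as a stand-in for torch.utils.data.random_split; it is not the PR's actual code.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], test_ratio: float, seed: int = 0
) -> tuple[list[T], list[T]]:
    """Randomly split items into (train, test) subsets (hypothetical helper)."""
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    # Round the test size, but keep at least one sample on each side.
    test_size = min(max(1, round(len(items) * test_ratio)), len(items) - 1)
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


train_set, test_set = split_dataset(list(range(1000)), test_ratio=0.01)
print(len(train_set), len(test_set))
```

In PyTorch itself, random_split takes the dataset and a list of lengths, and accepts a generator argument for seeding, so the same size computation would feed that call.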

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File	Description
train/src/train.py	Introduces dataset splitting and a separate test evaluator/dataloader.
train/src/config.py	Adds the test_ratio configuration parameter.
train/config/example.yml	Updates example configuration to include test_ratio.
train/config/dummy.yml	Updates dummy configuration to include test_ratio.
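As a rough illustration of what a test_ratio config field might look like, here is a minimal sketch using a standard-library dataclass. TrainConfig and its validation are assumptions for illustration only; the real train/src/config.py may be structured differently, and the value 0.01 is illustrative rather than a documented default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical subset of the training config; only test_ratio comes from the PR."""

    test_ratio: float  # fraction of the training data held out for testing

    def __post_init__(self) -> None:
        # Reject values that would leave either subset empty.
        if not 0.0 < self.test_ratio < 1.0:
            raise ValueError(f"test_ratio must be in (0, 1), got {self.test_ratio}")


config = TrainConfig(test_ratio=0.01)  # illustrative value only
print(config)
```

Validating the ratio at load time surfaces a bad config before any data is read, rather than partway through dataset preparation.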

Hiroshiba · 2025-04-28T12:03:07Z

train/src/train.py

@@ -278,6 +285,12 @@ def train():
        collate_fn=partial(collate_fn, device=device),
        drop_last=True,
    )
+    test_dl = DataLoader(


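(Unrelated to this pull request, but)

_dl looks like a pretty problematic name 😇
It's fine here because this is the definition, but where it's used it could be mistaken for deep learning.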

Hiroshiba · 2025-04-29T06:04:06Z

train/src/train.py

+def calculate_bleu(
+    label: str,
+    model: Model,
+    evaluator: Evaluator,
+    epoch: int,
+    writer: SummaryWriter,
+) -> Tensor:
+    eval_bleu = evaluator.evaluate(model)
+    writer.add_scalar(f"BLEU/{label}", eval_bleu, epoch)
+    print(f"Epoch {epoch} {label} BLEU: {eval_bleu}")
+    return eval_bleu


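It struck me as inconsistent that calculate_loss doesn't return a value while this one does!

I think a function named calculate should return a value.
And calling add_scalar here looks like a somewhat different concern.

Maybe split it into calculate_bleu and write_bleu? (Though the name write_bleu doesn't feel quite right to me either...)
At the very least, the current function name and the shape of the function seem a bit off!

I made it a function called write_scalar, but then for a "write" it also prints... (commit...?)

I thought log might be good, but that clashes with logarithm...

So I asked ChatGPT, and it suggested report!! Looks good!!
https://chatgpt.com/share/681241d8-e34c-8008-a909-7911abde0150
My impression is that AI tends to come up with good options in cases like this, so making use of it might broaden your choices!!
(Asking Copilot should be pretty handy)

I changed it to report_scalar.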
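The separation discussed in this thread could be sketched as follows: one function only computes the score, and another only reports it. This is a minimal sketch under assumptions. The report_scalar signature, the Protocol classes, and the in-memory stubs are hypothetical; only the names calculate_bleu and report_scalar and the add_scalar/print behavior come from the conversation.

```python
from typing import Protocol


class Evaluator(Protocol):
    def evaluate(self, model: object) -> float: ...


class ScalarWriter(Protocol):
    """Minimal interface compatible with SummaryWriter.add_scalar."""

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None: ...


def calculate_bleu(model: object, evaluator: Evaluator) -> float:
    """Compute the score only; no logging side effects."""
    return evaluator.evaluate(model)


def report_scalar(
    name: str, label: str, value: float, epoch: int, writer: ScalarWriter
) -> None:
    """Record the value with the writer and print it; returns nothing."""
    writer.add_scalar(f"{name}/{label}", value, epoch)
    print(f"Epoch {epoch} {label} {name}: {value}")


class InMemoryWriter:
    """Stub writer for the demo (assumption, stands in for SummaryWriter)."""

    def __init__(self) -> None:
        self.records: list[tuple[str, float, int]] = []

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None:
        self.records.append((tag, scalar_value, global_step))


class ConstantEvaluator:
    """Stub evaluator that always returns a fixed score (assumption)."""

    def __init__(self, score: float) -> None:
        self.score = score

    def evaluate(self, model: object) -> float:
        return self.score


writer = InMemoryWriter()
bleu = calculate_bleu(model=None, evaluator=ConstantEvaluator(0.42))
report_scalar("BLEU", "test", bleu, epoch=3, writer=writer)
```

Keeping the side effects in report_scalar means the same reporting path can serve loss and BLEU alike, while the calculate functions stay easy to test.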

Copilot

Pull Request Overview

This PR refactors the training script to split the training dataset into training and test portions and evaluates the model on both during training. Key changes include:

Refactoring dataset creation with a new prepare_datasets function that splits the dataset using a test_ratio.
Adding a dedicated test DataLoader and corresponding evaluator to calculate test metrics.
Updating configuration files to include the test_ratio parameter.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File	Description
train/src/train.py	Implements dataset splitting, adds test DataLoader/evaluator, and updates loss/metric logging.
train/src/config.py	Adds the new test_ratio config field.
train/config/example.yml	Documents the new test_ratio parameter.
train/config/dummy.yml	Updates dummy config to include test_ratio.

Comments suppressed due to low confidence (1)

train/src/train.py:312

Verify that the RAdamScheduleFree optimizer supports an eval() mode as typical PyTorch optimizers do not. If not, removing or replacing this call might prevent potential runtime errors.

        optimizer.eval()
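For context on the suppressed comment: to my knowledge, the optimizers in the schedulefree package are documented as requiring optimizer.train() and optimizer.eval() calls, whereas the standard torch.optim optimizers have no such methods. If code needed to work with both kinds, the call could be guarded. The following is a minimal sketch with stub classes; set_optimizer_mode and both stubs are hypothetical and not part of the PR.

```python
def set_optimizer_mode(optimizer: object, training: bool) -> bool:
    """Call optimizer.train() or optimizer.eval() if present.

    Returns True if a mode switch was performed, False if the optimizer
    has no such method (hypothetical helper).
    """
    method = getattr(optimizer, "train" if training else "eval", None)
    if callable(method):
        method()
        return True
    return False


class ScheduleFreeLike:
    """Stub mimicking an optimizer that exposes train()/eval() (assumption)."""

    def __init__(self) -> None:
        self.mode = "train"

    def train(self) -> None:
        self.mode = "train"

    def eval(self) -> None:
        self.mode = "eval"


class PlainOptimizer:
    """Stub mimicking an optimizer without train()/eval() (assumption)."""


schedule_free = ScheduleFreeLike()
switched = set_optimizer_mode(schedule_free, training=False)
skipped = set_optimizer_mode(PlainOptimizer(), training=False)
print(switched, schedule_free.mode, skipped)
```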

Hiroshiba

LGTM!!

Just leaving the function name suggestion in case it helps!

train/src/config.py

Co-authored-by: Hiroshiba <[email protected]>

sevenc-nanashi · 2025-05-01T01:53:17Z

Merging.

sevenc-nanashi added 3 commits April 25, 2025 00:03

chore!: split train and evaluate on part of it

257c65e

fix: 0.1 -> 0.01

e4c4305

fix: 0.1 -> 0.01

e13f869

sevenc-nanashi commented Apr 24, 2025

train/config/dummy.yml (outdated)

sevenc-nanashi added 3 commits April 27, 2025 07:30

fix: bring dummy.yml up to date, just in case

86831b1

feat: bring back eval

5e48476

refactor: consolidate the BLEU-related code

1cd48f5

sevenc-nanashi commented Apr 28, 2025

train/src/train.py

fix: return bleu

e24a7a0

style: ruff format

240c7b0

sevenc-nanashi requested a review from Copilot April 28, 2025 12:12

Copilot AI reviewed Apr 28, 2025

train/src/train.py (outdated)

Hiroshiba reviewed Apr 28, 2025

Hiroshiba reviewed Apr 29, 2025

sevenc-nanashi added 2 commits April 30, 2025 22:33

refactor: split out write_scalar

a5300fc

refactor: split out dataset preparation

a473e69

Hiroshiba requested a review from Copilot April 30, 2025 15:31

Copilot AI reviewed Apr 30, 2025

train/src/train.py

Hiroshiba approved these changes Apr 30, 2025

train/src/config.py (outdated)

sevenc-nanashi and others added 2 commits May 1, 2025 10:46

chore: test_ratiowo

cfcf7fd

Co-authored-by: Hiroshiba <[email protected]>

chore: write_scalar -> report_scalar

1ab747c

sevenc-nanashi enabled auto-merge May 1, 2025 01:53

sevenc-nanashi added this pull request to the merge queue May 1, 2025

Merged via the queue into VOICEVOX:main with commit eba693f May 1, 2025
2 checks passed
