Issues: kubeflow/trainer
Issues list
Fix the Coveralls badge for the Go test coverage
area/testing
good first issue
help wanted
kind/bug
#2519
opened Mar 13, 2025 by
andreyvelich
KEP-2401: Determine the tag for torchtune trainer & Add support for multiple accelerators
area/llm
kind/feature
#2518
opened Mar 13, 2025 by
Electronic-Waste
Create DeepSpeed Runtime with Kubeflow Trainer
area/runtimes
kind/feature
#2517
opened Mar 13, 2025 by
andreyvelich
Get and Use TrainingRuntime ApplyConfiguration throughout KF PipelineFramework
kind/feature
#2515
opened Mar 13, 2025 by
tenzen-y
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family
area/llm
area/runtimes
kind/feature
#2510
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Create LLM Training Runtimes for Llama 3.1 model family
area/llm
area/runtimes
kind/feature
#2509
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Validate fine-tuning configurations in torch plugin
area/llm
kind/feature
#2508
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Complement torch plugin to support torchtune config mutation
area/llm
kind/feature
#2507
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Support mutating dataset preprocessing config in SDK
area/llm
area/sdk
kind/feature
#2506
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Support LoRA/QLoRA/DoRA fine-tuning in LLM Trainer V2
area/llm
area/sdk
kind/feature
#2505
opened Mar 12, 2025 by
Electronic-Waste
KEP-2401: Add TorchTuneConfig to train() API
area/llm
area/sdk
kind/feature
#2504
opened Mar 12, 2025 by
Electronic-Waste
Add replicatedJobs.replicas validations to TrainingRuntime and ClusterTrainingRuntime Webhook
kind/feature
#2502
opened Mar 12, 2025 by
tenzen-y
Update Kubeflow Pipeline Framework Diagram and Description with PodNetworkPlugin
kind/documentation
kind/feature
#2497
opened Mar 10, 2025 by
tenzen-y
Migrate Trainer to PodSet and RuntimePolicy in runtime package (InternalAPI)
kind/cleanup
#2495
opened Mar 10, 2025 by
tenzen-y
Add a workflow for publishing Helm charts
area/deployment
good first issue
help wanted
kind/feature
#2488
opened Mar 7, 2025 by
ChenYi015
Decouple UTs between Framework and Plugins packages
area/controller
kind/feature
#2468
opened Mar 3, 2025 by
tenzen-y
2 of 6 tasks
Explore uv project manager for Kubeflow Python SDK
area/sdk
good first issue
help wanted
kind/discussion
kind/feature
#2462
opened Feb 28, 2025 by
andreyvelich
KEP-2170: Revisit TrainJob Created condition status type
kind/feature
#2459
opened Feb 28, 2025 by
tenzen-y
Distributed training with multiple pods, with multi-gpu in each pod
#2456
opened Feb 28, 2025 by
githubthunder
Managing Pod Lifecycle in Distributed Training with TFJob
kind/feature
lifecycle/needs-triage
#2454
opened Feb 27, 2025 by
mnmhouse
Strategies for Deleting Successful Pods without Affecting Task Execution in TFJob
area/controller
kind/bug
#2453
opened Feb 27, 2025 by
mnmhouse
Add unit tests that cover the pkg/apply package
area/testing
good first issue
help wanted
#2452
opened Feb 26, 2025 by
astefanutti