Releases: MyrtleSoftware/caiman-asr
v1.13.0
Release notes
This release adds low-latency end-pointing to detect the end of an utterance. It also caps the latency of the beam search finals at 1.25 seconds, which significantly reduces both the finals' latency and the user-perceived tail latencies without impacting WER. Finally, this release speeds up on-GPU beam decoding by up to 10x.
This release adds:
- End-pointing (docs)
- Capping of the delay between partials and finals via `--beam_final_emission_thresh=1.25`
- A batched implementation of the on-GPU beam decoder
- Support for training models in character-based languages (tested on Mandarin). This required:
  - small tokenizer changes
  - support for calculating character error rate (CER) and mixture error rate (MER)
This release also:
- Improves the scheduling of the delay penalty by waiting until the validation WER has dropped before it kicks in (docs)
- Reduces the startup time at the beginning of training by adding a noise-data cache and speeding up both JSON parsing and tokenization
- Deprecates the 49M-param `testing` model configuration and makes the 85M-param `base` model the default for training. See supported models
- Improves the usability of the live demo client (docs)
- Fixes the emission latency estimation for the beam decoder
- Improves logging during training and evaluation
- Filters out utterances shorter than `min_duration: 0.05` s during training
Summary of changes to default args
- `--delay_penalty="linear_schedule"` instead of `"wer_schedule"`
- `--val_batch_size=1024` instead of `256`
- `--beam_final_emission_thresh=1.25` added to cap the finals' latency during beam decoding (see the sketch after this list)
- YAML config: adds `min_duration: 0.05` seconds to filter out short utterances during training
- YAML config: adds `error_rate: word`, which determines the error rate calculated and must be one of `{wer|word, cer|char, mer|mixture}`
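As a rough illustration of these defaults, the sketch below folds the flags named above into a single beam-decoding validation command. It assumes the `./scripts/val.sh` entry point mentioned elsewhere in these notes; checkpoint, dataset and model-config arguments are omitted and would need to be supplied as usual.

```bash
# Minimal sketch: beam-search validation with the v1.13.0 defaults.
# Only flags named in these release notes are shown.
./scripts/val.sh \
  --decoder=beam \
  --beam_final_emission_thresh=1.25 \
  --val_batch_size=1024

# YAML config additions from this release (config keys, not CLI flags):
#   min_duration: 0.05   # drop utterances shorter than 0.05 s during training
#   error_rate: word     # one of {wer|word, cer|char, mer|mixture}
```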
v1.12.0
Release notes
This release reduces model latencies significantly with the addition of delay-penalty training and speeds up training by up to 2.0x (`base` model on an 8xA100 node). It also speeds up greedy decoding by up to 40x (depending on the dataset, model and machine).
This release:
- Reduces the solution's latency. For a breakdown of the system latencies see the user-perceived latency docs. Specifically, this release:
  - Adds support to train with a delay penalty, either fixed or on a linear schedule, to encourage early emission of tokens. This is turned on by default; see docs here
  - Adds support to calculate emission latency during training and validation. Turn on with `--calculate_emission_latency` (see the sketch after this list)
- Speeds up training (see training times docs) via:
  - Reducing the RNN-T loss' memory consumption by performing log_softmax in-place
  - Running data preparation on GPU. The default is now `--dali_train_device='gpu'`
  - Support for training on CPUs with heterogeneous cores; more information can be found here
- Adds support for a batched greedy decoder. This should work by default (use `decode='greedy'` and `--val_batch_size=<INT_VALUE>`)
- Adds an option to floor the gradient scaler to help stabilize training. Turn on with `--scaler_lbl2=<FLOAT_VALUE>`
- Makes various documentation improvements
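A minimal sketch of how several of these flags might be combined in one training invocation. It is illustrative only: the `--scaler_lbl2` value is an arbitrary example, delay-penalty training is already on by default, and data paths plus the other required training arguments are omitted.

```bash
# Illustrative (incomplete) v1.12.0 training command.
./scripts/train.sh \
  --calculate_emission_latency \
  --dali_train_device='gpu' \
  --scaler_lbl2=8.0   # example value; flooring the grad scaler is optional
```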
v1.11.0
Release notes
This release significantly reduces the best WERs (details here) by adding beam search and expanding the data we train the off-the-shelf English models on.
This release:
- Adds support for adaptive beam search and an n-gram language model. Turn on with `--decoder=beam` (as opposed to the default `--decoder=greedy`). By default, the n-gram language model is trained on the training data transcripts (see the sketch after this list)
- Adds a performance page with a breakdown of WERs, latencies and RTS
- Supports training a cased model with this workflow
  - For information on evaluating the WER of a cased model, run `./scripts/val.sh -h` and see the help section under "WER analysis"
- Reduces `train.sh` startup time by parallelizing normalization and tokenization across multiple threads
- Detects OOMs early by first training on the longest utterances in the dataset
- Updates the LibriSpeech preprocessing scripts to make them easier to use on other datasets
- Removes the previously deprecated `legacy/train.sh` and `legacy/val.sh` scripts
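A sketch of switching between the two decoders at validation time. The flag spellings are taken from these notes; checkpoint and data arguments are omitted.

```bash
# Greedy decoding (the default).
./scripts/val.sh --decoder=greedy

# Adaptive beam search with the n-gram LM trained on the training transcripts.
./scripts/val.sh --decoder=beam

# Help text, including the "WER analysis" section relevant to cased models.
./scripts/val.sh -h
```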
v1.10.1
Release notes
This release improves the documentation, including updated latency figures and hardware requirements.
v1.10.0
This release adds a script to run live transcriptions from a user's microphone. See docs: [markdown] [browser]
Full Changelog: v1.9.0...v1.10.0
v1.9.0
Release Notes
This release reduces the WER on long utterances. Specifically, the best Earnings21 WER when training on open-source data decreases from 21.85% to 15.57% (see the latest WERs here). There are a few contributing changes, but the most important one is the addition of "state resets with overlaps".
This release:
- Adds hosted documentation here
- Improves validation. Specifically it:
  - Adds an augmentation technique in which we sample across possible tokenizations during training rather than always using the default sentencepiece tokenization. This is on by default. See here
  - Updates the training script to control checkpoint saving and evaluation using steps rather than epochs. This fixes an issue where users training on large datasets saved checkpoints only very rarely
    - Relatedly, the total length of training is now controlled using `--training_steps` rather than `--epochs`
- Changes the way in which activations are normalized at the input to the RNN-T. Specifically:
  - Streaming norm is replaced with normalization using the precomputed mean and stddev of the training data's mel-bins
  - This change was made because streaming norm was only used at inference time and resulted in some WER degradation
  - See here
- Makes the following miscellaneous changes:
  - Renames the repository and python library to CAIMAN-ASR to match the product name
  - Reduces the time to start training by >50% by parallelizing the transcript tokenization
  - Updates the code structure to make it easier to navigate
train.sh defaults: summary of changes (illustrated below)
- `--training_steps=100000` (instead of `--epochs=100`)
- `--sr_segment=15`, `--sr_overlap=3` (addition of state resets)
- `--max_inputs_per_batch=1e7` (reduces validation VRAM usage)
- yaml: `sampling: 0.05` (adds tokenizer sampling)
- yaml: `stats_path: /datasets/stats/STATS_SUBDIR` (adds dataset stats normalization)
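Purely as an illustration of where these defaults appear, assuming the `./scripts/train.sh` entry point; since they are defaults, passing them explicitly is optional, and all dataset and model arguments are omitted.

```bash
# Sketch of a v1.9.0-style training run spelling out the updated defaults.
./scripts/train.sh \
  --training_steps=100000 \
  --sr_segment=15 \
  --sr_overlap=3 \
  --max_inputs_per_batch=1e7

# Corresponding YAML additions (config keys, not CLI flags):
#   sampling: 0.05                             # tokenizer sampling
#   stats_path: /datasets/stats/STATS_SUBDIR   # precomputed mel-bin stats
```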
v1.8.0
Release notes
This release adds a number of features and increases the training speed by 1.1-2.0x depending on the {model size, hardware} combination.
This release:
- Changes the train.sh and val.sh scripts' API so that args are now passed as named command line arguments instead of environment variables (`--num_gpus=2` instead of `NUM_GPUS=2`)
  - This is so that the arguments are now spell-checked by the scripts: previously, if you set `NUM_GPU=2` (no plural `GPU`s), the scripts would silently fall back to the default rather than alerting the user that the provided arg didn't exist
  - The scripts `scripts/legacy/train.sh` and `scripts/legacy/val.sh` still use the former API, but these do not support features introduced after v1.7.1 and will be removed in a future release
- Increases training throughput (see updated training times):
  - Adds batch splitting. This involves splitting the encoder/prediction batches into smaller sub-batches that can run through the joint network & loss without going out-of-memory. This results in higher GPU utilisation and is described in more detail here. See `--batch_split_factor`
  - Uses fewer DALI dataloading processes per core when training with `--num_gpus` > 1. See `--dali_processes_per_cpu`
- Adds background noise augmentation using CAIMAN-ASR-BackgroundNoise. See `--prob_background_noise`
- Standardizes the WER calculation. Hypotheses and transcripts are now normalised with the Whisper EnglishSpellingNormalizer before WERs are calculated, as described here. This is on by default but can be turned off by setting `standardize_wer: false` in the yaml config
- Makes the following miscellaneous changes:
  - Adds the ability to validate on directories of files using `--val_txt_dir` and `--val_audio_dir`, as described here (see the sketch after this list)
  - Removes the valCPU.sh script. Validation on CPU is now performed by passing the `--cpu` flag to val.sh
  - Bumps PyTorch from 2.0 -> 2.1 and Ubuntu from 20.04 -> 22.04
  - Reduces audio volume during narrowband downsampling in order to reduce clipping and improve WER. See `--prob_train_narrowband`
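A sketch of directory-based validation and the CPU fallback; the directory paths are placeholders.

```bash
# Validate on a directory of audio files with matching transcript files.
# Paths are placeholders; see the linked docs for the expected layout.
./scripts/val.sh \
  --val_audio_dir=/datasets/my_eval/audio \
  --val_txt_dir=/datasets/my_eval/txt

# Validation on CPU (replaces the removed valCPU.sh script):
./scripts/val.sh --cpu
```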
train.sh defaults: summary of changes (illustrated below)
- `--half_life_steps=10880` (up from 2805)
- `--prob_background_noise=0.25`. By default, background noise is now added to 25% of utterances
- `--dali_processes_per_cpu=1`
- yaml: `normalize_transcripts: true`
- yaml: `standardize_wer: true`
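An illustrative invocation of the new named-argument API together with these defaults; the `--batch_split_factor` value is an arbitrary example, and the remaining required arguments are omitted.

```bash
# v1.8.0-style invocation: args are named CLI flags rather than env vars.
./scripts/train.sh \
  --num_gpus=2 \
  --half_life_steps=10880 \
  --prob_background_noise=0.25 \
  --dali_processes_per_cpu=1 \
  --batch_split_factor=2   # example value; tune to avoid OOM in the joint/loss
```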
Full Changelog: v1.7.1...v1.8.0
v1.7.1
This release makes small changes. Specifically it adds:
- Narrowband (8 kHz) audio augmentation (off by default). Use with `PROB_TRAIN_NARROWBAND`
- Training profiling (off by default). Use with `PROFILER=true`
- The ability to train on a subset of the data via `N_UTTERANCES_ONLY` (see the sketch after this list)
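Releases up to and including v1.7.1 still used the environment-variable API, so these options would be enabled roughly as follows. The narrowband probability and utterance count are illustrative values, and the script path is assumed.

```bash
# Hypothetical v1.7.1 invocation using the pre-v1.8.0 environment-variable API.
PROB_TRAIN_NARROWBAND=0.25 \
PROFILER=true \
N_UTTERANCES_ONLY=1000 \
./scripts/train.sh
```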
Full Changelog: v1.6.0...v1.7.1
v1.7.1 patch
This patch to v1.7.0:
- Uses top instead of htop for logging CPU usage when `PROFILER=true` (to avoid truncation with a large number of CPUs)
- Correctly sets version numbers
v1.6.0
This release adds support for a new 196M parameter model, a.k.a. "large", improves WER on long utterances, increases training speed and makes a number of smaller changes. For a summary of the {base, large} inference performance, WER and training times, please refer to the top-level README.
This release:
- Adds the `large` model configuration
- Adds 'Random State Passing' (RSP) as in Narayanan et al., 2019. On in-house validation data this improves WER on long utterances by ~40% relative
- Removes the hard-LSTM finetune instructions as we now support soft-LSTMs in hardware
- Makes the following changes to training script defaults (see the sketch after this list):
  - `WEIGHT_DECAY=0.001` -> `WEIGHT_DECAY=0.01`
  - `HOLD_STEPS=10880` -> `HOLD_STEPS=18000`. We find that this, combined with the change to `WEIGHT_DECAY`, results in a ~5% relative reduction in WER
  - `custom_lstm: false` -> `custom_lstm: true` in yaml configs. This is required to support RSP
- Increases training speed (see summary):
  - by packing samples in the loss calculation in order to skip padding computation. This may facilitate higher per-GPU batch sizes
  - for `WebDataset` reading by using multiprocessing
- Makes miscellaneous changes including:
  - setting of `SEED` in the dataloader to make runs deterministic. Previously, data order and weights were deterministic but there was some run-to-run variation due to dither
  - addition of schema checking to ensure trained and exported model checkpoints are compatible with the downstream inference server
  - addition of gradient noise augmentation (off by default)
  - switching the order of `WEIGHTS_INIT_SCALE=0.5` and `forget_gate_bias=1.0` during weight initialisation so that we now (correctly) initialise the LSTM forget gate bias to 1.0
  - code organisation and refactoring (e.g. we add new `Setup` classes to reduce object building repetition)
  - improvements to the Tensorboard launch script
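A rough sketch of the updated training-script defaults. This release pre-dates the named-argument API, so the values are passed as environment variables; the script path is assumed, other arguments are omitted, and the YAML change is shown as a comment.

```bash
# Updated v1.6.0 training-script defaults, passed via the env-var API.
WEIGHT_DECAY=0.01 \
HOLD_STEPS=18000 \
./scripts/train.sh

# Required in the yaml config to support Random State Passing:
#   custom_lstm: true
```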
v1.5.0
v1.5.0 adds support for evaluation on long utterances, improves logging and makes other small fixes.
This release:
- Adds support for validation on long utterances:
  - Adds a `NO_LOSS` arg to the val.sh script to avoid going OOM (use `NO_LOSS=true`; see the sketch after this list)
  - Uses a faster Levenshtein distance calculation
  - Unsets the `MAX_SYMBOLS_PER_SAMPLE` cap on decoding length in the validation scripts
- Improves logging:
  - Fixes incorrectly scaled loss in tensorboard
  - Records configuration and stdout to files
  - Adds per-layer weight & grad norm diagnostics
  - Removes historical MLPerf logging remnants
- Misc:
  - Removes the `SAVE_MILESTONES` arg: the `Checkpointer` class no longer deletes any checkpoints
  - Fixes issues with WebDataset reading: filename parsing and filtering
  - Fixes a race condition with mel-stats export
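For long-utterance validation as described above, a minimal sketch using the env-var API of that era (the script path is assumed, and other arguments are omitted):

```bash
# Skip the loss computation during validation to avoid going OOM on long utterances.
NO_LOSS=true ./scripts/val.sh
```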