The problem with scalars #233
Comments
Hi @kzelias, what is your code doing, exactly? |
It's just a task over hydra.
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
from clearml import Task

CONFIG_NAME = "fastconformer_287_start_tune_b128_lr2e-5"


@hydra_runner(config_path="../../cfg_train/conformers/cvm", config_name=CONFIG_NAME)
def main(cfg):
    task = Task.init(project_name="ap-models", task_name=CONFIG_NAME)
    logger = task.get_logger()
    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecHybridRNNTCTCBPEModel(cfg=cfg.model, trainer=trainer)
    # Initialize the weights of the model from another model, if provided via config
    print("------INITING FROM PRETRAIN------")
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)
    print("------INITED------")
    logging.info(f'MODEL train_ds config: {asr_model.cfg.train_ds}')
    logging.info(f'MODEL optim config: {asr_model.cfg.optim}')
    trainer.fit(asr_model)


if __name__ == '__main__':
    main()  # noqa pylint: disable=no-value-for-parameter
|
UPD: At the beginning of training the scalars work; after 5-10 thousand steps, this error appears. |
This might be an issue with Elastic. Can you check the Elastic Docker container logs? |
The error persisted for one week. It disappeared today. |
It's using Elastic |
The situation repeated itself. This time, the API server rebooted quickly. apiserver:
|
Can you share your code? Something seems to be causing an illegal query, but I can't figure out what it is |
My code is here.
Server deployed by helm:
- name: elasticsearch
  repository: https://charts.bitnami.com/bitnami
  version: 7.17.3 |
Some more logs from apiserver:
|
@kzelias the last server version has some fixes that are related to this issue - can you try with v1.15.0? |
I got the exact same issue. I added mitmproxy between ClearML and Elastic and saw this:
==== REQUEST ====
POST http://10.42.0.84:9200/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b/_search
Headers[(b'Host', b'10.42.0.84:9200'), (b'Accept-Encoding', b'identity'), (b'Content-Length', b'128'), (b'user-agent', b'elasticsearch-py/8.17.0 (Python/3.9.21; elastic-transport/8.15.1)'), (b'connection', b'keep-alive'), (b'accept', b'application/vnd.elasticsearch+json; compatible-with=8'), (b'content-type', b'application/vnd.elasticsearch+json; compatible-with=8'), (b'x-elastic-client-meta', b'es=8.17.0,py=3.9.21,t=8.15.1,ur=1.26.20')]
{"size":10000,"query":{"bool":{"must":[{"terms":{"task":["c3e7df9e60634bb381608beb42e131b9"]}},{"term":{"iter":-2147483648}}]}}}
=================
==== RESPONSE ====
400
Headers[(b'X-elastic-product', b'Elasticsearch'), (b'content-type', b'application/json; charset=UTF-8'), (b'content-length', b'1888')]
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b","node":"700B4W_-QwGvSnzaqTOIKQ","reason":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory.","caused_by":{"type":"illegal_argument_exception","reason":"Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [metric] in order to load field data by uninverting the inverted index. Note that this can use significant memory."}}},"status":400} |
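The 400 response above is long and repeats the same reason several times. As a reading aid, here is a minimal sketch that pulls out the parts that matter: the root-cause type, the field named in the `fielddata=true on [...]` hint, and the affected index. The helper name `summarize_es_error` is hypothetical, and `SAMPLE_BODY` is an abbreviated copy of the response (shortened reason text, one shard entry), not the full payload. The sketch also confirms that the `iter` value in the request equals the smallest signed 32-bit integer.

```python
import json
import re

# Smallest signed 32-bit integer; the "iter" term in the captured request has this value.
INT32_MIN = -(2 ** 31)

# Abbreviated copy of the captured 400 response (assumption: trimmed for readability).
SAMPLE_BODY = json.dumps({
    "error": {
        "root_cause": [{
            "type": "illegal_argument_exception",
            "reason": "Text fields are not optimised for operations that require "
                      "per-document field data like aggregations and sorting. "
                      "Alternatively, set fielddata=true on [metric] in order to "
                      "load field data by uninverting the inverted index.",
        }],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "failed_shards": [{
            "shard": 0,
            "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
        }],
    },
    "status": 400,
})


def summarize_es_error(body: str) -> dict:
    """Hypothetical helper: extract root-cause type, offending field, and indices."""
    error = json.loads(body)["error"]
    root = error["root_cause"][0]
    # The field name appears in square brackets after "fielddata=true on".
    match = re.search(r"fielddata=true on \[([^\]]+)\]", root["reason"])
    return {
        "type": root["type"],
        "field": match.group(1) if match else None,
        "indices": sorted({s["index"] for s in error.get("failed_shards", [])}),
    }


summary = summarize_es_error(SAMPLE_BODY)
print(summary["type"], summary["field"], summary["indices"])
print(INT32_MIN)
```

In this capture the offending field is `metric`, which Elasticsearch reports as a text field in the scalar events index.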
OK, I think our issue was that we had to replace Elastic (we had created it with the wrong volumes), and afterwards it didn't rerun the migrations ... |
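One way to check for this situation is to inspect the index mapping. When an index is created without an explicit mapping, Elasticsearch's default dynamic mapping stores string values as `text` with a `keyword` sub-field, which matches the "Text fields are not optimised" error on `metric`. The sketch below assumes you have already fetched the JSON from `GET /<index>/_mapping`; the helper name `find_text_fields`, the sample index name, and the choice of fields to check are illustrative assumptions, and the sketch makes no claim about what mapping the ClearML server itself expects.

```python
def find_text_fields(mapping_response: dict, fields=("metric", "task")) -> dict:
    """Hypothetical helper: report which of `fields` are mapped as `text` per index.

    `mapping_response` is the parsed JSON returned by GET /<index>/_mapping,
    shaped as {index_name: {"mappings": {"properties": {...}}}}.
    """
    report = {}
    for index_name, body in mapping_response.items():
        properties = body.get("mappings", {}).get("properties", {})
        as_text = [f for f in fields if properties.get(f, {}).get("type") == "text"]
        if as_text:
            report[index_name] = as_text
    return report


# Illustrative response: "metric" was dynamically mapped (text + keyword sub-field),
# while "task" has an explicit keyword mapping. The index name is made up.
sample_mapping = {
    "events-training_stats_scalar-example": {
        "mappings": {
            "properties": {
                "metric": {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
                },
                "task": {"type": "keyword"},
                "iter": {"type": "long"},
            }
        }
    }
}

suspicious = find_text_fields(sample_mapping)
print(suspicious)
```

A non-empty result suggests the index was created through dynamic mapping rather than a predefined template, which is consistent with the migrations not having been rerun after Elastic was replaced.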
Hello! I have two identical experiments.
For the first one, the scalars are displayed correctly; for the second one, I get an error. The rest of the parameters are logged correctly; the problem is only with the scalars.
What could be the reason?
Work:

Error:
