Control symbols (here [CLS] and [SEP]) are tokenized, which should not happen #306

weiczhu · 2019-03-15T06:46:07Z

Loaded a trained SentencePiece model.
['▁', '[', 'C', 'L', 'S', ']', 'あ', 'と', '、', '梱包', 'に', '関', 'して', 'は', '、', '塩', 'が', '袋', 'から', '少し', '漏', 'れ', '出', 'て', 'いた', 'ので', '少し', '雑', 'な', '印', '象', 'を', '受', 'け', 'ました', '。', '[', 'S', 'E', 'P', ']']

We should expect that the tokenizer NEVER breaks the control symbols.

taku910 · 2019-03-15T08:19:28Z

You might want to use user defined symbols instead of control symbols.

% spm_train --input=... --model_prefix=... --user_defined_symbols=[CLS],[SEP]

You can see #215 for more details.

DomHudson · 2019-10-26T19:57:58Z

Note: If you want to mimic BERT, use:

            --unk_piece=[UNK]
            --pad_piece=[PAD]
            --user_defined_symbols=[CLS],[SEP],[MASK]
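For anyone scripting this, here is a minimal sketch that assembles the same command line in Python. It uses only the flags quoted in this thread. The helper name `build_spm_train_args` and the file names `corpus.txt` and `bert_like` are made up for illustration, and other options you would normally set (vocabulary size, for example) are left out.

```python
from typing import List, Sequence


def build_spm_train_args(
    input_path: str,
    model_prefix: str,
    user_defined_symbols: Sequence[str] = ("[CLS]", "[SEP]", "[MASK]"),
    unk_piece: str = "[UNK]",
    pad_piece: str = "[PAD]",
) -> List[str]:
    """Assemble an spm_train argument list from the flags quoted in this thread.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    return [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--unk_piece={unk_piece}",
        f"--pad_piece={pad_piece}",
        # Symbols are passed as one comma-separated value.
        "--user_defined_symbols=" + ",".join(user_defined_symbols),
    ]


# Placeholder paths; replace with your own corpus and model prefix.
args = build_spm_train_args("corpus.txt", "bert_like")
print(" ".join(args))
```

Passing the list to `subprocess.run(args, check=True)` rather than pasting it into a shell avoids having to quote the square brackets, assuming `spm_train` is on your PATH.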

wppply · 2020-07-30T18:31:11Z

Quoting DomHudson: "Note: If you want to mimic BERT, use:"

            --unk_piece=[UNK]
            --pad_piece=[PAD]
            --user_defined_symbols=[CLS],[SEP],[MASK]

In this case, the initial [CLS] is tokenized into two pieces ('▁' and '[CLS]'). Will that affect the training of BERT?

['▁', '[CLS]', '▁good', '▁good', '▁stud', 'y', ',', '▁day', 'day', '▁up', '[SEP]']
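Whether the extra '▁' piece affects BERT training is not answered in this thread. If you would rather not have it, one possible workaround is to post-process the pieces and drop a standalone '▁' that sits directly before a special symbol. This is only a sketch under that assumption; `drop_marker_before_symbols` is a hypothetical helper, not part of SentencePiece.

```python
from typing import List, Sequence

SPACE_MARKER = "\u2581"  # the "▁" whitespace marker used by SentencePiece


def drop_marker_before_symbols(
    pieces: Sequence[str],
    symbols: Sequence[str] = ("[CLS]", "[SEP]", "[MASK]"),
) -> List[str]:
    """Remove a standalone '▁' piece when the next piece is a special symbol."""
    cleaned: List[str] = []
    for i, piece in enumerate(pieces):
        next_piece = pieces[i + 1] if i + 1 < len(pieces) else None
        if piece == SPACE_MARKER and next_piece in symbols:
            continue  # skip the lone marker in front of the symbol
        cleaned.append(piece)
    return cleaned


# The pieces reported above.
pieces = ['▁', '[CLS]', '▁good', '▁good', '▁stud', 'y', ',',
          '▁day', 'day', '▁up', '[SEP]']
cleaned = drop_marker_before_symbols(pieces)
print(cleaned)
```

Another option, also not confirmed here, is to tokenize the raw text without the markers and add the [CLS] and [SEP] pieces yourself afterwards.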

taku910 closed this as completed Mar 15, 2019

haven-jeon mentioned this issue Dec 17, 2019

Tokenizer issue with special tokens such as [SEP] and [CLS] SKTBrain/KoBERT#11

Closed

broken mentioned this issue Jul 21, 2021

Is there any way to handle / preserve SPECIAL tokens? tensorflow/text#656

Closed
