Loaded a trained SentencePiece model. ['▁', '[', 'C', 'L', 'S', ']', 'あ', 'と', '、', '梱包', 'に', '関', 'して', 'は', '、', '塩', 'が', '袋', 'から', '少し', '漏', 'れ', '出', 'て', 'いた', 'ので', '少し', '雑', 'な', '印', '象', 'を', '受', 'け', 'ました', '。', '[', 'S', 'E', 'P', ']']
We should expect the tokenizer to NEVER break the control symbols.
You might want to use user-defined symbols instead of control symbols.
% spm_train --input=... --model_prefix=... --user_defined_symbols=[CLS],[SEP]
You can see #215 for more details.
Note: If you want to mimic BERT, use
--unk_piece=[UNK] --pad_piece=[PAD] --user_defined_symbols=[CLS],[SEP],[MASK]
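Put together, a full training command might look like the following (a sketch; the corpus path, model prefix, and vocab size are placeholders):

```shell
% spm_train --input=corpus.txt --model_prefix=bert_like --vocab_size=32000 \
    --unk_piece=[UNK] --pad_piece=[PAD] \
    --user_defined_symbols=[CLS],[SEP],[MASK]
```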
In this case, the initial [CLS] will be tokenized into two pieces; will that affect the training of BERT?
['▁', '[CLS]', '▁good', '▁good', '▁stud', 'y', ',', '▁day', 'day', '▁up', '[SEP]']