What is the difference between --user_defined_symbols and --control_symbols #215

thammegowda · 2018-10-20T01:19:05Z

Firstly, thanks for making this library! Very useful and easy to use.

I am wondering what is the difference between these two options:

--control_symbols (comma-separated list of control symbols)  type: string  default: 
--user_defined_symbols (comma separated list of user-defined symbols)  type: string  default:

I guess user_defined_symbols means a way to bypass splitting of some tokens (is that correct?).
I am curious what control_symbols are intended for?

Thanks in advance for your time taken to respond to this.


taku910 · 2018-10-22T05:16:24Z

SentencePiece manages the mapping between vocabulary ids and tokens.

--control_symbols only reserves ids for the specified token(s). Even if such a token appears in the input text, it is not treated as a single token, so the user has to insert its id explicitly after segmentation, as follows:

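import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('model')
# '<c>' is a control symbol here, so its reserved id is appended by hand
ids = sp.EncodeAsIds('this is a test') + [sp.PieceToId('<c>')]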

This code appends the id of '<c>' to the id sequence for 'this is a test'.

On the other hand, tokens given with --user_defined_symbols are always segmented as one symbol wherever they appear in the input. So, if '<c>' is a user-defined symbol, we can simply call:

ids = sp.EncodeAsIds('this is a test<c>')

For experiments, user-defined symbols are the easier option, since you can control the behavior just by tweaking the input text. However, when you deploy the system as a user-facing product, user-defined symbols are not appropriate, because users can change the behavior by injecting these special symbols into their input.
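To make the contrast concrete, here is a minimal pure-Python sketch of the two behaviors. It is a toy character-level encoder, not SentencePiece's implementation: `make_toy_encoder`, its vocabulary layout, and the symbols `<c>` and `<u>` are all hypothetical and exist only for illustration. Both kinds of symbols get an id, but only user-defined symbols are looked for in the input text.

```python
import re


def make_toy_encoder(chars, control_symbols, user_defined_symbols):
    """Build a character-level toy encoder (hypothetical, illustration only).

    Assumes user_defined_symbols is non-empty.
    """
    pieces = ["<unk>", *control_symbols, *user_defined_symbols, *chars]
    piece_to_id = {piece: i for i, piece in enumerate(pieces)}
    char_to_id = {ch: piece_to_id[ch] for ch in chars}
    # Only user-defined symbols are searched for in the input text.
    # Control symbols have ids above but are deliberately left out of the pattern.
    pattern = re.compile(
        "(" + "|".join(re.escape(s) for s in user_defined_symbols) + ")"
    )

    def encode(text):
        ids = []
        for chunk in pattern.split(text):
            if chunk in user_defined_symbols:
                # A user-defined symbol is always emitted as one id.
                ids.append(piece_to_id[chunk])
            else:
                # Everything else, including control-symbol text, is
                # encoded character by character.
                ids.extend(char_to_id.get(ch, piece_to_id["<unk>"]) for ch in chunk)
        return ids

    return encode, piece_to_id


chars = sorted(set("this is a test<>cu"))
encode, piece_to_id = make_toy_encoder(chars, ["<c>"], ["<u>"])

as_user_defined = encode("this is a test<u>")  # '<u>' becomes a single id
as_control_text = encode("this is a test<c>")  # '<c>' is split into '<', 'c', '>'
with_control_id = encode("this is a test") + [piece_to_id["<c>"]]  # id added by hand
```

In this toy, typing `<u>` into the text is enough to produce its id, which is exactly why such symbols can be injected by end users, while the id of `<c>` can only be added by the calling code.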

thammegowda · 2018-10-22T16:59:22Z

thanks 💯

kitkhai · 2024-01-03T16:39:28Z

Hi @taku910 !
Just curious, when we define a list of user_defined_symbols when training a SentencePiece (BPE) model, will there be corresponding merge rules so that the user_defined_symbols tokens stay as one during segmentation?

thammegowda closed this as completed Oct 22, 2018

taku910 mentioned this issue Mar 15, 2019

Control symbols (here it is [CLS] and [SEP] ) are tokenized which should not be #306

Closed

taku910 mentioned this issue Feb 11, 2021

Clarification: issue regarding encoding meta tokens #624

Closed

taku910 mentioned this issue Jul 4, 2021

Unable to tokenize <s>, <pad>, </s>, and <unk> correctly in Python #667

Closed

taku910 mentioned this issue Mar 29, 2023

Question about encode sentence pieces? #838

Closed

taku910 mentioned this issue Nov 27, 2023

How to use EncodeAsIds for text that contains `<s>/</s>' #940

Closed
