
What is the difference between --user_defined_symbols and --control_symbols #215


Closed
thammegowda opened this issue Oct 20, 2018 · 3 comments

Comments

@thammegowda

Firstly, thanks for making this library! Very useful and easy to use.

I am wondering what is the difference between these two options:

--control_symbols (comma-separated list of control symbols)  type: string  default: 
--user_defined_symbols (comma separated list of user-defined symbols)  type: string  default: 

I guess user_defined_symbols means a way to bypass splitting of some tokens (is that correct?).
I am curious what control_symbols are intended for?

Thanks in advance for your time taken to respond to this.

@taku910
Collaborator

taku910 commented Oct 22, 2018

SentencePiece manages the mapping between vocabulary ids and tokens.

control_symbols only reserves ids for the specified token(s). Even if such a token appears in the input text, it is not handled as a single piece; it is segmented like any other text. The user has to insert the id explicitly after segmentation, as follows:

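import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('model')  # model trained with --control_symbols=<c>
# Encode the plain text, then append the reserved id for '<c>' by hand.
ids = sp.EncodeAsIds('this is a test') + [sp.PieceToId('<c>')]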

This code appends the id of '<c>' to the id sequence for 'this is a test'.

On the other hand, tokens given via --user_defined_symbols are always segmented as a single piece wherever they appear in the input, so they can be written directly in the text:

# Assumes '<c>' was given via --user_defined_symbols when the model was trained.
ids = sp.EncodeAsIds('this is a test<c>')
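
The contrast between the two options can be mimicked without the library. What follows is a toy model in plain Python, not SentencePiece's implementation: the `ToyTokenizer` class, its word-level vocabulary, and the regex-based splitting are all assumptions made for illustration. It only shows that a control symbol has a reserved id but is never matched in the input text, while a user-defined symbol is matched in the input and kept as one piece.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for illustration only (NOT SentencePiece)."""

    def __init__(self, words, control_symbols=(), user_defined_symbols=()):
        self.control = list(control_symbols)
        self.user_defined = list(user_defined_symbols)
        # Both kinds of symbols get a reserved id, ahead of ordinary words.
        self.vocab = {}
        for piece in ["<unk>", *self.control, *self.user_defined, *words]:
            self.vocab.setdefault(piece, len(self.vocab))

    def piece_to_id(self, piece):
        return self.vocab.get(piece, self.vocab["<unk>"])

    def encode_as_pieces(self, text):
        # Only user-defined symbols are recognised inside the raw text.
        if self.user_defined:
            alternatives = "|".join(re.escape(s) for s in self.user_defined)
            chunks = re.split("(" + alternatives + ")", text)
        else:
            chunks = [text]
        pieces = []
        for chunk in chunks:
            if chunk in self.user_defined:
                pieces.append(chunk)  # kept whole
            else:
                # Ordinary text, including any control symbol typed by the
                # user, is split into word runs and single punctuation marks.
                pieces.extend(re.findall(r"\w+|\S", chunk))
        return pieces

    def encode_as_ids(self, text):
        return [self.piece_to_id(p) for p in self.encode_as_pieces(text)]


tok = ToyTokenizer(
    ["this", "is", "a", "test"],
    control_symbols=["<c>"],
    user_defined_symbols=["<sep>"],
)

# User-defined symbol: survives as one piece when written in the input.
print(tok.encode_as_pieces("this is a test<sep>"))
# ['this', 'is', 'a', 'test', '<sep>']

# Control symbol: written in the input, it is broken up like ordinary text.
print(tok.encode_as_pieces("this is a test<c>"))
# ['this', 'is', 'a', 'test', '<', 'c', '>']

# Control symbol: its id has to be appended by hand after encoding.
print(tok.encode_as_ids("this is a test") + [tok.piece_to_id("<c>")])
# [3, 4, 5, 6, 1]
```

In the real library the split of a control symbol typed into the input depends on the trained model; the toy only reproduces the fact that it is not kept as one piece.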

For experiments, user-defined symbols are the easier option, since you can control the behavior just by editing the input. For a user-facing product, however, they may not be appropriate, because users can change the system's behavior by injecting these special symbols into their input.
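
If user-defined symbols are used with untrusted input anyway, one possible mitigation is to strip them from the text before encoding. This is a minimal sketch under assumptions: `strip_reserved_symbols` is a hypothetical helper, not part of SentencePiece, and the symbol list has to match whatever was passed to --user_defined_symbols at training time.

```python
def strip_reserved_symbols(text, reserved_symbols):
    """Remove every occurrence of each reserved symbol from untrusted text.

    The removal is repeated until the text stops changing, so a symbol
    cannot be re-formed by deleting an inner occurrence
    (for example '<<c>c>' -> '<c>' -> '').
    """
    previous = None
    while previous != text:
        previous = text
        for symbol in reserved_symbols:
            text = text.replace(symbol, "")
    return text


# Hypothetical reserved symbol; use the list the model was trained with.
user_text = "this is a test<c>"
clean_text = strip_reserved_symbols(user_text, ["<c>"])
print(clean_text)  # this is a test
```

The cleaned string would then be passed to `sp.EncodeAsIds` as usual. Whether silently dropping the symbols or rejecting the input is the better policy depends on the product.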

@thammegowda
Author

thanks 💯

@kitkhai

kitkhai commented Jan 3, 2024

Hi @taku910 !
Just curious, when we define a list of user_defined_symbols when training a SentencePiece (BPE) model, will there be corresponding merge rules so that the user_defined_symbols tokens stay as one during segmentation?
