What is the difference between --user_defined_symbols and --control_symbols #215
Comments
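SentencePiece manages the mapping between vocabulary ids and tokens. --control_symbols only reserves ids for the specified token(s). So even if such a token appears in the input text, it is not handled as a single piece; it is segmented like ordinary text. The user has to insert its id explicitly after segmentation, by appending the id of the control symbol to the id sequence returned by the encoder.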
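A minimal sketch of that pattern, assuming `sp` is a loaded `sentencepiece.SentencePieceProcessor` whose model was trained with a control symbol. The symbol name `<foo>` used below and the helper name `encode_with_control_symbol` are hypothetical; only `EncodeAsIds` and `PieceToId` come from the SentencePiece Python wrapper.

```python
from typing import List


def encode_with_control_symbol(sp, text: str, symbol: str) -> List[int]:
    """Encode plain text, then append the reserved id of a control symbol.

    `sp` is assumed to be a loaded sentencepiece.SentencePieceProcessor.
    `symbol` is assumed to have been passed to --control_symbols at training
    time, so it has an id but is never matched inside the input text.
    """
    # Segment only the plain text; the control symbol is not part of it.
    ids = sp.EncodeAsIds(text)
    # Look up the reserved id and insert it explicitly after segmentation.
    return ids + [sp.PieceToId(symbol)]
```

With a trained model this could be called as `encode_with_control_symbol(sp, 'this is a test', '<foo>')`.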
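The control symbol's id is thus inserted after the id sequence for 'this is a test'. Tokens given with --user_defined_symbols, on the other hand, are always segmented as a single symbol wherever they appear in the input, so text containing the symbol can be passed to the encoder directly.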
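As a rough illustration of the user-defined case, the sketch below writes the symbol into the raw text before encoding. It assumes `sp` is a loaded `sentencepiece.SentencePieceProcessor` whose model was trained with the symbol listed under --user_defined_symbols. The symbol name `<foo>` and the helper name `encode_text_with_symbol` are hypothetical.

```python
from typing import List


def encode_text_with_symbol(sp, text: str, symbol: str) -> List[int]:
    """Encode text that already contains a user-defined symbol.

    `sp` is assumed to be a loaded sentencepiece.SentencePieceProcessor.
    `symbol` is assumed to have been passed to --user_defined_symbols at
    training time, so the encoder keeps it as one piece when it sees it.
    """
    # No id is inserted by hand: the symbol is part of the input string and
    # the encoder is expected to map it to its own id.
    return sp.EncodeAsIds(f"{text} {symbol}")
```

With a trained model this could be called as `encode_text_with_symbol(sp, 'this is a test', '<foo>')`. The behavior is controlled entirely by the input string, which is the property the next paragraph weighs.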
For experimental purposes, user-defined symbols are the easier option, because you can control the behavior just by tweaking the input. However, when you deploy the system as a user-facing product, user-defined symbols are not appropriate, because users can change the behavior by injecting these special symbols into their input.
Thanks 💯
Hi @taku910!
Firstly, thanks for making this library! Very useful and easy to use.
I am wondering what the difference is between these two options: --user_defined_symbols and --control_symbols.
I guess user_defined_symbols is a way to bypass splitting of some tokens (is that correct?). I am curious what control_symbols are intended for.
Thanks in advance for taking the time to respond to this.