Add support for Scriptable BERT tokenizer #1707
Conversation
This reverts commit a0caeb1.
Just left a couple of noob questions and suggestions. Overall LGTM. Thanks for adding the BERT tokenizer to torchtext @parmeet! This looks like it was a very complex class to implement 🚀
```python
for text in input:
    if self._return_tokens:
        tokens.append(self._tokenize(text))
    else:
        tokens.append(self._encode(text))
```
Is there a way we could pass in the batch input directly to the C++ kernel and do the for-loop in the kernel itself? As we've seen in previous benchmarking efforts, a lot of time is spent on passing data back and forth between Python and C++ and we may be able to get significant perf gains just by passing the entire list in one go.
Yes, that's a great idea. Let's do it in a follow-up PR.
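For illustration, a sketch of what the batched path can look like (this mirrors the `_batch_encode` method that shows up in a traceback later in this thread; the return type here is an assumption):

```python
from typing import List

def _batch_encode(self, text: List[str]) -> List[List[int]]:
    # One Python -> C++ crossing for the entire batch,
    # instead of one crossing per input string.
    return self.bert_model.batch_encode([t.strip() for t in text])
```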
```cpp
namespace torchtext {

typedef std::basic_string<uint32_t> UString;
```
Are we using `std::basic_string` here because the text being passed in from Python contains Unicode which isn't compatible with `std::string`?
The string passed from Python is UTF-8 encoded bytes. `UString` is the container that stores the Unicode code points when converting the string to code points and back.
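To make the distinction concrete, a small Python sketch (illustrative only, not part of the PR): the UTF-8 string is a variable-length byte sequence, while `UString` holds one fixed-width code point per character:

```python
text = "héllo"
utf8_bytes = text.encode("utf-8")     # b'h\xc3\xa9llo': 6 bytes for 5 characters
code_points = [ord(c) for c in text]  # [104, 233, 108, 108, 111]: what UString stores
```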
```cpp
namespace torchtext {

std::string BERTEncoder::kUnkToken = "[UNK]";
```
Noob question: why do we make `kUnkToken` a `static` property rather than a constant?
Ohh, just following the original implementation :).
Thanks @Nayef211 for the thorough review and feedback. I have addressed the comments :).
@parmeet I'm also seeing failures for
Yaa, looking at them. Not sure what went wrong, since the tests are passing locally. Will look into the CI results.
Hello 🙋🏻♂️ I was trying to script/trace the new `BERTTokenizer`. Here is my example of how I tried to trace it:

```python
import torch
import torchtext.transforms as T
from torchtext.utils import get_asset_local_path

bert_base_uncased_vocab_file = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"

# Instantiate tokenizer with lower casing and return_tokens=True
# (returning token IDs instead is also supported)
bert_tokenizer = T.BERTTokenizer(
    get_asset_local_path(bert_base_uncased_vocab_file),
    do_lower_case=True,
    strip_accents=None,
    return_tokens=True,
)
traced_tokenizer = torch.jit.trace(bert_tokenizer, "test")
```

Here is the error:

```
RuntimeError:
Module 'BERTTokenizer' has no attribute 'bert_model' (This attribute exists on the Python module, but we failed to convert Python type: 'torchtext._torchtext.BERTEncoder' to a TorchScript type. Only tensors and (possibly nested) tuples of tensors, lists, or dicts are supported as inputs or outputs of traced functions, but instead got value of type BERTEncoder.. Its type was inferred; try adding a type annotation for the attribute.):
  File "/home/ubuntu/miniconda3/envs/optimum/lib/python3.8/site-packages/torchtext/transforms.py", line 603
    def _batch_encode(self, text: List[str]) -> List[List[str]]:
        """Batch version of _encode i.e operate on list of str"""
        token_ids: List[List[int]] = self.bert_model.batch_encode([t.strip() for t in text])
                                     ~~~~~~~~~~~~~~~ <--- HERE
        tokens_ids_str: List[List[str]] = [[str(t) for t in token_id] for token_id in token_ids]
        return tokens_ids_str
```
@philschmid could you try `torch.jit.script` for scripting?
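For reference, a minimal sketch of the suggested approach, reusing `bert_tokenizer` from the example above (the sample input is arbitrary):

```python
import torch

# The tokenizer is designed to be scriptable (the point of this PR),
# so use scripting rather than tracing.
scripted_tokenizer = torch.jit.script(bert_tokenizer)
print(scripted_tokenizer("test"))
```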
This PR adds support for a scriptable BERT tokenizer.
Initial Implementation: Our implementation is derived from the work at https://github.com/LieluoboAi/radish. We have made the following major amendments to their implementation:
Testing
Verified that the results match the HF BERT Tokenizer on the EnWik9 dataset (13,147,026 rows).
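A sketch of the kind of parity check this describes (assuming the Hugging Face `transformers` package is installed; `vocab.txt` and `lines` are placeholders for the real vocab file and the EnWik9 rows):

```python
from transformers import BertTokenizer  # Hugging Face reference implementation
import torchtext.transforms as T

hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = T.BERTTokenizer("vocab.txt", do_lower_case=True, return_tokens=True)

lines = ["Anarchism is a political philosophy."]  # stand-in for the EnWik9 rows

for line in lines:
    # Both should produce the same wordpiece tokens for each row.
    assert tt_tokenizer(line) == hf_tokenizer.tokenize(line)
```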
Usage
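A minimal usage sketch (the vocab path is a placeholder; the constructor arguments mirror the example from the discussion above):

```python
import torchtext.transforms as T

VOCAB_FILE = "vocab.txt"  # placeholder path to a BERT vocab file

# Return wordpiece tokens
tokenizer = T.BERTTokenizer(VOCAB_FILE, do_lower_case=True, strip_accents=None, return_tokens=True)
print(tokenizer("Hello World, How are you!"))

# Return token IDs instead
encoder = T.BERTTokenizer(VOCAB_FILE, do_lower_case=True, strip_accents=None, return_tokens=False)
print(encoder("Hello World, How are you!"))
```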
Follow-up: