
Add support for Scriptable BERT tokenizer #1707

Merged: 39 commits merged into pytorch:main on May 25, 2022

Conversation

@parmeet (Contributor) commented May 10, 2022

This PR adds support for a scriptable BERT tokenizer.

Initial implementation: Our implementation is derived from https://github.com/LieluoboAi/radish. We have made the following major amendments to their implementation:

  • Replaced usage of utfcpp with utf8proc for converting strings to and from Unicode. This removes the additional dependency on utfcpp.
  • Replaced usage of std::unordered_map with torchtext's Vocab implementation to perform look-ups.
  • Fixed the wordpiece (max_seg_) algorithm to match the HuggingFace (HF) implementation.
  • Fixed a stripping issue by sending stripped text directly from Python (a \u2048 at the end of a string cannot be removed trivially in C++).
  • Replaced their whitespace-based string splitting with torchtext's split_.
  • Perform to_lower directly on Unicode strings; also corrected the logic of combining the flags for lowering and stripping accents to match the HF implementation.
  • Fixed the _is_control implementation to match the HF implementation.
  • Removed the comparison with kChinesePunts to match HF's implementation of _is_punctuation.
  • Changed the UString type from uint16_t to uint32_t: on rare occasions a Unicode code point cannot fit in uint16_t, which caused errors (see the illustration below).
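
On the last point, a quick Python illustration (illustrative only) of a code point that does not fit in a 16-bit type:

cp = ord("😀")      # U+1F600, outside the Basic Multilingual Plane
print(hex(cp))      # 0x1f600
print(cp > 0xFFFF)  # True: needs a 32-bit code unit (uint32_t)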

Testing

Verified that the results match the HF BERT tokenizer on the EnWik9 dataset (13,147,026 rows).
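
A minimal sketch of this kind of parity check, assuming the transformers package is installed (this is illustrative, not the exact benchmarking script):

from transformers import BertTokenizer  # HF reference tokenizer

hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def check_parity(torchtext_tokenizer, texts):
    # torchtext_tokenizer: a T.BERTTokenizer constructed with
    # return_tokens=True (see the usage section below)
    for text in texts:
        assert torchtext_tokenizer(text) == hf_tokenizer.tokenize(text)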

Usage

import torchtext.transforms as T
from torchtext.utils import get_asset_local_path

bert_base_uncased_vocab_file = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"

# Instantiate the tokenizer with lower casing and return_tokens=True
# (returning token IDs instead is also supported)
bert_tokenizer = T.BERTTokenizer(
    get_asset_local_path(bert_base_uncased_vocab_file),
    do_lower_case=True,
    strip_accents=None,
    return_tokens=True,
)

# non-batch API
tokens = bert_tokenizer("Hello world")
# out: ['hello', 'world']

# batch API 
tokens = bert_tokenizer(["Hello world","How are you!"])
# out: [['hello', 'world'], ['how', 'are', 'you', '!']]
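
Setting return_tokens=False returns token IDs instead of token strings; a minimal sketch with the same constructor arguments (the IDs come back as strings):

# Return token IDs (as strings) instead of token text
bert_id_tokenizer = T.BERTTokenizer(
    get_asset_local_path(bert_base_uncased_vocab_file),
    do_lower_case=True,
    strip_accents=None,
    return_tokens=False,
)
ids = bert_id_tokenizer("Hello world")
# out: the vocabulary indices for ['hello', 'world']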

Follow-up:

  • Perform batch processing directly in C++ instead of iterating over input sentences in Python

@parmeet parmeet marked this pull request as ready for review May 17, 2022 19:25
@parmeet parmeet changed the title [WIP] Add support for Scriptable BERT tokenizer Add support for Scriptable BERT tokenizer May 17, 2022
@Nayef211 (Contributor) left a comment


Just left a couple of noob questions and suggestions. Overall LGTM. Thanks for adding the BERT tokenizer to torchtext @parmeet! This looks like it was a very complex class to implement 🚀

Comment on lines +597 to +601
for text in input:
    if self._return_tokens:
        tokens.append(self._tokenize(text))
    else:
        tokens.append(self._encode(text))
Contributor

Is there a way we could pass in the batch input directly to the C++ kernel and do the for-loop in the kernel itself? As we've seen in previous benchmarking efforts, a lot of time is spent on passing data back and forth between Python and C++ and we may be able to get significant perf gains just by passing the entire list in one go.

Contributor Author

Yes, that's a great idea. Let's do it in a follow-up PR.
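
A hypothetical sketch of what that follow-up could look like on the Python side (the batch_tokenize name is illustrative, not the final API):

from typing import List

def _batch_tokenize(self, texts: List[str]) -> List[List[str]]:
    # Hypothetical: a single call into the C++ kernel, which iterates
    # over the batch internally instead of crossing the Python/C++
    # boundary once per sentence
    return self.bert_model.batch_tokenize([t.strip() for t in texts])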


namespace torchtext {

typedef std::basic_string<uint32_t> UString;
Contributor

Are we using std::basic_string here because the text being passed in from Python contains unicode which isn't compatible with std::string?

Contributor Author

The string passed from Python is UTF-8 encoded bytes. UString is the container that stores the Unicode code points when converting a string to and from Unicode.
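
To illustrate the distinction on the Python side (illustrative only):

s = "café"
print(list(s.encode("utf-8")))  # 5 UTF-8 bytes: [99, 97, 102, 195, 169]
print([ord(c) for c in s])      # 4 code points: [99, 97, 102, 233]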


namespace torchtext {

std::string BERTEncoder::kUnkToken = "[UNK]";
Contributor

Noob question: why do we make kUnkToken a static property rather than a constant?

Contributor Author

Ohh, just following the original implementation :).

@parmeet (Contributor Author) commented May 25, 2022

Thanks @Nayef211 for the thorough review and feedback. I have addressed the comments :).

@parmeet (Contributor Author) commented May 25, 2022

Torchtext CI is failing for Windows on Python 3.7. It seems the following is the culprit:

error: can't copy 'build\lib.win-amd64-3.7\pyd': doesn't exist or not a regular file

Any suggestions @mthrok, @atalman on what's going on here?

cc: @Nayef211

@Nayef211 (Contributor) commented

@parmeet I'm also seeing failures for test_bert_tokenizer on several platforms. Let's try to fix these before landing

@parmeet (Contributor Author) commented May 25, 2022

> @parmeet I'm also seeing failures for test_bert_tokenizer on several platforms. Let's try to fix these before landing

Yeah, looking at them. Not sure what went wrong; the tests are passing locally. Will look into the CI results.

@parmeet parmeet merged commit da509e1 into pytorch:main May 25, 2022
@parmeet parmeet deleted the bert_tokenizer branch May 25, 2022 15:49
@philschmid commented

Hello 🙋🏻‍♂️

I was trying to script/trace the new BERTTokenizer but without any success. What does scriptable mean to you in this context?

Here is my example of how I tried to trace it:

import torch
import torchtext.transforms as T
from torchtext.utils import get_asset_local_path

bert_base_uncased_vocab_file = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
bert_tokenizer = T.BERTTokenizer(
    get_asset_local_path(bert_base_uncased_vocab_file),
    do_lower_case=True,
    strip_accents=None,
    return_tokens=True,
)

traced_tokenizer = torch.jit.trace(bert_tokenizer, "test")

Here is the error:

RuntimeError: 
Module 'BERTTokenizer' has no attribute 'bert_model' (This attribute exists on the Python module, but we failed to convert Python type: 'torchtext._torchtext.BERTEncoder' to a TorchScript type. Only tensors and (possibly nested) tuples of tensors, lists, or dicts are supported as inputs or outputs of traced functions, but instead got value of type BERTEncoder.. Its type was inferred; try adding a type annotation for the attribute.):
  File "/home/ubuntu/miniconda3/envs/optimum/lib/python3.8/site-packages/torchtext/transforms.py", line 603
    def _batch_encode(self, text: List[str]) -> List[List[str]]:
        """Batch version of _encode i.e operate on list of str"""
        token_ids: List[List[int]] = self.bert_model.batch_encode([t.strip() for t in text])
                                     ~~~~~~~~~~~~~~~ <--- HERE
        tokens_ids_str: List[List[str]] = [[str(t) for t in token_id] for token_id in token_ids]
        return tokens_ids_str

@parmeet (Contributor Author) commented Jul 6, 2022

@philschmid could you try with torch.jit.script for scripting?
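
For reference, a minimal sketch of that suggestion, reusing the tokenizer from the usage example above (the serialization shown is standard TorchScript usage, not specific to this PR):

import torch

scripted_tokenizer = torch.jit.script(bert_tokenizer)
tokens = scripted_tokenizer("Hello world")

# The scripted module can be saved and loaded outside of Python
torch.jit.save(scripted_tokenizer, "bert_tokenizer.pt")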
