From 3011ac2ae02613a176ef8c50277ab38dfe0e4f67 Mon Sep 17 00:00:00 2001
From: mymusise
Date: Mon, 13 Mar 2023 15:35:11 +0800
Subject: [PATCH] add Transformers Requires for `Tokenize datasets`

Signed-off-by: mymusise
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 07527f8..957f206 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ This code was fairly quickly thrown together and may contains many, many bugs. F
 
 ## Tokenize datasets
 
-First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL (each row containing the key `"text"` for the document text), and effectively concatenates, tokenizes, and slices into `max_seq_length` chunks.
+*Requires using the **Transformers** PR [here](https://github.com/huggingface/transformers/pull/21955/), based on the fork [here](https://github.com/zphang/transformers/tree/llama_push). First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL (each row containing the key `"text"` for the document text), and effectively concatenates, tokenizes, and slices into `max_seq_length` chunks.
 
 (This is a quick and dirty script that loads the whole dataset into memory.)
 
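
For context, a minimal sketch of the concatenate / tokenize / slice flow that the edited README paragraph describes (not the repository's actual tokenization script). It assumes the Transformers build from the linked PR is installed so the LLaMA tokenizer can be loaded via `AutoTokenizer`; the tokenizer path, input file name, and `max_seq_length` value are illustrative.

```python
import json

from transformers import AutoTokenizer

# Illustrative values; the real script takes these as command-line arguments.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama_tokenizer")  # assumed local tokenizer dir
max_seq_length = 2048

# Read a JSONL file where each row carries the document under the "text" key,
# tokenize each document, and concatenate all token ids into one long stream.
all_token_ids = []
with open("data.jsonl") as f:  # hypothetical input file
    for line in f:
        text = json.loads(line)["text"]
        all_token_ids.extend(tokenizer.encode(text))

# Slice the stream into fixed-length chunks of max_seq_length tokens,
# dropping any final partial chunk.
chunks = [
    all_token_ids[i : i + max_seq_length]
    for i in range(0, len(all_token_ids) - max_seq_length + 1, max_seq_length)
]
```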