From 3011ac2ae02613a176ef8c50277ab38dfe0e4f67 Mon Sep 17 00:00:00 2001
From: mymusise
Date: Mon, 13 Mar 2023 15:35:11 +0800
Subject: [PATCH] add Transformers Requires for `Tokenize datasets`

Signed-off-by: mymusise
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 07527f8..957f206 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ This code was fairly quickly thrown together and may contains many, many bugs. F
 
 ## Tokenize datasets
 
-First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL (each row containing the key `"text"` for the document text), and effectively concatenates, tokenizes, and slices into `max_seq_length` chunks.
+*Requires using the **Transformers** PR [here](https://github.com/huggingface/transformers/pull/21955/), based on the fork [here](https://github.com/zphang/transformers/tree/llama_push). First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL (each row containing the key `"text"` for the document text), and effectively concatenates, tokenizes, and slices into `max_seq_length` chunks.
 
 (This is a quick and dirty script that loads the whole dataset into memory.)
 
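
For context, a minimal sketch of the concatenate / tokenize / slice flow that the edited README paragraph describes (not the repository's actual tokenization script). It assumes the Transformers build from the linked PR is installed so the LLaMA tokenizer can be loaded via `AutoTokenizer`; the tokenizer path, input file name, and `max_seq_length` value are illustrative.

```python
import json

from transformers import AutoTokenizer

# Illustrative values; the real script takes these as command-line arguments.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama_tokenizer")  # assumed local tokenizer dir
max_seq_length = 2048

# Read a JSONL file where each row carries the document under the "text" key,
# tokenize each document, and concatenate all token ids into one long stream.
all_token_ids = []
with open("data.jsonl") as f:  # hypothetical input file
    for line in f:
        text = json.loads(line)["text"]
        all_token_ids.extend(tokenizer.encode(text))

# Slice the stream into fixed-length chunks of max_seq_length tokens,
# dropping any final partial chunk.
chunks = [
    all_token_ids[i : i + max_seq_length]
    for i in range(0, len(all_token_ids) - max_seq_length + 1, max_seq_length)
]
```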