# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import json
import logging
import os
import random
import warnings  # needed by the `position_ids` deprecation warning in `forward`
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union

import datasets
import numpy as np
import torch
import transformers
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, LayerNorm, MSELoss
from torch.utils.data import Dataset
from transformers import BloomForCausalLM, LlamaTokenizer, Seq2SeqTrainer, Seq2SeqTrainingArguments
from transformers.modeling_outputs import (
    BaseModelOutputWithPastAndCrossAttentions,
    CausalLMOutputWithCrossAttentions,
    QuestionAnsweringModelOutput,
    SequenceClassifierOutputWithPast,
    TokenClassifierOutput,
)

import utils

# import evaluate
# metric = evaluate.load("rouge")

IGNORE_INDEX = -100  # label value skipped by CrossEntropyLoss; also used to mask prompt tokens
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "</s>"
DEFAULT_UNK_TOKEN = "</s>"


@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    ref_path: Optional[str] = field(default="facebook/opt-125m")  # frozen reference model for the KL term
    tokenizer_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    temperature: Optional[float] = field(default=1.0)  # divides the exponentiated passage scores before the ranking loss
    top: int = field(default=24)  # number of final transformer blocks kept trainable when w_frozen is True
    w_frozen: Optional[bool] = field(default=True)  # freeze everything except the top `top` blocks


@dataclass
class DataArguments:
    train_data_path: str = field(default=None, metadata={"help": "Path to the training data."})
    train_group_size: int = field(default=-1)  # passages per query: 1 positive + (train_group_size - 1) negatives
    len_query: int = field(default=64)  # max query length in tokens
    len_doc: int = field(default=438)  # max passage length in tokens


@dataclass
class TrainingArguments(transformers.Seq2SeqTrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=2048,
        metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
    )


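# The `forward` below is bound onto the causal LM in `train()` via `functools.partial`,
# replacing the model's own forward pass. Besides the usual transformer pass it combines
# three losses:
#   1. a listwise ranking loss: the length-normalized query log-likelihood of each passage
#      in a group (1 positive + `train_group_size` - 1 negatives) is used as its score, and
#      cross-entropy pushes the positive passage (index 0 of every group) to the top;
#   2. a generation loss: token-level cross-entropy on the positive passage (`labels_gen`);
#   3. a KL term that keeps the log-probabilities close to the frozen reference model
#      loaded from `ref_path` (`self.init_model`).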
def forward(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
    attention_mask: Optional[torch.Tensor] = None,
    head_mask: Optional[torch.Tensor] = None,
    inputs_embeds: Optional[torch.Tensor] = None,
    labels: Optional[torch.Tensor] = None,
    labels_gen: Optional[torch.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    **deprecated_arguments,
) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
    r"""
    labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
        Labels for the ranking loss. Indices are selected in `[-100, 0, ..., config.vocab_size]`; all labels set
        to `-100` are ignored (masked), and only labels in `[0, ..., config.vocab_size]` are scored. Unlike the
        stock `BloomForCausalLM.forward`, labels are **not** shifted here: `DataCollatorForSupervisedDataset`
        already drops the last input token and the first label token so logits and labels stay aligned.
    labels_gen (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
        Unmasked labels used for the generation loss on the positive passage of each group.
    """
    if deprecated_arguments.pop("position_ids", False) is not False:
        # `position_ids` could have been `torch.Tensor` or `None`, so defaulting pop to `False` allows us to
        # detect whether users were explicitly passing `None`.
        warnings.warn(
            "`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. You can safely ignore"
            " passing `position_ids`.",
            FutureWarning,
        )
    if len(deprecated_arguments) > 0:
        raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")

    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    transformer_outputs = self.transformer(
        input_ids,
        past_key_values=past_key_values,
        attention_mask=attention_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    hidden_states = transformer_outputs[0]

    lm_logits = self.lm_head(hidden_states)

    # Logits of the frozen reference model, used only for the KL regularizer.
    with torch.no_grad():
        init_lm_logits = self.init_model(input_ids=input_ids, attention_mask=attention_mask)[0]

    loss = None
    if labels is not None:
        # move labels to the correct device to enable model parallelism
        device = lm_logits.device
        labels = labels.to(device)
        labels_gen = labels_gen.to(device)
        # Mask of query-token positions (labels != IGNORE_INDEX).
        indexs = (labels != IGNORE_INDEX).long()
        # Replace ignored positions with a dummy token id so `gather` stays valid.
        label_no_ignore = torch.where(labels == IGNORE_INDEX, torch.ones(labels.shape).long().to(device), labels)

        preds = torch.nn.functional.log_softmax(lm_logits, dim=-1)  # (B*G, L, V)
        logprobs = torch.gather(preds, -1, label_no_ignore.unsqueeze(dim=-1)).squeeze(dim=-1)  # (B*G, L)
        # Length-normalized query log-likelihood per passage.
        scores = (logprobs * indexs).sum(dim=-1) / indexs.sum(dim=-1)  # (B*G,)

        # ranking loss: the positive passage sits at index 0 of every group
        scores = torch.exp(scores).view(-1, self.train_group_size) / self.temperature  # (B, G)
        target_label = torch.zeros(scores.shape[0], dtype=torch.long).to(device)
        loss1 = self.cross_entropy(scores, target_label)

        # generation loss on the positive passage only
        _, seq_length, vocab_size = lm_logits.shape
        pos_labels = labels_gen.view(-1, self.train_group_size, seq_length)[:, 0]  # (B, L)
        pos_lm_logits = lm_logits.view(-1, self.train_group_size, seq_length, vocab_size)[:, 0]
        loss2 = self.cross_entropy(
            pos_lm_logits.reshape(-1, vocab_size), pos_labels.reshape(-1)
        )

        # KL divergence to the frozen reference model
        loss3 = self.kl_loss(
            input=preds.reshape([-1, vocab_size]),
            target=init_lm_logits.softmax(dim=-1).reshape([-1, vocab_size]),
        )

        loss = loss1 + loss2 + loss3

    if not return_dict:
        output = (lm_logits,) + transformer_outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return CausalLMOutputWithCrossAttentions(
        loss=loss,
        logits=lm_logits,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )

def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
    """Collects the state dict and dumps it to disk."""
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict
        trainer._save(output_dir, state_dict=cpu_state_dict)  # noqa

def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

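# Expected training data: a JSON/JSONL file (loaded with `datasets.load_dataset("json", ...)`)
# whose records carry a query plus positive and negative passages, each passage with a "text"
# field. A minimal sketch of one record (the field values are illustrative, not from this repo):
#
#   {
#     "query": "what is the capital of france",
#     "positive_passages": [{"text": "Paris is the capital and largest city of France."}],
#     "negative_passages": [{"text": "Lyon is a city in France."}, {"text": "..."}]
#   }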
class SupervisedDataset(Dataset):
    def __init__(self, data, train_group_size, tokenizer, len_query, len_doc):
        self.data = data
        self.train_group_size = train_group_size
        self.tokenizer = tokenizer
        self.len_query = len_query
        self.len_doc = len_doc

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        ex = self.data[idx]
        all_qd = []

        # Sample train_group_size - 1 negatives (with replacement only if there are too few).
        if len(ex['negative_passages']) < self.train_group_size - 1:
            all_qd = random.choices(ex['negative_passages'], k=self.train_group_size - 1)
        else:
            all_qd = random.sample(ex['negative_passages'], self.train_group_size - 1)

        # The positive passage is always placed first in the group.
        all_qd = [random.choice(ex['positive_passages'])] + all_qd

        def truncation(text, length):
            # Truncate by round-tripping through the tokenizer (`truncation=True` makes the cut explicit).
            text = self.tokenizer.decode(
                self.tokenizer.encode(text, max_length=length, truncation=True, add_special_tokens=False)
            )
            return text

        query = truncation(ex['query'], self.len_query).replace(self.tokenizer.pad_token, 'PAD')
        all_doc = [truncation(qd['text'], self.len_doc).replace(self.tokenizer.pad_token, 'PAD') for qd in all_qd]

        input_prompt = 'Document: {passage} Query:'

        sources = [input_prompt.format(passage=doc) for doc in all_doc]
        targets = [query for _ in sources]

        # Preprocess the data by tokenizing.
        examples = [s + t for s, t in zip(sources, targets)]
        examples_tokenized, sources_tokenized = [self._tokenize_fn(strings) for strings in (examples, sources)]
        input_ids = examples_tokenized["input_ids"]
        labels = copy.deepcopy(input_ids)
        labels_gen = copy.deepcopy(input_ids)
        # Mask prompt tokens in `labels`; `labels_gen` stays unmasked for the generation loss.
        for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
            label[:source_len] = IGNORE_INDEX
        assert len(input_ids) == len(labels)

        return dict(input_ids=input_ids, labels=labels, labels_gen=labels_gen)

    def _tokenize_fn(self, strings: Sequence[str]) -> Dict:
        """Tokenize a list of strings."""
        tokenized_list = [
            self.tokenizer(
                text,
                return_tensors="pt",
                padding="longest",
                max_length=self.tokenizer.model_max_length,
                truncation=True,
            )
            for text in strings
        ]
        input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
        input_ids_lens = labels_lens = [
            tokenized.input_ids.ne(self.tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list
        ]
        return dict(
            input_ids=input_ids,
            labels=labels,
            input_ids_lens=input_ids_lens,
            labels_lens=labels_lens,
        )

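# Each dataset item is a dict of three equal-length lists (`input_ids`, `labels`, `labels_gen`),
# one tokenized sequence per passage in the group, with the positive passage always first.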
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels, labels_gen = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels", "labels_gen")
        )
        # Flatten the per-example groups into a single batch dimension.
        input_ids = [item for sublist in input_ids for item in sublist]
        labels = [item for sublist in labels for item in sublist]
        labels_gen = [item for sublist in labels_gen for item in sublist]

        # Drop the last input token; the matching first label token is dropped below.
        for index in range(len(input_ids)):
            input_ids[index] = input_ids[index][:-1]

        input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
        labels_gen = torch.nn.utils.rnn.pad_sequence(labels_gen, batch_first=True, padding_value=IGNORE_INDEX)

        labels = labels[..., 1:].contiguous()  # (B*G, L-1)
        labels_gen = labels_gen[..., 1:].contiguous()  # (B*G, L-1)
        return dict(
            input_ids=input_ids,
            labels=labels,
            labels_gen=labels_gen,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

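# The collator flattens the per-item groups, so a batch of B dataset items becomes tensors of
# shape (B * train_group_size, L - 1), where L is the longest tokenized sequence in the batch.
# The next-token shift is applied there (drop the last input token, drop the first label token)
# because the custom `forward` above compares logits and labels position by position.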
def train():
    parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    training_args.predict_with_generate = True

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
    model.bsz = training_args.per_device_train_batch_size
    model.train_group_size = data_args.train_group_size
    model.cross_entropy = torch.nn.CrossEntropyLoss(reduction='mean')
    model.kl_loss = torch.nn.KLDivLoss(reduction="batchmean")

    model.temperature = model_args.temperature

    # Frozen reference model used only for the KL regularizer in the custom forward.
    model.init_model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.ref_path,
        cache_dir=training_args.cache_dir,
    ).eval()

    if model_args.w_frozen:
        # Parameter-efficient fine-tuning: freeze everything, then unfreeze only the top `top`
        # transformer blocks (`transformer.h` assumes a BLOOM-style architecture).
        for name, param in model.named_parameters():
            param.requires_grad = False

        for name, param in model.transformer.h[-1 * model_args.top:].named_parameters():
            param.requires_grad = True

    # Replace the stock forward with the ranking + generation + KL forward defined above.
    from functools import partial
    model.forward = partial(forward, model)

    if 'llama' in model_args.tokenizer_name_or_path.lower():
        tokenizer = LlamaTokenizer.from_pretrained(
            model_args.tokenizer_name_or_path,
            cache_dir=training_args.cache_dir,
            model_max_length=training_args.model_max_length,
            padding_side="right",
            use_fast=False,
        )
    else:
        tokenizer = transformers.AutoTokenizer.from_pretrained(
            model_args.tokenizer_name_or_path,
            cache_dir=training_args.cache_dir,
            model_max_length=training_args.model_max_length,
            padding_side="right",
            use_fast=False,
        )

    if tokenizer.pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if "llama" in model_args.model_name_or_path.lower():
        tokenizer.add_special_tokens(
            {
                "eos_token": DEFAULT_EOS_TOKEN,
                "bos_token": DEFAULT_BOS_TOKEN,
                "unk_token": DEFAULT_UNK_TOKEN,
            }
        )

    data = datasets.load_dataset('json', data_files=data_args.train_data_path)['train']

    train_dataset = SupervisedDataset(
        data=data,
        train_group_size=data_args.train_group_size,
        tokenizer=tokenizer,
        len_query=data_args.len_query,
        len_doc=data_args.len_doc,
    )
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

    trainer = Seq2SeqTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=train_dataset, data_collator=data_collator)
    trainer.train()
    trainer.save_state()
    safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)

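# Example launch (a sketch: the script name, model checkpoints, data path, and hyperparameters
# below are placeholders, not values prescribed by this code):
#
#   torchrun --nproc_per_node=8 train.py \
#       --model_name_or_path bigscience/bloom-7b1 \
#       --ref_path bigscience/bloom-7b1 \
#       --tokenizer_name_or_path bigscience/bloom-7b1 \
#       --train_data_path ./data/train.jsonl \
#       --train_group_size 8 \
#       --per_device_train_batch_size 1 \
#       --num_train_epochs 1 \
#       --output_dir ./output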
if __name__ == "__main__":
    train()