[BUG] #107

Open
kotran88 opened this issue May 30, 2023 · 2 comments

Labels: bug (Something isn't working)

Comments


kotran88 commented May 30, 2023

🐛 Bug

I am working from the basic sentiment classification example and the example linked below, searching for fixes and correcting errors as I go:
https://www.dinolabs.ai/271

I am running this on Colab.

!pip install mxnet
!pip install gluonnlp pandas tqdm
!pip install sentencepiece==0.1.91
!pip install transformers==4.8.2
!pip install torch
!pip install gluonnlp==0.10.0

!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook
from kobert_tokenizer import KoBERTTokenizer
from transformers import BertModel
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
from tqdm.notebook import tqdm
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel = BertModel.from_pretrained('skt/kobert-base-v1', return_dict=False)
vocab = nlp.vocab.BERTVocab.from_sentencepiece(tokenizer.vocab_file, padding_token='[PAD]')

def get_kobert_model(model_path, vocab_file, ctx="cpu"):
    bertmodel = BertModel.from_pretrained(model_path)
    device = torch.device(ctx)
    bertmodel.to(device)
    bertmodel.eval()
    vocab_b_obj = nlp.vocab.BERTVocab.from_sentencepiece(vocab_file,
                                                         padding_token='[PAD]')
    return bertmodel, vocab_b_obj
bertmodel, vocab = get_kobert_model('skt/kobert-base-v1',tokenizer.vocab_file)

from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
chatbot_data = pd.read_excel('drive/MyDrive/korean.xlsx')

from sklearn.model_selection import train_test_split
dataset_train, dataset_test = train_test_split(data_list, test_size=0.25, random_state=0)
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]
        
    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

The error occurs in the last part of the code.


TypeError                                 Traceback (most recent call last)
<ipython-input-60-1574bbdbfa0b> in <cell line: 2>()
      1 tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
----> 2 data_train = BERTDataset(dataset_train, 0, 1, tok, max_len, True, False)
      3 data_test = BERTDataset(dataset_test, 0, 1, tok, max_len, True, False)

10 frames
/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py in LoadFromFile(self, arg)
    308 
    309     def LoadFromFile(self, arg):
--> 310         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    311 
    312     def _EncodeAsIds(self, text, enable_sampling, nbest_size, alpha, add_bos, add_eos, reverse, emit_unk_piece):

TypeError: not a string
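
A note on the traceback above: the failing call is `SentencePieceProcessor.LoadFromFile`, and the message suggests it was handed something other than a string path. In the posted code, `nlp.data.BERTSPTokenizer` receives the `KoBERTTokenizer` object itself as its first argument, while the earlier `BERTVocab.from_sentencepiece` call receives `tokenizer.vocab_file`. One possible reading, not confirmed anywhere in this thread, is that the tokenizer object is what reaches SentencePiece. The sketch below only illustrates that kind of type check in isolation. `resolve_model_path` and `FakeTokenizer` are hypothetical names, and nothing here calls gluonnlp or sentencepiece.

```python
import os


def resolve_model_path(source):
    """Return a plain path from a path-like or a tokenizer-like object.

    Hypothetical helper: `vocab_file` is the attribute the posted code already
    reads from the KoBERT tokenizer; everything else here is illustrative.
    """
    # A string or path-like object is already what a file loader expects.
    if isinstance(source, (str, os.PathLike)):
        return os.fspath(source)
    # Otherwise, look for a `vocab_file` attribute that holds the path.
    vocab_file = getattr(source, "vocab_file", None)
    if isinstance(vocab_file, (str, os.PathLike)):
        return os.fspath(vocab_file)
    raise TypeError(
        "expected a path string or an object with a 'vocab_file' path, "
        f"got {type(source).__name__}"
    )


class FakeTokenizer:
    """Stand-in for a tokenizer object that knows where its model file is."""

    def __init__(self, vocab_file):
        self.vocab_file = vocab_file


path_from_str = resolve_model_path("spiece.model")
path_from_obj = resolve_model_path(FakeTokenizer("spiece.model"))
```

If that reading is right, passing `tokenizer.vocab_file` instead of `tokenizer` as the first argument would be the first thing to try. This is an assumption and has not been verified against the reporter's environment.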
## To Reproduce
<!-- If you have code samples, error messages, stack traces, or similar, please attach them. -->

Please write the steps needed to reproduce the bug.

1. -
2. -
3. -

## Expected behavior
<!-- Please describe the result you expected when running the code, before the bug appeared. -->

## Environment
Google Colab (TPU runtime)

## Additional context
<!-- Please describe any additional information. -->
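
For anyone trying to reproduce this: `data_list` and `max_len` are used in the snippet above but never defined in it. Judging from `BERTDataset(dataset_train, 0, 1, ...)`, each item is indexed as `item[0]` for the sentence and `item[1]` for a label that `np.int32` can convert. The following is a minimal sketch of building such a list, assuming rows of (sentence, label name) pairs. The sample rows, `label_to_id`, and the `max_len` value are all made up for illustration and do not come from the original spreadsheet.

```python
# Hypothetical rows standing in for the spreadsheet contents.
rows = [
    ("I passed the exam today", "joy"),
    ("My flight got cancelled", "anger"),
    ("I miss my old friends", "sadness"),
    ("We won the match", "joy"),
]

# Map each distinct label name to an integer id, in first-seen order.
label_to_id = {}
for _, label in rows:
    label_to_id.setdefault(label, len(label_to_id))

# Each item is [sentence, label_id]: index 0 is the text, index 1 the label.
data_list = [[sentence, label_to_id[label]] for sentence, label in rows]

# Illustrative value only; the original post does not say what max_len was.
max_len = 64
```

A list shaped like this can be passed to `train_test_split` the same way the posted code does.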

kotran88 added the bug label on May 30, 2023
@ChangZero

@kotran88
It may help to take a look at the code below.
https://github.com/ChangZero/koBERT-finetuning-demo/blob/main/kobert_colab.ipynb

Jhyunee commented Mar 29, 2024

Did you manage to solve this? I have not been able to fix the same error either.. ㅠㅠ
