[NLP] Adding tokens to a HuggingFace Tokenizer
Acdong
2023. 2. 27. 15:36
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Register the new token as an additional special token.
new_token = "[NEW]"
tokenizer.add_special_tokens({"additional_special_tokens": [new_token]})

# Grow the model's embedding matrix to match the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
If you skip model.resize_token_embeddings(len(tokenizer)), the ids of the newly added tokens fall outside the model's embedding matrix, so the forward pass fails with an index error in the embedding layer.
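Why the resize is needed can be seen with a plain PyTorch sketch (the vocabulary size of 10 and hidden size of 4 here are hypothetical, just for illustration):

```python
import torch
import torch.nn as nn

# Toy embedding layer standing in for the model's token embedding matrix:
# a vocabulary of 10 tokens, hidden size 4.
old_emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

new_token_id = 10  # id assigned to the newly added token

# Without resizing, looking up the new id is out of range.
try:
    old_emb(torch.tensor([new_token_id]))
    failed = False
except IndexError:
    failed = True

# What resize_token_embeddings does, in essence: allocate a larger matrix
# and copy over the old rows, so existing token vectors are preserved.
new_emb = nn.Embedding(num_embeddings=11, embedding_dim=4)
with torch.no_grad():
    new_emb.weight[:10] = old_emb.weight

vec = new_emb(torch.tensor([new_token_id]))  # now succeeds
```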
+ Addendum
For SBERT (sentence-transformers):
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer(model_path)

# Add the tokens to the underlying HuggingFace tokenizer and resize
# the transformer's embedding matrix to match.
tokens = ["[NEW]"]
embedding_model = model._first_module()
embedding_model.tokenizer.add_tokens(tokens, special_tokens=True)
embedding_model.auto_model.resize_token_embeddings(len(embedding_model.tokenizer))

# Rebuild the SentenceTransformer with a pooling layer on top.
pooling_model = models.Pooling(embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[embedding_model, pooling_model])
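One detail the steps above leave to chance: after resizing, the new embedding rows are randomly initialized. A common heuristic (not part of the original post) is to initialize them to the mean of the existing embeddings so the new token starts close to the learned distribution. A minimal sketch with hypothetical sizes:

```python
import torch

# Stand-in for the resized embedding matrix: 10 original tokens plus
# 1 newly added token, hidden size 4 (hypothetical sizes).
emb = torch.nn.Embedding(num_embeddings=11, embedding_dim=4)
num_old = 10

# Overwrite the new (randomly initialized) row with the mean of the
# original rows.
with torch.no_grad():
    mean_vec = emb.weight[:num_old].mean(dim=0)
    emb.weight[num_old:] = mean_vec
```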