[NLP] Cleaning text data (removing emoji, special characters, URLs, and Hanja)

import re
import emoji
from soynlp.normalizer import repeat_normalize

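# Whitelist: any run of characters other than spaces, basic punctuation, ASCII, and Hangul
# is replaced, which is what drops Hanja and other special symbols.
pattern = re.compile('[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
# Matches http/https URLs, including an optional path and query string.
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')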

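def clean(x):
    x = pattern.sub(' ', x)  # replace non-whitelisted characters (Hanja, symbols) with a space
    x = emoji.replace_emoji(x, replace='')  # remove any remaining emoji
    x = url_pattern.sub('', x)  # remove URLs
    x = x.strip()
    x = repeat_normalize(x, num_repeats=2)  # cap repeated characters at 2 (e.g. ㅋㅋㅋㅋ -> ㅋㅋ)
    return x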
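
To see what each step does to a concrete sentence, here is a minimal sketch that uses only the standard library. The two regular expressions are the ones defined above. `collapse_repeats` and `clean_stdlib` are hypothetical names introduced for this illustration, and `collapse_repeats` is only a simplified stand-in for soynlp's `repeat_normalize`, so its behavior may differ from the real function on edge cases.

```python
import re

# Same whitelist and URL patterns as in the post.
pattern = re.compile(r'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')


def collapse_repeats(text, num_repeats=2):
    """Assumed stand-in for repeat_normalize: cap runs of the same word
    character at num_repeats, then squeeze whitespace runs to one space."""
    text = re.sub(r'(\w)\1{%d,}' % num_repeats,
                  lambda m: m.group(1) * num_repeats, text)
    return re.sub(r'[^\S\r\n]+', ' ', text).strip()


def clean_stdlib(x):
    x = pattern.sub(' ', x)      # emoji and Hanja fall outside the whitelist
    x = url_pattern.sub('', x)   # drop URLs
    x = x.strip()
    return collapse_repeats(x)


sample = '좋아요😊 漢字 https://example.com/path ㅋㅋㅋㅋㅋ'
print(clean_stdlib(sample))  # 좋아요 ㅋㅋ
```

Because emoji and Hanja are not in the whitelist, the first substitution already turns them into spaces. The URL survives that step (it is pure ASCII) and is removed by the second one.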

Reference:

https://github.com/Beomi/KcELECTRA

