Document Classification

Machine Learning Theory and Practice 2022. 4. 25. 13:41

1. Web scraping
2. Preprocessing
   - POS tagging (part-of-speech tagging)
   - Keep only words of selected parts of speech, with stopwords removed
3. Representation (vectorization)
   - Bag of words model
   - TF-IDF
4. Applying ML algorithms to the training data

This post looks at step 3, Representation.

* Bag of words model

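Each word becomes one feature, and the value of each feature is the number of times that word is used in the document.

However, this representation cannot express how important each word is to a document relative to the rest of the corpus.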
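The idea above can be sketched in plain Python. This is only an illustration with two toy documents, not how scikit-learn implements it; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    # Sort the vocabulary so that the column order is deterministic
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        # One column per vocabulary word; absent words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["banana apple apple eggplant",
                               "orange banana grape"])
print(vocab)
print(vectors)
```

Each row is a document and each column is a word, which is the document-term matrix (DTM) layout used in the rest of the post.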

* TF-IDF(term frequency-inverse document frequency)

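TF-IDF measures how well a particular word captures the uniqueness of a particular document.

A high TF-IDF value means that the word is used often in that document but rarely in the other documents.

DF(t) = the number of documents in the corpus that contain the word t (scikit-learn counts every document that contains t, including the current one)

IDF(t) = log((1 + n) / (1 + df(t))) + 1, where n is the total number of documents in the corpus and log is the natural logarithm. This is the smoothed form that scikit-learn's TfidfVectorizer uses by default.

TF-IDF(t, d) = TF(t, d) * IDF(t)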
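To make the IDF formula concrete, here is a small hand computation in plain Python on a four-document toy corpus. The function name `smoothed_idf` is hypothetical and the whitespace tokenization is a simplification.

```python
import math

TEXT = ['banana apple apple eggplant',
        'orange carrot banana eggplant',
        'apple carrot banana banana',
        'orange banana grape']

def smoothed_idf(term, docs):
    """IDF(t) = ln((1 + n) / (1 + df(t))) + 1."""
    n = len(docs)
    # df(t): number of documents that contain the term at least once
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

idf_banana = smoothed_idf('banana', TEXT)  # appears in all 4 documents
idf_grape = smoothed_idf('grape', TEXT)    # appears in only 1 document
print(idf_banana, idf_grape)
```

A word that appears in every document ('banana') gets the minimum IDF of 1, while a rare word ('grape') gets a larger weight.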

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer  # frequency-based DTM
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF-based DTM

TEXT = ['banana apple apple eggplant',
        'orange carrot banana eggplant',
        'apple carrot banana banana',
        'orange banana grape'
]

# using frequency based DTM for the data analysis

tf_vectorizer = CountVectorizer()
tf_features = tf_vectorizer.fit_transform(TEXT)

# using tf-idf based DTM for the data analysis

tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(TEXT)

# check the features of the matrix

# toarray() returns an ndarray (todense() returns a matrix)

features = tf_features.toarray()
features

# check feature names

# vectorizer.get_feature_names_out() returns feature names

feature_names = tf_vectorizer.get_feature_names_out()
feature_names

# view the features as a DataFrame

df = pd.DataFrame(data=features, columns=feature_names)
print(df)

# Euclidean distance (a smaller value means the documents are more similar)

np.linalg.norm(features[1]-features[0])

np.linalg.norm(features[1]-features[2])

# Cosine similarity (a larger value means the documents are more similar)

np.dot(features[0], features[1])/(np.linalg.norm(features[0])*np.linalg.norm(features[1]))

np.dot(features[0], features[2])/(np.linalg.norm(features[0])*np.linalg.norm(features[2]))
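The two measures can also be written as small plain-Python functions, which makes the formulas explicit. This sketch assumes the count vectors for the first three toy documents with columns in alphabetical order (apple, banana, carrot, eggplant, grape, orange); the function names are made up for illustration.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumed columns: apple, banana, carrot, eggplant, grape, orange
doc0 = [2, 1, 0, 1, 0, 0]  # 'banana apple apple eggplant'
doc1 = [0, 1, 1, 1, 0, 1]  # 'orange carrot banana eggplant'
doc2 = [1, 2, 1, 0, 0, 0]  # 'apple carrot banana banana'

print(euclidean_distance(doc1, doc0), euclidean_distance(doc1, doc2))
print(cosine_similarity(doc0, doc1), cosine_similarity(doc0, doc2))
```

Euclidean distance depends on the raw counts, while cosine similarity only compares direction, so it is less sensitive to document length.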

4. Applying ML Algorithms for training data

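For sentiment analysis, there are two main approaches.

Lexicon-based method (uses a sentiment dictionary)
- For Korean, its use is limited because no high-quality Korean sentiment lexicon is available.
Supervised learning method
- Learns from labeled documents how likely a document is to be positive or negative given the words it contains.
- Applies those probabilities to unlabeled documents to estimate their labels.
- Logistic regression, support vector machines, decision trees, Naive Bayes, and neural network models can all be used.
- Decision trees are not used here because their performance tends to degrade when the number of features is large.
- SVMs are not used here because building the model takes a long time and the performance is comparatively poor.
- Logistic regression or neural network models (CNN, LSTM, attention-based Transformers such as BERT) are mainly used.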
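As a rough sketch of the supervised idea, the snippet below scores a document the way a fitted logistic regression would: a weighted sum of word counts passed through a sigmoid. The weights, bias, and function names are invented for illustration and are not learned from any data.

```python
import math

# Hypothetical per-word weights (a real model learns these from labeled data)
WEIGHTS = {"fun": 2.0, "great": 1.5, "boring": -2.0}
BIAS = 0.0

def positive_probability(text):
    """Sigmoid of the bias plus the summed weights of the words in the text."""
    score = BIAS + sum(WEIGHTS.get(word, 0.0) for word in text.split())
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(text, threshold=0.5):
    """Return 1 (positive) if the probability reaches the threshold, else 0."""
    return 1 if positive_probability(text) >= threshold else 0

print(positive_probability("fun great movie"), predict_label("fun great movie"))
print(positive_probability("boring plot"), predict_label("boring plot"))
```

Words without a weight contribute nothing to the score, so a document made only of unknown words lands at a probability of 0.5 with this bias.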

import pandas as pd
import numpy as np

# with open: the file is closed automatically when the with block exits

# strip(): removes whitespace at the start and end of a string

# split('\t'): splits the string on tab characters

## list: [1, 2, 3] tuple: (1, 2, 3) dictionary {'age': 33, 'job': 'programmer'}

## refer to https://bskyvision.com/854

with open('Korean_movie_reviews_2016.txt', encoding='utf-8') as f:
    docs = [doc.strip().split('\t') for doc in f]

[['부산 행 때문 너무 기대하고 봤', '0'],
 ['한국 좀비 영화 어색하지 않게 만들어졌 놀랍', '1'],
 ['조금 전 보고 왔 지루하다 언제 끝나 이 생각 드', '0']]

# keep only well-formed rows with exactly two columns (text, label) and convert the label to an int
docs = [(doc[0], int(doc[1])) for doc in docs if len(doc) == 2]

[('부산 행 때문 너무 기대하고 봤', 0),
 ('한국 좀비 영화 어색하지 않게 만들어졌 놀랍', 1),
 ('조금 전 보고 왔 지루하다 언제 끝나 이 생각 드', 0)]
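The read-and-filter steps can be tried without the review file. The sketch below uses an in-memory stand-in with made-up English lines in the same "text, tab, label" layout, including one malformed row to show what the length check removes.

```python
import io

# Hypothetical stand-in for the review file: "<text>\t<label>" per line
sample_file = io.StringIO(
    "great acting and story\t1\n"
    "too long and boring\t0\n"
    "line without a label\n"
)

# strip() drops the trailing newline, split('\t') separates text from label
rows = [line.strip().split("\t") for line in sample_file]

# Keep only rows with exactly two columns and convert the label to int
docs = [(row[0], int(row[1])) for row in rows if len(row) == 2]
print(docs)
```

The third line has no tab, so it splits into a single-element list and is dropped before `int()` could fail on it.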

texts, labels = zip(*docs)
# split the pairs into two separate variables (zip returns tuples, not lists)

# check the proportion of observations whose label is 1

sum(labels)/len(labels)
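A tiny example of the same unzip-and-average pattern, using made-up review pairs:

```python
# Hypothetical (text, label) pairs
docs = [("great acting", 1), ("too boring", 0),
        ("really fun", 1), ("fell asleep", 0)]

# zip(*docs) transposes the list of pairs into a tuple of texts and a tuple of labels
texts, labels = zip(*docs)

# Because labels are 0 or 1, their mean is the share of positive reviews
positive_ratio = sum(labels) / len(labels)
print(texts, labels, positive_ratio)
```

Checking this ratio first shows whether the classes are balanced, which affects how an accuracy figure should be read.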

from sklearn.model_selection import train_test_split
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.1, random_state=0)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# note: the vectorizer fitted on the train data is applied as-is to the test data (transform only, no refitting)

tf_vectorizer = CountVectorizer()
tf_train_features = tf_vectorizer.fit_transform(train_texts)
tf_test_features = tf_vectorizer.transform(test_texts)
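Why fit on the training texts only can be seen in a plain-Python sketch. The two helper functions are hypothetical simplifications, not scikit-learn's code: the vocabulary is fixed by the training data, and test words outside it are simply ignored.

```python
def fit_vocabulary(texts):
    """Map each distinct training word to a column index."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: index for index, word in enumerate(words)}

def transform(texts, vocabulary):
    """Count words per text using the fixed vocabulary; unknown words are skipped."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word in text.split():
            index = vocabulary.get(word)
            if index is not None:
                row[index] += 1
        rows.append(row)
    return rows

train = ["good movie", "bad movie"]
test = ["good unseen movie"]

vocabulary = fit_vocabulary(train)            # learned from train only
train_features = transform(train, vocabulary)
test_features = transform(test, vocabulary)   # same columns as train
print(vocabulary, train_features, test_features)
```

Both matrices end up with the same number of columns, which is what lets a model trained on one be applied to the other.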

# view the words ordered by their column index

# key=lambda x: x[1] sorts the (word, index) pairs of vocabulary_ by index, so vocablist[i] is the word for column i

vocablist = [word for word, _ in sorted(tf_vectorizer.vocabulary_.items(), key=lambda x:x[1])]

vocablist[:10]
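The same sort can be tried on a small hand-written mapping. The dictionary below is a made-up stand-in for `vocabulary_`, which maps each word to its column index.

```python
# Hypothetical word -> column index mapping
vocabulary = {"movie": 2, "bad": 0, "good": 1}

# Sort the (word, index) pairs by index and keep only the words
vocablist = [word for word, _ in sorted(vocabulary.items(), key=lambda x: x[1])]
print(vocablist)
```

After sorting, position i in the list holds the word for column i of the feature matrix.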

from sklearn.linear_model import LogisticRegression

lr_tf = LogisticRegression(max_iter=1000)
lr_tf.fit(tf_train_features, train_labels)
pred_labels = lr_tf.predict(tf_test_features)

from sklearn.metrics import accuracy_score
print('Misclassified samples: {} out of {}'.format((pred_labels != test_labels).sum(),len(test_labels)))
print('Accuracy: %.2f' % accuracy_score(test_labels, pred_labels))

Misclassified samples: 1837 out of 16539
Accuracy: 0.89
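The accuracy figure follows directly from the misclassification count. The sketch below defines a hypothetical `accuracy` helper on a toy example and then redoes the arithmetic with the two numbers reported above.

```python
def accuracy(true_labels, pred_labels):
    """Share of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return correct / len(true_labels)

# Toy check: 3 of 4 predictions are right
toy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Reported above: 1837 misclassified out of 16539 test samples
reported = (16539 - 1837) / 16539
print('Toy accuracy: %.2f' % toy)
print('Accuracy: %.2f' % reported)
```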

# Get coefficients of the model
coefficients = lr_tf.coef_.tolist()

sorted_coefficients = sorted(enumerate(coefficients[0]), key=lambda x:x[1], reverse=True)
# each word used in training has its own coefficient (i.e. weight)
# reverse=True sorts the coefficients in descending order

print(sorted_coefficients[:5])
# print top 50 positive words
for idx, coef in sorted_coefficients[:50]:
    print('{0:} ({1:.3f})'.format(vocablist[idx], coef))
# print top 50 negative words
for idx, coef in sorted_coefficients[-50:]:
    print('{0:} ({1:.3f})'.format(vocablist[idx], coef))

[(5803, 7.377382116200149), (35234, 6.776375979162454), (35197, 6.605370152185033), (28723, 6.410074819726493), (40640, 6.231654780306138)]
꿀잼 (7.377)
재밌었 (6.776)
재밌게 (6.605)
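The ranking step can be reproduced on a toy example. The vocabulary and coefficient values below are invented for illustration; only the `sorted(enumerate(...))` pattern matches the code above.

```python
# Hypothetical vocabulary and coefficients, aligned by column index
vocablist = ["bad", "boring", "fun", "great"]
coefficients = [-1.2, -2.5, 3.1, 1.8]

# enumerate() pairs each coefficient with its column index;
# sorting by the coefficient in descending order puts positive words first
ranked = sorted(enumerate(coefficients), key=lambda x: x[1], reverse=True)

top_positive = [(vocablist[i], c) for i, c in ranked[:2]]
top_negative = [(vocablist[i], c) for i, c in ranked[-2:]]
print(top_positive)
print(top_negative)
```

A large positive coefficient pushes the prediction toward label 1 and a large negative one toward label 0, which is why the top-ranked words in the output above are strongly positive expressions.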
