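[Python Machine Learning] 08-8 Document Similarity

2024-05-22 5 min read

Text Analysis

Document Similarity

Document similarity measures how alike one document is to another.

Document similarity metrics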

Cosine Similarity
Jaccard Similarity
Manhattan Distance
Euclidean Distance
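
To make the four metrics concrete, here is a minimal sketch that computes each one on toy inputs. The vectors `a` and `b`, the two token lists, and the helper names are illustrative assumptions, not part of the original exercise. The first two are similarities (larger means more alike), while the last two are distances (smaller means more alike).

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_sim(tokens_a, tokens_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

def manhattan_dist(a, b):
    # Sum of absolute coordinate differences (L1 distance)
    return float(np.sum(np.abs(a - b)))

def euclidean_dist(a, b):
    # Straight-line distance (L2 distance)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Toy inputs (assumed for illustration): b points in the same direction as a
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
tokens_a = "the red pill".split()
tokens_b = "the blue pill".split()

print(cosine_sim(a, b))
print(jaccard_sim(tokens_a, tokens_b))
print(manhattan_dist(a, b))
print(euclidean_dist(a, b))
```

Because `b` is just `a` scaled by 2, the cosine similarity is 1 (up to floating-point rounding) even though both distances are nonzero. That is the sense in which cosine similarity ignores magnitude.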

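Measuring document similarity: cosine similarity

When comparing two vectors, cosine similarity looks at how closely their directions agree rather than at their magnitudes.
In other words, it takes the angle between the two vectors and expresses how similar they are as a number.

Angle between two vectors

Depending on the angle between them, two vectors can be similar, unrelated, or outright opposite.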

[Figure: relationship between two vectors by the angle between them (similar, unrelated, opposite)]
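
The three cases can be checked numerically. The sketch below uses plain Python and four hand-picked 2-D vectors (assumptions chosen for illustration) that are parallel, perpendicular, and opposite to a base vector.

```python
import math

def cosine(v1, v2):
    # cos(theta) = dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)

base = (1.0, 0.0)
same_direction = (3.0, 0.0)   # angle 0 degrees: similar
perpendicular = (0.0, 2.0)    # angle 90 degrees: unrelated
opposite = (-1.0, 0.0)        # angle 180 degrees: opposite

print(cosine(base, same_direction))  # 1.0
print(cosine(base, perpendicular))   # 0.0
print(cosine(base, opposite))        # -1.0
```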

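A feature vector matrix built from word counts or TF-IDF weights contains no negative values, so the cosine similarity between its rows is never negative.
Cosine similarity between documents therefore falls between 0 and 1 (the closer to 1, the more similar).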
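
A quick way to convince yourself of this: with nonnegative entries every term of the dot product is nonnegative, so the cosine cannot drop below 0. The sketch below checks this on a random nonnegative matrix. The matrix is a stand-in assumed for illustration, not real TF-IDF output.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((5, 4))  # 5 "documents", 4 features, all values in [0, 1)

# Scale every row to unit length, then all pairwise dot products are cosines
unit_rows = features / np.linalg.norm(features, axis=1, keepdims=True)
cosines = unit_rows @ unit_rows.T

print(cosines.min() >= 0.0)          # no negative similarity
print(cosines.max() <= 1.0 + 1e-12)  # at most 1, allowing for rounding
```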

$A \cdot B = \|A\| \, \|B\| \cos{\theta}$

Similarity = $\cos{\theta} = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$
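
As a quick worked check of the formula, take two illustrative vectors (chosen for this example, not taken from the dataset), $A = (1, 1, 0)$ and $B = (1, 0, 1)$:

```latex
\begin{aligned}
A \cdot B &= 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 = 1 \\
\|A\| &= \sqrt{1^2 + 1^2 + 0^2} = \sqrt{2}, \qquad \|B\| = \sqrt{1^2 + 0^2 + 1^2} = \sqrt{2} \\
\cos{\theta} &= \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{1}{\sqrt{2} \cdot \sqrt{2}} = \frac{1}{2}
\end{aligned}
```

So the similarity is 0.5, which corresponds to an angle of 60°.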

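scikit-learn cosine_similarity()

sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)

-> Returns a matrix holding the cosine similarity of every pair of documents (each row of X against each row of Y, or X against itself when Y is None).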
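
The three parameters can be tried on a tiny input before moving to real documents. This is a minimal sketch with a made-up 3x2 matrix `X` (an assumption for illustration). It shows the default all-pairs call, a one-against-all call, and `dense_output=False`, which keeps the result sparse when the inputs are sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix: three "documents", two features
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Y=None: every row of X is compared with every row of X
sim_all = cosine_similarity(X)

# X and Y given: the first row against all rows (inputs must stay 2-D)
sim_first = cosine_similarity(X[:1], X)

# Sparse input with dense_output=False: the result stays sparse
sim_sparse = cosine_similarity(csr_matrix(X), dense_output=False)

print(sim_all.shape)
print(sim_first.shape)
print(issparse(sim_sparse))
```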

Measuring document similarity with the Opinion Review dataset

<Hands-on>

import sklearn
print(sklearn.__version__)

1.0.2

import numpy as np

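def cos_similarity(v1, v2):
    # Numerator of the cosine formula: dot product of the two vectors
    dot_product = np.dot(v1, v2)
    # Denominator of the cosine formula: product of the two L2 norms
    l2_norm = np.sqrt(np.sum(np.square(v1))) * np.sqrt(np.sum(np.square(v2)))
    similarity = dot_product / l2_norm

    return similarity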

from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends' ,
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)

(3, 18)
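
The 18 columns are the distinct words that the vectorizer kept. The sketch below rebuilds the same vectorizer and lists its vocabulary in column order. It relies on the default `token_pattern` of `TfidfVectorizer`, which only keeps tokens of two or more characters, so the word "I" in the third sentence is dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends',
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

vect = TfidfVectorizer()
features = vect.fit_transform(doc_list)

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(features.shape)
print(len(vocab))
print(vocab)
```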

feature_vect_dense = feature_vect_simple.todense()
feature_vect_dense

matrix([[0.4155636 , 0.        , 0.4155636 , 0.        , 0.        ,
         0.        , 0.24543856, 0.        , 0.24543856, 0.        ,
         0.        , 0.        , 0.        , 0.4155636 , 0.24543856,
         0.49087711, 0.        , 0.24543856],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.23402865, 0.39624495, 0.23402865, 0.        ,
         0.3013545 , 0.        , 0.39624495, 0.        , 0.23402865,
         0.23402865, 0.39624495, 0.4680573 ],
        [0.        , 0.30985601, 0.        , 0.30985601, 0.30985601,
         0.30985601, 0.18300595, 0.        , 0.18300595, 0.30985601,
         0.23565348, 0.30985601, 0.        , 0.        , 0.18300595,
         0.3660119 , 0.        , 0.3660119 ]])

np.array(feature_vect_dense[0]).reshape(-1,)

array([0.4155636 , 0.        , 0.4155636 , 0.        , 0.        ,
       0.        , 0.24543856, 0.        , 0.24543856, 0.        ,
       0.        , 0.        , 0.        , 0.4155636 , 0.24543856,
       0.49087711, 0.        , 0.24543856])
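
The `np.array(...).reshape(-1,)` step is needed because `todense()` on a SciPy sparse matrix returns a `numpy.matrix`, and a row of a `numpy.matrix` stays two-dimensional. A possible alternative, sketched here on a small made-up `csr_matrix` (an assumption for illustration), is to slice the sparse row first and call `toarray().ravel()`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy sparse matrix standing in for the TF-IDF output
sparse = csr_matrix(np.array([[0.0, 1.0, 2.0],
                              [3.0, 0.0, 0.0]]))

dense = sparse.todense()              # numpy.matrix
row_matrix = dense[0]                 # still 2-D
row_array = sparse[0].toarray().ravel()  # plain 1-D ndarray

print(type(dense).__name__, row_matrix.shape)
print(type(row_array).__name__, row_array.shape)
```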

# The result of TfidfVectorizer's fit_transform() is a sparse matrix, so convert it to a dense matrix.
feature_vect_dense = feature_vect_simple.todense()

# Extract the feature vectors of the first and second sentences
vect1 = np.array(feature_vect_dense[0]).reshape(-1,)
vect2 = np.array(feature_vect_dense[1]).reshape(-1,)

# Compute the cosine similarity of the two sentences from their feature vectors
similarity_simple = cos_similarity(vect1, vect2 )
print('Sentence 1, Sentence 2 cosine similarity: {0:.3f}'.format(similarity_simple))

Sentence 1, Sentence 2 cosine similarity: 0.402

vect1 = np.array(feature_vect_dense[0]).reshape(-1,)
vect3 = np.array(feature_vect_dense[2]).reshape(-1,)
similarity_simple = cos_similarity(vect1, vect3 )
print('Sentence 1, Sentence 3 cosine similarity: {0:.3f}'.format(similarity_simple))

vect2 = np.array(feature_vect_dense[1]).reshape(-1,)
vect3 = np.array(feature_vect_dense[2]).reshape(-1,)
similarity_simple = cos_similarity(vect2, vect3 )
print('Sentence 2, Sentence 3 cosine similarity: {0:.3f}'.format(similarity_simple))

Sentence 1, Sentence 3 cosine similarity: 0.404
Sentence 2, Sentence 3 cosine similarity: 0.456

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0] , feature_vect_simple)
print(similarity_simple_pair)

[[1.         0.40207758 0.40425045]]

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0] , feature_vect_simple[1:])
print(similarity_simple_pair)

[[0.40207758 0.40425045]]

similarity_simple_pair = cosine_similarity(feature_vect_simple , feature_vect_simple)
print(similarity_simple_pair)
print('shape:',similarity_simple_pair.shape)

[[1.         0.40207758 0.40425045]
 [0.40207758 1.         0.45647296]
 [0.40425045 0.45647296 1.        ]]
shape: (3, 3)
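
One detail worth knowing: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every document vector already has length 1 and the cosine similarity reduces to a plain dot product. The sketch below checks this with `linear_kernel`, which computes only the dot products. It assumes the vectorizer is left at its defaults, as in the example above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

doc_list = ['if you take the blue pill, the story ends',
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf = TfidfVectorizer().fit_transform(doc_list)  # rows are L2-normalized by default

# Length of each row vector; expected to be 1 for every document
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()

cos = cosine_similarity(tfidf)
dot = linear_kernel(tfidf)  # dot products only, no normalization step

print(row_norms)
print(np.allclose(cos, dot))
```

If the vectorizer were created with `norm=None`, the two matrices would no longer match and `cosine_similarity` would be the one to use.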

Measuring document similarity with the Opinion Review dataset

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

# Lemmatize each incoming token to its base form.
def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

# Pass this function as the tokenizer argument of TfidfVectorizer to apply lemmatization.
# Takes a sentence and applies: lowercase -> remove punctuation -> word tokenization -> lemmatization.
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
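
To see what the first two steps of `LemNormalize` do, here is a minimal sketch that uses only the standard library. The sample sentence is made up, and a plain whitespace `split()` stands in for `nltk.word_tokenize`. Lemmatization is left out, so plurals such as "rooms" are not reduced here.

```python
import string

# Same translation table as above: every punctuation character maps to None (deleted)
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

text = "The room's bathroom was clean, quiet & bright!"

# Lowercase, then delete punctuation characters
cleaned = text.lower().translate(remove_punct_dict)

# Whitespace split as a simplified stand-in for nltk.word_tokenize
tokens = cleaned.split()

print(tokens)
```

Note that the apostrophe is deleted rather than replaced with a space, so "room's" becomes "rooms".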

import pandas as pd
import glob ,os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

#path = r'C:\Users\q\Text\OpinosisDataset1.0\OpinosisDataset1.0\topics'
path = r'C:\Users\82106\Test_project\PerfectGuide-master\8장\data\OpinosisDataset1.0\OpinosisDataset1.0\topics'
all_files = glob.glob(os.path.join(path, "*.data"))     
filename_list = []
opinion_text = []

for file_ in all_files:
    df = pd.read_table(file_,index_col=None, header=0,encoding='latin1')
    # os.path.basename handles the platform's path separator, unlike splitting on '\\'
    filename_ = os.path.basename(file_)
    filename = filename_.split('.')[0]
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english' , \
                             ngram_range=(1,2), min_df=0.05, max_df=0.85 )
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

document_df.head(51)

	filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	...	0
1	bathroom_bestwestern_hotel_sfo	...	2
2	battery-life_amazon_kindle	...	0
3	battery-life_ipod_nano_8gb	...	0
4	battery-life_netbook_1005ha	...	0
5	buttons_amazon_kindle	...	0
6	comfort_honda_accord_2008	...	1
7	comfort_toyota_camry_2007	...	1
8	directions_garmin_nuvi_255W_gps	...	0
9	display_garmin_nuvi_255W_gps	...	0
10	eyesight-issues_amazon_kindle	...	0
11	features_windows7	...	0
12	fonts_amazon_kindle	...	0
13	food_holiday_inn_london	...	2
14	food_swissotel_chicago	...	2
15	free_bestwestern_hotel_sfo	...	2
16	gas_mileage_toyota_camry_2007	...	1
17	interior_honda_accord_2008	...	1
18	interior_toyota_camry_2007	...	1
19	keyboard_netbook_1005ha	...	0
20	location_bestwestern_hotel_sfo	...	2
21	location_holiday_inn_london	...	2
22	mileage_honda_accord_2008	...	1
23	navigation_amazon_kindle	...	0
24	parking_bestwestern_hotel_sfo	...	2
25	performance_honda_accord_2008	...	1
26	performance_netbook_1005ha	...	0
27	price_amazon_kindle	...	0
28	price_holiday_inn_london	...	2
29	quality_toyota_camry_2007	...	1
30	rooms_bestwestern_hotel_sfo	...	2
31	rooms_swissotel_chicago	...	2
32	room_holiday_inn_london	...	2
33	satellite_garmin_nuvi_255W_gps	...	0
34	screen_garmin_nuvi_255W_gps	...	0
35	screen_ipod_nano_8gb	...	0
36	screen_netbook_1005ha	...	0
37	seats_honda_accord_2008	...	1
38	service_bestwestern_hotel_sfo	...	2
39	service_holiday_inn_london	...	2
40	service_swissotel_hotel_chicago	...	2
41	size_asus_netbook_1005ha	...	0
42	sound_ipod_nano_8gb	headphone jack i got a clear case for it a...	0
43	speed_garmin_nuvi_255W_gps	...	0
44	speed_windows7	...	0
45	staff_bestwestern_hotel_sfo	...	2
46	staff_swissotel_chicago	...	2
47	transmission_toyota_camry_2007	...	1
48	updates_garmin_nuvi_255W_gps	...	0
49	video_ipod_nano_8gb	...	0
50	voice_garmin_nuvi_255W_gps	...	0

hotel_indexes = document_df[document_df['cluster_label']==2].index
hotel_indexes

Int64Index([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46], dtype='int64')

from sklearn.metrics.pairwise import cosine_similarity

# Documents with cluster_label=2 were clustered as hotel reviews. Extract their index from the DataFrame.
hotel_indexes = document_df[document_df['cluster_label']==2].index
print('DataFrame index of documents clustered as hotel:', hotel_indexes)

# Take the first document in the hotel cluster and show its file name.
comparison_docname = document_df.iloc[hotel_indexes[0]]['filename']
print('##### Reference document', comparison_docname, 'vs. other documents: similarity #####')

# Use the Index object extracted from document_df to select the hotel-cluster rows of feature_vect,
# then measure the cosine similarity between the first hotel document and the other hotel documents.
similarity_pair = cosine_similarity(feature_vect[hotel_indexes[0]] , feature_vect[hotel_indexes])
print(similarity_pair)

DataFrame index of documents clustered as hotel: Int64Index([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46], dtype='int64')
##### Reference document bathroom_bestwestern_hotel_sfo vs. other documents: similarity #####
[[1.         0.0430688  0.05221059 0.06189595 0.05846178 0.06193118
  0.03638665 0.11742762 0.38038865 0.32619948 0.51442299 0.11282857
  0.13989623 0.1386783  0.09518068 0.07049362]]

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Use argsort() to get the indexes sorted by descending similarity to the first document, excluding the document itself.
sorted_index = similarity_pair.argsort()[:,::-1]
sorted_index = sorted_index[:, 1:]

# Reorder hotel_indexes by descending similarity
hotel_sorted_indexes = hotel_indexes[sorted_index.reshape(-1)]

# Sort the similarity values in descending order, again excluding the document itself
hotel_1_sim_value = np.sort(similarity_pair.reshape(-1))[::-1]
hotel_1_sim_value = hotel_1_sim_value[1:]

# Plot file names against similarity values as a bar chart, using the sorted indexes and values
hotel_1_sim_df = pd.DataFrame()
hotel_1_sim_df['filename'] = document_df.iloc[hotel_sorted_indexes]['filename']
hotel_1_sim_df['similarity'] = hotel_1_sim_value
print('File name and similarity of the most similar document:\n', hotel_1_sim_df.iloc[0, :])

sns.barplot(x='similarity', y='filename',data=hotel_1_sim_df)
plt.title(comparison_docname)

File name and similarity of the most similar document:
 filename      room_holiday_inn_london
similarity                   0.514423
Name: 32, dtype: object

Text(0.5, 1.0, 'bathroom_bestwestern_hotel_sfo')

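[Figure: horizontal bar chart of similarity (x-axis) by filename (y-axis), titled bathroom_bestwestern_hotel_sfo]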
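
The ranking step is easier to follow on a small array. The sketch below uses made-up similarity values and document ids (assumptions for illustration, not the real output above). It also takes a slightly different route from the code above: the same sorted index is reused for both the ids and the similarity values, so the two stay aligned without a separate `np.sort`.

```python
import numpy as np

# Toy 1 x N similarity row; position 0 is the reference document itself
similarity_pair = np.array([[1.0, 0.04, 0.38, 0.51, 0.12]])
doc_ids = np.array([1, 13, 30, 32, 38])

# argsort() is ascending, so reverse it, then drop the first entry (the document itself)
order = similarity_pair.argsort()[:, ::-1][:, 1:].reshape(-1)

ranked_ids = doc_ids[order]
ranked_sims = similarity_pair.reshape(-1)[order]

print(ranked_ids)   # [32 30 38 13]
print(ranked_sims)  # [0.51 0.38 0.12 0.04]
```

Dropping the first entry assumes the reference document has the highest similarity in the row, which holds here because its similarity to itself is 1.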


Sieun Kim
