[Python 머신러닝] 08-3 Bag of Words - BOW

2024-05-19 11 분 소요

텍스트 분석

텍스트 피처 벡터화 유형

BOW
Word Embedding (Word2Vec)

Bag of Words - BOW

Bag of Words 모델은 문서가 가지는 모든 단어(Words)를 문맥이나 순서를 무시하고 일괄적으로 단어에 대한 빈도 값을 부여해 피처 값을 추출하는 모델이다.
문서 내 모든 단어를 한꺼번에 봉투(Bag) 안에 넣은 뒤에 흔들어서 섞는다는 의미로 Bag of Words(BOW) 모델이라고 한다.

Bag-of-Words 구조

문장 1: 'My wife likes to watch baseball games and my daughter likes to watch baseball games too'   

문장 2 : 'My wife lies to play baseball'

문장 1과 문장 2에 있는 모든 단어에서 중복을 제거하고 각 단어(feature 똔느 term)를 칼럼 형태로 나열한다.
그러고 나서 각 단어에 고유의 인덱스를 다음과 같이 부여한다.
‘and’: 0, ‘baseball’: 1, ‘daughter’: 2, ‘games’: 3, ‘likes’: 4, ‘my’: 5, ‘play’: 6, ‘to’: 7, ‘too’: 8, ‘watch’: 9, ‘wife’: 10
문장에서 해당 단어가 나타나는 횟수(Occurrence)를 각 단어(단어 인덱스)에 기재한다.
예를 들어 baseball은 문장 1, 2에서 총 2번 나타나며, daughter는 문장 1에서만 1번 나타난다.

BOW 장단점

장점
- 쉽고 빠른 구축
- 예상보다 문서의 특징을 잘 나타내어 전통적으로 여러 분야에서 활용도가 높음
단점
- 문맥 의미(Semantic Context) 반영 문제
- 희소 행렬 문제

BOW 피처 벡터화

M개의 Text 문서들 -> M $\times$ N 피처 벡터화

BOW 피처 벡터화 유형

단순 카운트 기반의 벡터화:
단어 피처에 값을 부여할 때 각 문서에서 해당 단어가 나타나는 횟수(Count)를 부여하는 경우를 카운트 벡터화라고 한다.
카운트 벡터화에서는 카운트 값이 높을수록 중요한 단어로 인식된다.
TF-IDF 벡터화:
카운트만 부여할 경우 그 문서의 특징을 나타내기보다는 언어의 특성상 문장에서 자주 사용될 수밖에 없는 단어까지 높은 값을 부여하게 된다.
이러한 문제를 보완하기 위해 TF-IDF(Tearm Frequency Inverse Document Frequency) 벡터화를 사용한다.
TF-IDF는 개별 문서에서 자주 나타나는 단어에 높은 가중치를 주되, 모든 문서에서 전반적으로 자주 나타나는 단어에 대해서는 페널티를 주는 방식으로 값을 부여한다.

TF-IDF(Tearm Frequency Inverse Document Frequency)

특정 단어가 다른 문서에는 나타나지 않고 특정 문서에서만 자주 사용된다면 해당 단어는 해당 문서를 잘 특징짓는 중요 단어일 가능성이 높음
특정 단어가 매우 많은 여러 문서에서 빈번히 나타난다면 해당 단어는 개별 문서를 특정짓는 정보로소의 의미를 상실

TF(Term Frequency): 문서에서 해당 단어가 얼마나 나왔는지를 나타내는 지표

DF(Document Frequency): 해당 단어가 몇 개의 문서에서 나타났는지를 나타내는 지표

IDF(Inverse Document Frequency): DF의 역수로서 전체 문서수/DF

$TFIDF_i = TF_i \times \log{N \over DF_i}$
($TF_i$ = 개별 문서에서의 단어 $i$ 빈도
$DF_i$ = 단어 $i$를 가지고 있는 문서 개수
$N$ = 전체 문서 개수)

사이킷런의 Count 및 TF-IDF 벡터화 구현: CountVectorizer, TfidfVectorizer

사이킷런 CountVectorizer 초기화 파라미터

파라미터 명	파라미터 설명
max_df	전체 문서에 걸쳐서 너무 높은 빈도수를 가지는 단어 피처를 제외하기 위한 파라미터이다. 너무 높은 빈도수를 가지는 단어는 스톱 워드와 비슷한 문법적인 특성으로 반복적인 단어일 가능성이 높기에 이를 제거하기 위해 사용된다. max_df = 100과 같이 정수 값을 가지면 전체 문서에 걸쳐 100개 이하로 나타나는 단어만 피처로 추출한다. max_df = 0.95와 같이 부동소수점 값(0.0~1.0)을 가지면 전체 문서에 걸쳐 빈도수 0~95%까지의 단어만 피처로 추출하고 나머지 상위 5%는 피처로 추출하지 않는다.
min_df	전체 문서에 걸쳐서 너무 낮은 빈도수를 가지는 단어 피처를 제외하기 위한 파라미터이다. 수백~수천 개의 전체 문서에서 특정 단어가 min_df에 설정된 값보다 적은 빈도수를 가진다면 이 단어는 크게 중요하지 않거나 가비지(garbage)성 단어일 확률이 높다. min_df = 2와 같이 정수 값을 가지면 전체 문서에 걸쳐서 2번 이하로 나타나는 단어는 피처로 추출하지 않는다. min_df = 0.02와 같이 부동소수점 값(0.0~1.0)을 가지면 전체 문서에 걸쳐서 하위 2% 이하의 빈도수를 가지는 단어는 피처로 추출하지 않는다.
max_features	피처로 추출하는 피처의 개수를 제한하면 정수로 값을 지정한다. 가령 max_features = 2000으로 지정할 경우 가장 높은 빈도를 가지는 단어 순으로 정렬해 2000개까지만 피처로 추출한다.
stop_words	‘english’로 지정하면 영어의 스톱 워드로 지정된 단어는 추출에서 제외한다.
ngram_range	Bag of Words 모델의 단어 순서를 어느 정도 보강하기 위한 n-gram 범위를 설정한다. 튜플 형태로 (범위 최솟값, 범위 최댓값)을 지정한다. 예를 들어 (1, 1)로 지정하면 토큰화된 단어를 1개씩 피처로 추출한다. (1, 2)로 지정하면 토큰화된 단어를 1개씩(minimum 1), 그리고 순서대로 2개씩(maximum 2) 묶어서 피처로 추출한다.
analyzer	피처 추출을 수행한 단위를 지정한다. 디폴트는 ‘word’이다. Word가 아니라 character의 특정 범위를 피처로 만드는 특정한 경우 등을 적용할 때 사용된다.
token_pattern	토큰화를 수행하는 정규 표현식 패턴을 지정한다. 디폴트 값은 ‘\b\w\w+\b로, 공백 또는 개행 문자 등으로 구분된 단어 분리자(\b) 사이의 2문자(문자 또는 숫자) 이상의 단어(Word)를 토큰으로 분리한다. analyzer = ‘word’로 설정했을 때만 변경 가능하나 디폴트 값을 변경할 경우는 거의 발생하지 않는다. 어근 추출시 외부 함수를 사용할 경우 해당 외부 함수를 token_pattern의 인자로 사용한다.
tokenizer	토큰화를 별도의 커스텀 함수로 이용시 적용한다. 일반적으로 CountTokenizer 클래스에서 어근 변환 시 이를 수행하는 별도의 함수를 tokenizer 파라미터에 적용하면 된다.
lower_case	모든 문자를 소문자로 변경할 것인지를 설정한다. 디폴트 값은 True이다.

CountVectorizer를 이용한 피처 벡터화

사전 데이터 가공
모든 문자를 소문자로 변환하는 등의 사전 작업 수행 (Default로 lowercase = True)
토큰화
Default는 단어 기준(analyzer = True)이며 n_gram_range를 반영하여 토큰화 수행
텍스트 정규화
stop Words 필터리망 수행
Stemmer, Lemmatize는 CountVectorizer 자체에서는 지원되지 않음
이를 위한 함수를 만들거나 외부 패키지로 미리 Text Normalization 수행 필요
피처 벡터화
max_df, min_df, max_features 등의 파라미터를 반영하여 Token된 단어들을 feature extraction후 vectorization 적용

<실습>

text_sample_01 = 'The Matrix is everywhere its all around us, here even in this room. \
                  You can see it out your window or on your television. \
                  You feel it when you go to work, or go to church or pay your taxes.'
text_sample_02 = 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe\
                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'
text=[]
text.append(text_sample_01); text.append(text_sample_02)
print(text,"\n", len(text))

['The Matrix is everywhere its all around us, here even in this room.                   You can see it out your window or on your television.                   You feel it when you go to work, or go to church or pay your taxes.', 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'] 
 2

CountVectorizer객체 생성 후 fit(), transform()으로 텍스트에 대한 feature vectorization 수행

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization으로 feature extraction 변환 수행. 
cnt_vect = CountVectorizer()
cnt_vect.fit(text)

CountVectorizer()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

ftr_vect = cnt_vect.transform(text)

피처 벡터화 후 데이터 유형 및 여러 속성 확인

print(type(ftr_vect), ftr_vect.shape)
print(ftr_vect)

<class 'scipy.sparse._csr.csr_matrix'> (2, 51)
  (0, 0)	1
  (0, 2)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 13)	2
  (0, 15)	1
  (0, 18)	1
  (0, 19)	1
  (0, 20)	2
  (0, 21)	1
  (0, 22)	1
  (0, 23)	1
  (0, 24)	3
  (0, 25)	1
  (0, 26)	1
  (0, 30)	1
  (0, 31)	1
  (0, 36)	1
  (0, 37)	1
  (0, 38)	1
  (0, 39)	1
  (0, 40)	2
  :	:
  (1, 1)	4
  (1, 3)	1
  (1, 4)	2
  (1, 5)	1
  (1, 8)	1
  (1, 9)	1
  (1, 14)	1
  (1, 16)	1
  (1, 17)	1
  (1, 18)	2
  (1, 27)	2
  (1, 28)	1
  (1, 29)	1
  (1, 32)	1
  (1, 33)	1
  (1, 34)	1
  (1, 35)	2
  (1, 38)	4
  (1, 40)	1
  (1, 42)	1
  (1, 43)	1
  (1, 44)	1
  (1, 47)	1
  (1, 49)	7
  (1, 50)	1

print(cnt_vect.vocabulary_)

{'the': 38, 'matrix': 22, 'is': 19, 'everywhere': 11, 'its': 21, 'all': 0, 'around': 2, 'us': 41, 'here': 15, 'even': 10, 'in': 18, 'this': 39, 'room': 30, 'you': 49, 'can': 6, 'see': 31, 'it': 20, 'out': 25, 'your': 50, 'window': 46, 'or': 24, 'on': 23, 'television': 37, 'feel': 12, 'when': 45, 'go': 13, 'to': 40, 'work': 48, 'church': 7, 'pay': 26, 'taxes': 36, 'take': 35, 'blue': 5, 'pill': 27, 'and': 1, 'story': 34, 'ends': 9, 'wake': 42, 'bed': 3, 'believe': 4, 'whatever': 44, 'want': 43, 'red': 29, 'stay': 33, 'wonderland': 47, 'show': 32, 'how': 17, 'deep': 8, 'rabbit': 28, 'hole': 16, 'goes': 14}

cnt_vect = CountVectorizer(max_features=5, stop_words='english')
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)

<class 'scipy.sparse._csr.csr_matrix'> (2, 5)
{'window': 4, 'pill': 1, 'wake': 2, 'believe': 0, 'want': 3}

ngram_range 확인

cnt_vect = CountVectorizer(ngram_range=(1,3))
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)

<class 'scipy.sparse._csr.csr_matrix'> (2, 201)
{'the': 129, 'matrix': 77, 'is': 66, 'everywhere': 40, 'its': 74, 'all': 0, 'around': 11, 'us': 150, 'here': 51, 'even': 37, 'in': 59, 'this': 140, 'room': 106, 'you': 174, 'can': 25, 'see': 109, 'it': 69, 'out': 90, 'your': 193, 'window': 165, 'or': 83, 'on': 80, 'television': 126, 'feel': 43, 'when': 162, 'go': 46, 'to': 143, 'work': 171, 'church': 28, 'pay': 93, 'taxes': 125, 'the matrix': 132, 'matrix is': 78, 'is everywhere': 67, 'everywhere its': 41, 'its all': 75, 'all around': 1, 'around us': 12, 'us here': 151, 'here even': 52, 'even in': 38, 'in this': 60, 'this room': 141, 'room you': 107, 'you can': 177, 'can see': 26, 'see it': 110, 'it out': 70, 'out your': 91, 'your window': 199, 'window or': 166, 'or on': 86, 'on your': 81, 'your television': 197, 'television you': 127, 'you feel': 179, 'feel it': 44, 'it when': 72, 'when you': 163, 'you go': 181, 'go to': 47, 'to work': 148, 'work or': 172, 'or go': 84, 'to church': 146, 'church or': 29, 'or pay': 88, 'pay your': 94, 'your taxes': 196, 'the matrix is': 133, 'matrix is everywhere': 79, 'is everywhere its': 68, 'everywhere its all': 42, 'its all around': 76, 'all around us': 2, 'around us here': 13, 'us here even': 152, 'here even in': 53, 'even in this': 39, 'in this room': 61, 'this room you': 142, 'room you can': 108, 'you can see': 178, 'can see it': 27, 'see it out': 111, 'it out your': 71, 'out your window': 92, 'your window or': 200, 'window or on': 167, 'or on your': 87, 'on your television': 82, 'your television you': 198, 'television you feel': 128, 'you feel it': 180, 'feel it when': 45, 'it when you': 73, 'when you go': 164, 'you go to': 182, 'go to work': 49, 'to work or': 149, 'work or go': 173, 'or go to': 85, 'go to church': 48, 'to church or': 147, 'church or pay': 30, 'or pay your': 89, 'pay your taxes': 95, 'take': 121, 'blue': 22, 'pill': 96, 'and': 3, 'story': 118, 'ends': 34, 'wake': 153, 'bed': 14, 'believe': 17, 'whatever': 159, 'want': 156, 'red': 103, 'stay': 115, 'wonderland': 168, 'show': 112, 'how': 56, 'deep': 31, 'rabbit': 100, 'hole': 54, 'goes': 50, 'you take': 187, 'take the': 122, 'the blue': 130, 'blue pill': 23, 'pill and': 97, 'and the': 6, 'the story': 138, 'story ends': 119, 'ends you': 35, 'you wake': 189, 'wake in': 154, 'in your': 64, 'your bed': 194, 'bed and': 15, 'and you': 8, 'you believe': 175, 'believe whatever': 18, 'whatever you': 160, 'you want': 191, 'want to': 157, 'to believe': 144, 'believe you': 20, 'the red': 136, 'red pill': 104, 'you stay': 185, 'stay in': 116, 'in wonderland': 62, 'wonderland and': 169, 'and show': 4, 'show you': 113, 'you how': 183, 'how deep': 57, 'deep the': 32, 'the rabbit': 134, 'rabbit hole': 101, 'hole goes': 55, 'you take the': 188, 'take the blue': 123, 'the blue pill': 131, 'blue pill and': 24, 'pill and the': 98, 'and the story': 7, 'the story ends': 139, 'story ends you': 120, 'ends you wake': 36, 'you wake in': 190, 'wake in your': 155, 'in your bed': 65, 'your bed and': 195, 'bed and you': 16, 'and you believe': 9, 'you believe whatever': 176, 'believe whatever you': 19, 'whatever you want': 161, 'you want to': 192, 'want to believe': 158, 'to believe you': 145, 'believe you take': 21, 'take the red': 124, 'the red pill': 137, 'red pill and': 105, 'pill and you': 99, 'and you stay': 10, 'you stay in': 186, 'stay in wonderland': 117, 'in wonderland and': 63, 'wonderland and show': 170, 'and show you': 5, 'show you how': 114, 'you how deep': 184, 'how deep the': 58, 'deep the rabbit': 33, 'the rabbit hole': 135, 'rabbit hole goes': 102}

BOW 벡터화를 위한 희소 행렬

BOW의 Vectorization 모델은 너무 많은 0 값이 메모리 공간에 할당되어 많은 메모리 공간이 필요하며 또는 연산 시에도 데이터 액세스를 위한 많은 시간이 소모된다.

파이썬에서는 희소 행렬을 COO, CSR 형식으로 변환하기 위해서 Scipy의 coo_matrix( ), csr_matrix( ) 함수를 이용한다.

희소 행렬 - COO 형식

Coordinate(좌표) 방식을 의미하며 0이 아닌 데이터만 별도의 배열(Array)에 저장하고 그 데이터를 가리키는 행과 열의 위치를 별도의 배열로 저장하는 방식

<실습>

import numpy as np

dense = np.array( [ [ 3, 0, 1 ], 
                    [0, 2, 0 ] ] )

from scipy import sparse

# 0 이 아닌 데이터 추출
data = np.array([3,1,2])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0,0,1])
col_pos = np.array([0,2,1])

# sparse 패키지의 coo_matrix를 이용하여 COO 형식으로 희소 행렬 생성
sparse_coo = sparse.coo_matrix((data, (row_pos,col_pos)))

print(type(sparse_coo))
print(sparse_coo)
dense01=sparse_coo.toarray()
print(type(dense01),"\n", dense01)

<class 'scipy.sparse._coo.coo_matrix'>
  (0, 0)	3
  (0, 2)	1
  (1, 1)	2
<class 'numpy.ndarray'> 
 [[3 0 1]
 [0 2 0]]

희소 행렬 - CSR 형식

COO 형식이 위치 배열값을 중복적으로 가지는 문제를 해결한 방식
일반적으로 CSR 형식이 COO보다 많이 사용됨

<실습>

from scipy import sparse

dense2 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

# 0 이 아닌 데이터 추출
data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])

# COO 형식으로 변환 
sparse_coo = sparse.coo_matrix((data2, (row_pos,col_pos)))

# 행 위치 배열의 고유한 값들의 시작 위치 인덱스를 배열로 생성
row_pos_ind = np.array([0, 2, 7, 9, 10, 12, 13])

# CSR 형식으로 변환 
sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))

print('COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_coo.toarray())
print('CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_csr.toarray())

COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]
CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]

print(sparse_csr)

  (0, 2)	1
  (0, 5)	5
  (1, 0)	1
  (1, 1)	4
  (1, 3)	3
  (1, 4)	2
  (1, 5)	5
  (2, 1)	6
  (2, 3)	3
  (3, 0)	2
  (4, 3)	7
  (4, 5)	8
  (5, 0)	1

dense3 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

coo = sparse.coo_matrix(dense3)
csr = sparse.csr_matrix(dense3)

Twitter Facebook LinkedIn

Sieun Kim

[Python 머신러닝] 08-3 Bag of Words - BOW

텍스트 분석

텍스트 피처 벡터화 유형

Bag of Words - BOW

Bag-of-Words 구조

BOW 장단점

BOW 피처 벡터화

BOW 피처 벡터화 유형

TF-IDF(Tearm Frequency Inverse Document Frequency)

사이킷런의 Count 및 TF-IDF 벡터화 구현: CountVectorizer, TfidfVectorizer

사이킷런 CountVectorizer 초기화 파라미터

CountVectorizer를 이용한 피처 벡터화

<실습>

BOW 벡터화를 위한 희소 행렬

희소 행렬 - COO 형식

<실습>

희소 행렬 - CSR 형식

<실습>

공유하기

참고

[CS] 1.1 디자인 패턴

[BIGDATA] 3-2 쿼리 엔진

[BIGDATA] 3-1 대규모 분산 처리의 프레임워크

[SQL] 08-3 GUI 응용 프로그램