[Python 머신러닝] 07-1 K-평균 알고리즘 이해

2024-05-17 10 분 소요

군집화

K-평균 알고리즘 이해

군집화(Clustering)

데이터 포인트들을 별개의 군집으로 그룹화 하는 것을 의미한다.
유사성이 높은 데이터들을 동일한 그룹으로 분류하고 서로 다른 군집들이 상이성을 가지도록 그룹화 한다.

군집화 활용 분야

고객, 마켓, 브랜드, 사회 경제 활동 세분화(Segmetataion)
Image 검출, 세분화, 트랙킹
이상 검출(Abnomaly detection)

군집화 알고리즘

K-Means
Mean Shift
Gaussian Mixture Model
DBSCAN

K-Means Clustering

군집 중심점(Centroid) 기반 클러스터링

2개의 군집 중심점 설정 -> 각 데이터들은 가장 가까운 중심점에 소속 -> 중심점에 할당된 데이터들의 평균 중심으로 중심점 이동 -> 각 데이터들은 이동된 중심점을 기준으로 가장 가까운 중심점에 소속 -> 다시 중심점에 할당된 데이터들의 평균 중심으로 중심점 이동

중심점을 이동하였지만 데이터들의 중심점 소속 변경이 없으면 군집화 완료

K-Means의 장점과 단점

장점
- 일반적인 군집화에서 가장 많이 활용되는 알고리즘이다.
- 알고리즘이 쉽고 간결하다.
- 대용량 데이터에도 활용이 가능하다.
단점
- 거리 기반 알고리즘으로 속성의 개수가 매우 많을 경우 군집화 정확도가 떨어진다. (이를 위해 PCA로 차원 축소를 적용해야 할 수도 있다.)
- 반복을 수행하는데 반복 횟수가 많을 경우 수행 시간이 느려진다.
- 이상치(Outlier) 데이터에 취약하다.

사이킷런 KMeans 클래스 소개

사이킷런 KMeans 클래스

사이킷런 패키지는 K-평균을 구현하기 위해 KMeans 클래스를 제공한다.
KMeans 클래스는 다음과 같은 초기화 파라미터를 가지고 있다.

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

K-평균을 이용한 붓꽃 데이터 세트 군집화

<실습>

from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

iris = load_iris()
print('target name:', iris.target_names)
# 보다 편리한 데이터 Handling을 위해 DataFrame으로 변환
irisDF = pd.DataFrame(data=iris.data, columns=['sepal_length','sepal_width','petal_length','petal_width'])
irisDF.head(3)

target name: ['setosa' 'versicolor' 'virginica']

	sepal_length	sepal_width	petal_length	petal_width
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2

KMeans 객체를 생성하고 군집화 수행

labels_ 속성을 통해 각 데이터 포인트별로 할당된 군집 중심점(Centroid)확인
fit_predict(), fit_transform() 수행 결과 확인.

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300,random_state=0)
kmeans.fit(irisDF)

KMeans(n_clusters=3, random_state=0)

print(kmeans.labels_)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
0]

kmeans.fit_predict(irisDF)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])

kmeans.fit_transform(irisDF)

array([[3.41925061, 0.14135063, 5.0595416 ],
       [3.39857426, 0.44763825, 5.11494335],
       [3.56935666, 0.4171091 , 5.27935534],
       [3.42240962, 0.52533799, 5.15358977],
       [3.46726403, 0.18862662, 5.10433388],
       [3.14673162, 0.67703767, 4.68148797],
       [3.51650264, 0.4151867 , 5.21147652],
       [3.33654987, 0.06618157, 5.00252706],
       [3.57233779, 0.80745278, 5.32798107],
       [3.3583767 , 0.37627118, 5.06790865],
       [3.32449131, 0.4824728 , 4.89806763],
       [3.31126872, 0.25373214, 4.9966845 ],
       [3.46661272, 0.50077939, 5.19103612],
       [3.90578362, 0.91322505, 5.65173594],
       [3.646649  , 1.01409073, 5.10804455],
       [3.49427881, 1.20481534, 4.88564095],
       [3.495248  , 0.6542018 , 5.03090587],
       [3.38444981, 0.1441527 , 5.02342022],
       [3.11245944, 0.82436642, 4.61792995],
       [3.37738931, 0.38933276, 4.97213426],
       [3.07471224, 0.46344363, 4.6955761 ],
       [3.31506588, 0.3286031 , 4.9236821 ],
       [3.93167253, 0.64029681, 5.59713396],
       [3.01233762, 0.38259639, 4.68193765],
       [3.06241269, 0.48701129, 4.75095704],
       [3.19414543, 0.45208406, 4.90772894],
       [3.17967089, 0.20875823, 4.84545508],
       [3.30941724, 0.21536016, 4.93969029],
       [3.37648183, 0.21066561, 5.01833618],
       [3.31272968, 0.40838707, 5.02954567],
       [3.26550651, 0.41373905, 4.98608729],
       [3.18083736, 0.42565244, 4.79550372],
       [3.53142353, 0.71552778, 5.06520776],
       [3.57102821, 0.91977171, 5.04438334],
       [3.31992769, 0.34982853, 5.02985959],
       [3.56904033, 0.35039977, 5.25071556],
       [3.43783276, 0.52685861, 5.02368214],
       [3.53114948, 0.25686572, 5.17865184],
       [3.66205264, 0.76077592, 5.40750095],
       [3.31092773, 0.11480418, 4.9664149 ],
       [3.49764675, 0.18541845, 5.14520862],
       [3.60850034, 1.24803045, 5.38423754],
       [3.68120561, 0.6690142 , 5.40847417],
       [3.14278239, 0.38675574, 4.78803478],
       [3.00585191, 0.60231221, 4.59828494],
       [3.39468045, 0.48205809, 5.11844067],
       [3.32788568, 0.41034132, 4.92421655],
       [3.51879523, 0.47199576, 5.23766854],
       [3.34104251, 0.40494444, 4.92859681],
       [3.40601705, 0.14959947, 5.08216833],
       [1.22697525, 3.98049997, 1.25489071],
       [0.684141  , 3.57731464, 1.44477759],
       [1.17527644, 4.13366423, 1.01903626],
       [0.73153652, 3.01144152, 2.45978458],
       [0.63853451, 3.74779669, 1.3520017 ],
       [0.26937898, 3.34908644, 1.88009327],
       [0.76452634, 3.74283048, 1.28902785],
       [1.58388575, 2.23937045, 3.37155487],
       [0.75582717, 3.71181627, 1.41123804],
       [0.85984838, 2.8005678 , 2.58955659],
       [1.53611907, 2.60022691, 3.27864111],
       [0.32426175, 3.17042268, 1.90055758],
       [0.80841374, 3.08317693, 2.38073698],
       [0.39674141, 3.64581678, 1.45909603],
       [0.87269542, 2.51268382, 2.60303733],
       [0.87306498, 3.59732957, 1.50822767],
       [0.41229163, 3.36719171, 1.85387593],
       [0.53579956, 2.94753796, 2.25517257],
       [0.6367639 , 3.70615434, 1.74778451],
       [0.71254917, 2.80841236, 2.49557781],
       [0.7093731 , 3.79583719, 1.37094403],
       [0.46349013, 3.02383531, 2.06563694],
       [0.69373966, 3.99098735, 1.29106776],
       [0.43661144, 3.60360653, 1.57547425],
       [0.54593856, 3.37448959, 1.70495043],
       [0.74313017, 3.56196294, 1.52298639],
       [0.98798453, 4.01083283, 1.18965415],
       [1.06739835, 4.20528001, 0.84636259],
       [0.21993519, 3.47401497, 1.61913335],
       [1.0243726 , 2.42676328, 2.77868071],
       [0.86396528, 2.73795179, 2.6440625 ],
       [0.97566381, 2.62259032, 2.75566654],
       [0.55763082, 2.83096803, 2.32254696],
       [0.73395781, 4.07263797, 1.22324554],
       [0.57500396, 3.33772078, 1.9942056 ],
       [0.68790275, 3.47153856, 1.61049622],
       [0.92700552, 3.87741924, 1.19803047],
       [0.61459444, 3.56224367, 1.81572464],
       [0.50830256, 2.93359506, 2.20430516],
       [0.6291191 , 2.94237659, 2.40438484],
       [0.48790256, 3.23598208, 2.14635877],
       [0.38266958, 3.5438369 , 1.52402278],
       [0.49185351, 2.94407541, 2.26286106],
       [1.5485635 , 2.28455247, 3.33648305],
       [0.3856087 , 3.08064604, 2.16211718],
       [0.44284695, 3.01190637, 2.11299567],
       [0.3449879 , 3.0607156 , 2.07973003],
       [0.37241653, 3.29690461, 1.76829182],
       [1.66064034, 1.99117553, 3.44291999],
       [0.38393196, 2.99098312, 2.16527941],
       [2.0445799 , 5.23113563, 0.77731871],
       [0.85382472, 4.13898297, 1.29757391],
       [2.05245342, 5.26319105, 0.30610139],
       [1.33089245, 4.63585807, 0.65293923],
       [1.72813078, 5.00515534, 0.38458885],
       [2.87401886, 6.06204421, 1.14225684],
       [1.07101875, 3.49513662, 2.4108337 ],
       [2.39730707, 5.6002125 , 0.78573677],
       [1.67668563, 4.9963967 , 0.65454939],
       [2.54158648, 5.60667281, 0.8435596 ],
       [1.17541367, 4.31225927, 0.74552218],
       [1.13563278, 4.46533089, 0.75289837],
       [1.59322675, 4.81086063, 0.25958095],
       [0.88917352, 4.11543193, 1.48572618],
       [1.20227628, 4.34736472, 1.30303821],
       [1.42273608, 4.57650303, 0.68288333],
       [1.33403966, 4.59734489, 0.50991553],
       [3.20105585, 6.21697515, 1.47791217],
       [3.20759942, 6.46018421, 1.52971038],
       [0.82617494, 4.07258886, 1.53708992],
       [1.91251832, 5.08121836, 0.26952816],
       [0.81891975, 3.95519658, 1.5334904 ],
       [2.9794431 , 6.17779734, 1.31149299],
       [0.74269596, 4.05452587, 1.10668455],
       [1.75847731, 4.92787784, 0.27627819],
       [2.14580999, 5.27958142, 0.52766931],
       [0.62526165, 3.92137476, 1.20765678],
       [0.70228926, 3.95155412, 1.16212743],
       [1.4663925 , 4.78518338, 0.54629196],
       [1.93773659, 5.06442297, 0.59428255],
       [2.31885342, 5.51111422, 0.7312665 ],
       [3.07340053, 5.99783127, 1.43802246],
       [1.51444141, 4.8248088 , 0.5605572 ],
       [0.81536685, 4.10808715, 1.05631592],
       [1.23209127, 4.50967626, 1.12133058],
       [2.6381171 , 5.75940796, 0.95311851],
       [1.72401927, 4.84127876, 0.73306362],
       [1.31541133, 4.557541  , 0.57903109],
       [0.61011676, 3.83775716, 1.29960041],
       [1.60532899, 4.7581488 , 0.34794609],
       [1.77481954, 4.97393004, 0.3893492 ],
       [1.53937059, 4.59878027, 0.68403844],
       [0.85382472, 4.13898297, 1.29757391],
       [2.00764279, 5.21394093, 0.30952112],
       [1.94554509, 5.09187392, 0.50939919],
       [1.44957743, 4.60916261, 0.61173881],
       [0.89747884, 4.21767471, 1.10072376],
       [1.17993324, 4.41184542, 0.65334214],
       [1.50889317, 4.59925864, 0.83572418],
       [0.83452741, 4.0782815 , 1.1805499 ]])

군집화 결과를 irisDF에 ‘cluster’ 컬럼으로 추가하고 target 값과 결과 비교

iris.target, iris.target_names

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 array(['setosa', 'versicolor', 'virginica'], dtype='<U10'))

irisDF['target'] = iris.target
irisDF['cluster']=kmeans.labels_
irisDF.head(10)

	sepal_length	sepal_width	petal_length	petal_width	cluster
0	5.1	3.5	1.4	0.2	1
1	4.9	3.0	1.4	0.2	1
2	4.7	3.2	1.3	0.2	1
3	4.6	3.1	1.5	0.2	1
4	5.0	3.6	1.4	0.2	1
5	5.4	3.9	1.7	0.4	1
6	4.6	3.4	1.4	0.3	1
7	5.0	3.4	1.5	0.2	1
8	4.4	2.9	1.4	0.2	1
9	4.9	3.1	1.5	0.1	1

irisDF['target'] = iris.target
irisDF['cluster']=kmeans.labels_

iris_result = irisDF.groupby(['target','cluster'])['sepal_length'].count()
print(iris_result)

target  cluster
0       1          50
1       0          48
        2           2
2       0          14
        2          36
Name: sepal_length, dtype: int64

-> setosa: cluster 1에 50개 모두 맵핑
versicolor: cluster 0에 48개, cluster 2에 2개 맵핑
virginica: cluster 0에 14개, cluster 2에 36개 맵핑

2차원 평면에 데이터 포인트별로 군집화된 결과를 나타내기 위해 2차원 PCA값으로 각 데이터 차원축소

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_transformed = pca.fit_transform(iris.data)

irisDF['pca_x'] = pca_transformed[:,0]
irisDF['pca_y'] = pca_transformed[:,1]
irisDF.head(3)

	sepal_length	sepal_width	petal_length	petal_width	cluster	pca_x	pca_y
0	5.1	3.5	1.4	0.2	1	-2.684126	0.319397
1	4.9	3.0	1.4	0.2	1	-2.714142	-0.177001
2	4.7	3.2	1.3	0.2	1	-2.888991	-0.144949

irisDF.head(10)

	sepal_length	sepal_width	petal_length	petal_width	cluster	pca_x	pca_y
0	5.1	3.5	1.4	0.2	1	-2.684126	0.319397
1	4.9	3.0	1.4	0.2	1	-2.714142	-0.177001
2	4.7	3.2	1.3	0.2	1	-2.888991	-0.144949
3	4.6	3.1	1.5	0.2	1	-2.745343	-0.318299
4	5.0	3.6	1.4	0.2	1	-2.728717	0.326755
5	5.4	3.9	1.7	0.4	1	-2.280860	0.741330
6	4.6	3.4	1.4	0.3	1	-2.820538	-0.089461
7	5.0	3.4	1.5	0.2	1	-2.626145	0.163385
8	4.4	2.9	1.4	0.2	1	-2.886383	-0.578312
9	4.9	3.1	1.5	0.1	1	-2.672756	-0.113774

# cluster 값이 0, 1, 2 인 경우마다 별도의 Index로 추출
marker0_ind = irisDF[irisDF['cluster']==0].index
marker1_ind = irisDF[irisDF['cluster']==1].index
marker2_ind = irisDF[irisDF['cluster']==2].index

# cluster값 0, 1, 2에 해당하는 Index로 각 cluster 레벨의 pca_x, pca_y 값 추출. o, s, ^ 로 marker 표시
plt.scatter(x=irisDF.loc[marker0_ind,'pca_x'], y=irisDF.loc[marker0_ind,'pca_y'], marker='o') 
plt.scatter(x=irisDF.loc[marker1_ind,'pca_x'], y=irisDF.loc[marker1_ind,'pca_y'], marker='s')
plt.scatter(x=irisDF.loc[marker2_ind,'pca_x'], y=irisDF.loc[marker2_ind,'pca_y'], marker='^')

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('3 Clusters Visualization by 2 PCA Components')
plt.show()

output_14_0

plt.scatter(x=irisDF.loc[:, 'pca_x'], y=irisDF.loc[:, 'pca_y'], c=irisDF['cluster']) 

<matplotlib.collections.PathCollection at 0x256d2570fd0>

output_15_17

군집화 알고리즘 테스트를 위한 데이터 생성

<실습>

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline

X, y = make_blobs(n_samples=200, n_features=2, centers=3, cluster_std=0.8, random_state=0)
print(X.shape, y.shape)

# y target 값의 분포를 확인
unique, counts = np.unique(y, return_counts=True)
print(unique,counts)

(200, 2) (200,)
[0 1 2] [67 67 66]

n_samples: 생성할 총 데이터의 개수입니다. 디폴트는 100개입니다.
n_features: 데이터의 피처 개수입니다. 시각화를 목표로 할 경우 2개로 설정해 보통 첫 번째 피처는 x 좌표, 두 번째 피처 는 y 좌표상에 표현합니다.
centers: int 값, 예를 들어 3으로 설정하면 군집의 개수를 나타냅니다. 그렇지 않고 ndarray 형태로 표현할 경우 개별 군 집 중심점의 좌표를 의미합니다.
cluster_std: 생성될 군집 데이터의 표준 편차를 의미합니다. 만일 float 값 0.8과 같은 형태로 지정하면 군집 내에서 데이 터가 표준편차 0.8을 가진 값으로 만들어집니다.
[0.8, 1,2, 0.6]과 같은 형태로 표현되면 3개의 군집에서 첫 번째 군집 내 데이터의 표준편차는 0.8, 두 번째 군집 내 데이터의 표준 편차는 1.2, 세 번째 군집 내 데이터의 표준편차는 0.6으로 만듭 니다.
군집별로 서로 다른 표준 편차를 가진 데이터 세트를 만들 때 사용합니다

import pandas as pd

clusterDF = pd.DataFrame(data=X, columns=['ftr1', 'ftr2'])
clusterDF['target'] = y
clusterDF.head(10)

	ftr1	ftr2	target
0	-1.692427	3.622025	2
1	0.697940	4.428867	0
2	1.100228	4.606317	0
3	-1.448724	3.384245	2
4	1.214861	5.364896	0
5	-0.908302	1.970778	2
6	2.472119	0.437033	1
7	1.656842	2.441289	1
8	1.077800	4.625379	0
9	-1.679427	2.602003	2

make_blob()으로 만들어진 데이터 포인트들을 시각화

target_list = np.unique(y)
# 각 target별 scatter plot 의 marker 값들. 
markers=['o', 's', '^', 'P','D','H','x']
# 3개의 cluster 영역으로 구분한 데이터 셋을 생성했으므로 target_list는 [0,1,2]
# target==0, target==1, target==2 로 scatter plot을 marker별로 생성. 
for target in target_list:
    target_cluster = clusterDF[clusterDF['target']==target]
    plt.scatter(x=target_cluster['ftr1'], y=target_cluster['ftr2'], edgecolor='k', marker=markers[target] )
plt.show()

output_21_07

target_list = np.unique(y)
plt.scatter(x=clusterDF['ftr1'], y=clusterDF['ftr2'], edgecolor='k', c=y )

<matplotlib.collections.PathCollection at 0x256d32e3a60>

output_22_1

clusterDF

	ftr1	ftr2	target	kmeans_label
0	-1.692427	3.622025	2	1
1	0.697940	4.428867	0	0
2	1.100228	4.606317	0	0
3	-1.448724	3.384245	2	1
4	1.214861	5.364896	0	0
...	...	...	...	...
195	2.956576	0.033718	1	2
196	-2.074113	4.245523	2	1
197	2.783411	1.151438	1	2
198	1.226724	3.620511	0	0
199	1.474790	-0.209028	1	2

200 rows × 4 columns

K-Means 클러스터링을 수행하고 개별 클러스터의 중심 위치를 시각화

kmeans.cluster_centers_

array([[ 0.990103  ,  4.44666506],
       [-1.70636483,  2.92759224],
       [ 1.95763312,  0.81041752]])

# KMeans 객체를 이용하여 X 데이터를 K-Means 클러스터링 수행 
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, random_state=0)
cluster_labels = kmeans.fit_predict(X)
clusterDF['kmeans_label']  = cluster_labels

#cluster_centers_ 는 개별 클러스터의 중심 위치 좌표 시각화를 위해 추출
centers = kmeans.cluster_centers_
unique_labels = np.unique(cluster_labels)
markers=['o', 's', '^', 'P','D','H','x']

# 군집된 label 유형별로 iteration 하면서 marker 별로 scatter plot 수행. 
for label in unique_labels:
    label_cluster = clusterDF[clusterDF['kmeans_label']==label]
    plt.scatter(x=label_cluster['ftr1'], y=label_cluster['ftr2'], edgecolor='k', 
                marker=markers[label] )
    
    center_x_y = centers[label]
    
    # 군집별 중심 위치 좌표 시각화 
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=200, color='white',
                alpha=0.9, edgecolor='k', marker=markers[label])
    plt.scatter(x=center_x_y[0], y=center_x_y[1], s=70, color='k', edgecolor='k', 
                marker='$%d$' % label)

plt.show()

output_26_0

target_list = np.unique(y)
plt.scatter(x=clusterDF['ftr1'], y=clusterDF['ftr2'], edgecolor='k', c=y )

<matplotlib.collections.PathCollection at 0x256d3463ca0>

output_27_1

kmeans.cluster_centers_

array([[ 0.990103  ,  4.44666506],
       [-1.70636483,  2.92759224],
       [ 1.95763312,  0.81041752]])

print(clusterDF.groupby('target')['kmeans_label'].value_counts())

target  kmeans_label
0       0               66
        1                1
1       2               67
2       1               65
        2                1
Name: kmeans_label, dtype: int64

Twitter Facebook LinkedIn

Sieun Kim

[Python 머신러닝] 07-1 K-평균 알고리즘 이해

군집화

K-평균 알고리즘 이해

군집화(Clustering)

군집화 활용 분야

군집화 알고리즘

K-Means Clustering

K-Means의 장점과 단점

사이킷런 KMeans 클래스 소개

사이킷런 KMeans 클래스

K-평균을 이용한 붓꽃 데이터 세트 군집화

<실습>

군집화 알고리즘 테스트를 위한 데이터 생성

<실습>

공유하기

참고

[CS] 1.1 디자인 패턴

[BIGDATA] 3-2 쿼리 엔진

[BIGDATA] 3-1 대규모 분산 처리의 프레임워크

[SQL] 08-3 GUI 응용 프로그램

	sepal_length	sepal_width	petal_length	petal_width	cluster
0	5.1	3.5	1.4	0.2	1
1	4.9	3.0	1.4	0.2	1
2	4.7	3.2	1.3	0.2	1
3	4.6	3.1	1.5	0.2	1
4	5.0	3.6	1.4	0.2	1
5	5.4	3.9	1.7	0.4	1
6	4.6	3.4	1.4	0.3	1
7	5.0	3.4	1.5	0.2	1
8	4.4	2.9	1.4	0.2	1
9	4.9	3.1	1.5	0.1	1

	sepal_length	sepal_width	petal_length	petal_width	cluster
0	5.1	3.5	1.4	0.2	1
1	4.9	3.0	1.4	0.2	1
2	4.7	3.2	1.3	0.2	1
3	4.6	3.1	1.5	0.2	1
4	5.0	3.6	1.4	0.2	1
5	5.4	3.9	1.7	0.4	1
6	4.6	3.4	1.4	0.3	1
7	5.0	3.4	1.5	0.2	1
8	4.4	2.9	1.4	0.2	1
9	4.9	3.1	1.5	0.1	1

	sepal_length	sepal_width	petal_length	petal_width	cluster
0	5.1	3.5	1.4	0.2	1
1	4.9	3.0	1.4	0.2	1
2	4.7	3.2	1.3	0.2	1
3	4.6	3.1	1.5	0.2	1
4	5.0	3.6	1.4	0.2	1
5	5.4	3.9	1.7	0.4	1
6	4.6	3.4	1.4	0.3	1
7	5.0	3.4	1.5	0.2	1
8	4.4	2.9	1.4	0.2	1
9	4.9	3.1	1.5	0.1	1