The k-Means Algorithm
Using scikit-learn
Important KMeans constructor parameters
Important KMeans attributes:
Important KMeans instance methods:
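As a rough orientation for the three headings above, the sketch below touches a few constructor parameters (n_clusters, random_state, n_init), the fitted attributes (cluster_centers_, labels_, inertia_, n_iter_) and the fit() and predict() methods. The data is synthetic and the variable names are assumptions made for illustration; it is not the student data used later.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Constructor parameters: number of clusters, random seed, number of restarts.
model = KMeans(n_clusters=2, random_state=42, n_init=10)

# Instance methods: fit() learns the centroids, predict() assigns new points.
model.fit(X)
new_labels = model.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))

# Attributes that exist only after fitting.
print(model.cluster_centers_.shape)  # (n_clusters, n_features)
print(model.labels_.shape)           # one label per training sample
print(model.inertia_)                # sum of squared distances to nearest centroid
print(model.n_iter_)                 # iterations used by the best run
print(new_labels)
```

The cluster numbers themselves are arbitrary: which blob ends up as cluster 0 depends on the initialization, so only the grouping is meaningful.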
Part of the test data:
chinese | english | math |
77 | 89 | 63 |
73 | 40 | 60 |
69 | 57 | 50 |
85 | 67 | 60 |
55 | 55 | 55 |
80 | 84 | 83 |
80 | 70 | 70 |
60 | 61 | 60 |
60 | 80 | 70 |
75 | 91 | 53 |
62 | 62 | 67 |
66 | 75 | 75 |
67 | 40 | 89 |
72 | 60 | 42 |
74 | 62 | 67 |
78 | 86 | 85 |
70 | 63 | 60 |
78 | 80 | 69 |
82 | 82 | 78 |
In Excel the data occupies rows 1 to 46, and row 1 is the header row, so 45 data rows remain:
df.shape = (45, 3)
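For readers without the CSV file, the 19 rows listed above can be typed directly into a DataFrame. This is only a sketch: df_sample is a hypothetical name, and it holds just the visible part of the data, not the full 45 rows.

```python
import pandas as pd

# The 19 rows displayed above; the full file has 45 rows and is not reproduced here.
rows = [
    (77, 89, 63), (73, 40, 60), (69, 57, 50), (85, 67, 60), (55, 55, 55),
    (80, 84, 83), (80, 70, 70), (60, 61, 60), (60, 80, 70), (75, 91, 53),
    (62, 62, 67), (66, 75, 75), (67, 40, 89), (72, 60, 42), (74, 62, 67),
    (78, 86, 85), (70, 63, 60), (78, 80, 69), (82, 82, 78),
]
df_sample = pd.DataFrame(rows, columns=["chinese", "english", "math"])

print(df_sample.shape)
print(df_sample.describe().loc[["mean", "min", "max"]])
```

Results obtained from this subset will differ from the numbers quoted in the rest of the article, which come from all 45 rows.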
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
# Load the grades; df.shape == (45, 3)
fpath = r"C:\Python\P107\doc\student_grades_real.csv"
df = pd.read_csv(fpath)
# print(df)
# Per-subject centroid coordinates, filled in later for plotting
lisChiCent = []; lisEngCent = []; lisMathCent = []
Xtrain, Xtest = train_test_split(
    df, test_size=0.33, random_state=42, shuffle=True)
# Only one dataset (df) is passed in, so only two pieces come back: Xtrain and Xtest.
# Earlier examples passed two datasets, X and y, which is why they returned
# four pieces: Xtrain, Xtest, ytrain, ytest.
print("Xtrain:", type(Xtrain), Xtrain.shape)
# <class 'pandas.core.frame.DataFrame'> (30, 3)
print("Xtest:", type(Xtest), Xtest.shape)
# <class 'pandas.core.frame.DataFrame'> (15, 3)
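kmean = KMeans(n_clusters=2, random_state=42)
distMatrix = kmean.fit_transform(Xtrain)
# distMatrix is a numpy.ndarray with shape (30, 2):
# Xtrain has 30 rows and the data is split into 2 clusters.
# fit_transform() fits the model and also returns the distance matrix:
# entry [i, j] is the distance from sample i to the center of cluster j.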
cluster_cent = pd.DataFrame(
    kmean.cluster_centers_,
    columns=df.columns.tolist())
# kmean.cluster_centers_
# array([[76.26666667, 83.13333333, 75.13333333],
#        [64.93333333, 59.26666667, 56.8       ]])
# There are 2 groups, and each group's center has three coordinates
# x, y, z (chinese, english, math).
print("cluster center:\n", cluster_cent)
# inertia_ is the sum of squared distances from each sample to its nearest center
print("The inertia (sum of squared distances):", kmean.inertia_)
print("Totally", kmean.n_iter_,
      "iterations executed for finding the stable cluster")
print("The distance matrix from raw data to cluster:\n",
      pd.DataFrame(distMatrix,
                   columns=["to cluster#0", "to cluster#1"]))
XtrainNew = pd.DataFrame(Xtrain,
                         columns=df.columns.tolist())
# columns = ['chinese', 'english', 'math']
# Append the cluster label of every training sample as a new last column
XtrainNew.insert(loc=df.columns.size,
                 column="groupID",
                 value=kmean.labels_)
print("The updated Xtrain:\n", XtrainNew)
# kmean.labels_
# array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
#        1, 0, 0, 0, 1, 1, 1, 0])
# ndarray, shape = (30,)
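To make the relationship between the distance matrix and inertia_ concrete, the following sketch recomputes both by hand. It uses synthetic scores as a stand-in for Xtrain (an assumption, since the CSV is not available here), and the names X, km, manual and manual_inertia are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for Xtrain: 30 samples, 3 subjects, scores between 40 and 100.
rng = np.random.default_rng(1)
X = rng.uniform(40, 100, size=(30, 3))

km = KMeans(n_clusters=2, random_state=42, n_init=10)
dist = km.fit_transform(X)  # shape (30, 2): distance to each centroid

# Recompute the Euclidean distance from every sample to every centroid.
manual = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

# inertia_ is the sum of SQUARED distances to the nearest centroid.
manual_inertia = (manual.min(axis=1) ** 2).sum()

print(dist.shape)
print(np.allclose(dist, manual))
print(np.isclose(km.inertia_, manual_inertia))
```

Note that the distance matrix holds plain distances while inertia_ sums their squares, so inertia_ is not simply the sum of the smaller entry of each row.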
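Partial output
(the distMatrix DataFrame is truncated):
distMatrix.shape = (30, 2)
distMatrix = kmean.fit_transform(Xtrain)
Example: the 16th student
is 21.8 away from the centroid of cluster 0
and 15 away from the centroid of cluster 1.
The two distances are not clearly separated,
but the assignment simply compares the two numbers,
so the student is placed in the nearer cluster, cluster 1.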
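The "just compare the two numbers" rule is an argmin over each row of the distance matrix. The sketch below applies it to the two distances quoted for the 16th student; the other rows of dist_matrix are made-up values for illustration only.

```python
import numpy as np

# Distances quoted above for the 16th student: 21.8 to centroid 0, 15 to centroid 1.
student_16 = np.array([21.8, 15.0])
print(int(np.argmin(student_16)))  # index of the nearer centroid

# Hypothetical distance matrix: first row is the student above, the rest are invented.
dist_matrix = np.array([
    [21.8, 15.0],
    [5.2, 30.1],
    [18.0, 18.5],
])
labels = dist_matrix.argmin(axis=1)

# The gap between the two distances hints at how clear-cut each assignment is.
margin = np.abs(dist_matrix[:, 0] - dist_matrix[:, 1])
print(labels)
print(margin)
```

A small margin, as in the last row, marks a sample that sits almost midway between the two centroids even though it still receives a single hard label.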
Reference solution:
Recommended: learn Python online with hahow: https://igrape.net/30afN
Continuing from the code above:
lisChiCent = []; lisEngCent = []; lisMathCent = []
# Collect each centroid's chinese, english and math coordinates
for item in range(2):
    chi = kmean.cluster_centers_[item, 0]
    lisChiCent.append(chi)
    eng = kmean.cluster_centers_[item, 1]
    lisEngCent.append(eng)
    math = kmean.cluster_centers_[item, 2]
    lisMathCent.append(math)
# cluster_centers_ is a 2-D ndarray that lists the center of every cluster:
# kmean.cluster_centers_
# array([[76.26666667, 83.13333333, 75.13333333],
#        [64.93333333, 59.26666667, 56.8       ]])
# lisChiCent, lisEngCent and lisMathCent each have length 2.
# Because kmean.cluster_centers_ is an ndarray, the for loop is not
# actually needed: slicing can extract the coordinates of the two centroids
# (lisChiCent, lisEngCent, lisMathCent).
# lisChi_1 = [] ; lisChi_2 = []
# lisEng_1 = [] ; lisEng_2 = []
# lisMath_1 = [] ; lisMath_2 = []
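groupIDary = XtrainNew["groupID"].values
# Positions where groupID is nonzero, i.e. the members of group 1
lisTrueIdx = groupIDary.nonzero()[0].tolist()
# The same lisTrueIdx can also be obtained by comparing against the group number,
# which still works when groupID takes three or more values, for example:
# lisTrueIdx = (groupIDary == 1).nonzero()[0].tolist()
# Reset the index to 0..n-1 so positions and index labels agree
XtrainNew.index = range(len(XtrainNew))
XtrainNew1 = XtrainNew.iloc[lisTrueIdx, :]
XtrainNew0 = XtrainNew.drop(lisTrueIdx, axis=0)
# Split XtrainNew by groupID into
# XtrainNew1 (group 1) and XtrainNew0 (group 0)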
lisChi0 = XtrainNew0["chinese"].tolist()
lisChi1 = XtrainNew1["chinese"].tolist()
lisEng0 = XtrainNew0["english"].tolist()
lisEng1 = XtrainNew1["english"].tolist()
lisMath0 = XtrainNew0["math"].tolist()
lisMath1 = XtrainNew1["math"].tolist()
fig = plt.figure()
ax = plt.axes(projection="3d")
ax.scatter(lisChi0, lisEng0, lisMath0,
           label="student cluster0", color="b", marker="^")
ax.scatter(lisChi1, lisEng1, lisMath1,
           label="student cluster1", color="g", marker="*")
ax.scatter(lisChiCent, lisEngCent, lisMathCent,
           label="cluster center", color="r", marker="o")
# Name the axes and draw the legend so the labels above are displayed
ax.set_xlabel("chinese"); ax.set_ylabel("english"); ax.set_zlabel("math")
ax.legend()
plt.show()
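The comments in the listing mention two simplifications: centroid coordinates can be taken by slicing instead of a for loop, and the split by groupID should also cope with three or more groups. One possible sketch of both follows. The arrays centers and labeled are hypothetical stand-ins for kmean.cluster_centers_ and XtrainNew, with three groups chosen on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: three centroids and five labeled samples.
centers = np.array([
    [76.0, 83.0, 75.0],
    [65.0, 59.0, 57.0],
    [50.0, 45.0, 90.0],
])
labeled = pd.DataFrame({
    "chinese": [77, 73, 69, 55, 67],
    "english": [89, 40, 57, 55, 40],
    "math": [63, 60, 50, 55, 89],
    "groupID": [0, 1, 1, 1, 2],
})

# Column slices replace the for loop: one array per subject, one entry per centroid.
chi_cent, eng_cent, math_cent = centers[:, 0], centers[:, 1], centers[:, 2]

# groupby splits the samples for any number of groups, not just 0 and 1.
groups = {gid: part.drop(columns="groupID")
          for gid, part in labeled.groupby("groupID")}

# Row positions of one particular group, generalizing the nonzero() trick.
idx_group1 = np.flatnonzero(labeled["groupID"].to_numpy() == 1).tolist()

print(chi_cent)
print({gid: len(part) for gid, part in groups.items()})
print(idx_group1)
```

With the groups in a dictionary, the three ax.scatter calls for the samples could become a single loop over groups.items(), which avoids adding a new pair of lists for every extra cluster.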
Reference solution:
3D scatter plot:
Visualization: the students colored by cluster:
Predicting with the remaining validation samples:
Continuing from the code above:
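label_test = kmean.predict(Xtest)
# array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0])
# shape = (15,)
# The training samples used kmean.labels_;
# the test samples use kmean.predict(Xtest).
print("kmean.predict(Xtest):\n", label_test)
dfXtest = pd.DataFrame(data=Xtest,
                       columns=df.columns.tolist())
dfXtest.insert(loc=dfXtest.columns.size,
               column="groupid",
               value=label_test)
print("The updated df Xtest:\n", dfXtest)
# score() is not a precision; see the notes below
print("The score of test sample =",
      kmean.score(Xtest))
# .predict() is available here as well, but unlike the supervised
# learning examples earlier, the predictions of an unsupervised
# clustering algorithm have no ground truth to compare against:
# there is no standard answer.
# score() returns the negative sum of squared distances from each
# sample to its nearest centroid; the closer to 0, the better.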
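What predict() and score() compute on held-out samples can be checked by hand. The sketch below again uses synthetic scores in place of the CSV (an assumption), with hypothetical names X_train, X_test and neg_sse.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for Xtrain (30 rows) and Xtest (15 rows).
rng = np.random.default_rng(2)
X_train = rng.uniform(40, 100, size=(30, 3))
X_test = rng.uniform(40, 100, size=(15, 3))

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_train)

# transform() gives distances from the test samples to the TRAINED centroids.
dist_test = km.transform(X_test)
pred = km.predict(X_test)

# predict() picks the nearest centroid; score() is minus the sum of squared
# distances to that nearest centroid.
neg_sse = -(dist_test.min(axis=1) ** 2).sum()

print(np.array_equal(pred, dist_test.argmin(axis=1)))
print(np.isclose(km.score(X_test), neg_sse))
```

Because the score is a sum over samples, it grows in magnitude with the number of samples, so scores of sets with different sizes are not directly comparable.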
Output:
Other clustering algorithms
https://scikit-learn.org/stable/modules/clustering.html