
Python unsupervised machine learning: distance-based clustering (the k-Means algorithm); using scikit-learn; clustering students; from sklearn.cluster import KMeans

The k-Means algorithm

Using scikit-learn

Important KMeans constructor parameters

Important KMeans member variables (fitted attributes):

Important KMeans instance methods:
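The three headings above (constructor parameters, fitted member variables, instance methods) can be illustrated with a minimal sketch on toy data, independent of the student dataset used below:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# constructor parameters
km = KMeans(n_clusters=2, random_state=42, n_init=10)
km.fit(X)

# member variables, set after fitting
print(km.cluster_centers_)   # centroid coordinates, shape (2, 2)
print(km.labels_)            # cluster ID of each training sample, shape (6,)
print(km.inertia_)           # sum of squared distances to nearest centroid
print(km.n_iter_)            # iterations run until the clusters stabilized

# instance methods
print(km.predict([[0, 0]]))  # assign new samples to the nearest centroid
```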

https://pse.is/3wuxml

Sample of the test data:

chinese english math
77 89 63
73 40 60
69 57 50
85 67 60
55 55 55
80 84 83
80 70 70
60 61 60
60 80 70
75 91 53
62 62 67
66 75 75
67 40 89
72 60 42
74 62 67
78 86 85
70 63 60
78 80 69
82 82 78

In Excel the data spans rows 1–46, with the first row as the header row.

df.shape = (45, 3)

 

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

fpath = r"C:\Python\P107\doc\student_grades_real.csv"
df = pd.read_csv(fpath)
#print(df) ; df.shape = (45, 3)

lisChiCent = [] ; lisEngCent = [] ; lisMathCent= []

Xtrain, Xtest = train_test_split(
    df, test_size=0.33,
    random_state=42, shuffle=True)
# Only df is passed in, so only Xtrain and Xtest come out.
# Earlier (supervised) examples passed in both X and y,
# which yields four outputs: Xtrain, Xtest, ytrain, ytest.
print("Xtrain:", type(Xtrain), Xtrain.shape)
# <class 'pandas.core.frame.DataFrame'> (30, 3)
print("Xtest:", type(Xtest), Xtest.shape)
# <class 'pandas.core.frame.DataFrame'> (15, 3)
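As an aside, the two calling conventions mentioned in the comments above can be sketched with small toy arrays (not the student data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(12).reshape(6, 2)
y = np.arange(6)

# one input -> two outputs
a, b = train_test_split(X, test_size=0.33, random_state=42)

# two inputs -> four outputs, split in parallel
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33,
                                      random_state=42)
print(a.shape, b.shape)      # (4, 2) (2, 2)
print(Xtr.shape, yte.shape)  # (4, 2) (2,)
```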

kmean = KMeans(n_clusters=2, random_state=42)
distMatrix = kmean.fit_transform(Xtrain)
"""
distMatrix.shape = (30, 2); numpy.ndarray
Xtrain has 30 rows, split into 2 clusters.
fit_transform() fits the model and also returns the
distance matrix: each row describes one sample's
distance to every cluster centroid.
"""
cluster_cent = pd.DataFrame(
    kmean.cluster_centers_,
    columns=df.columns.tolist())
"""
kmean.cluster_centers_ =
array([[76.26666667, 83.13333333, 75.13333333],
       [64.93333333, 59.26666667, 56.8       ]])
2 groups; each group's centroid has
x, y, z coordinates (chinese, english, math).
"""
print("cluster center:\n", cluster_cent)
print("The distance:", kmean.inertia_)
print("Totally", kmean.n_iter_,
      "iterations executed for finding the stable cluster")
print("The distance matrix from raw data to cluster:\n",
      pd.DataFrame(distMatrix,
                   columns=["to cluster#0", "to cluster#1"]))
XtrainNew = pd.DataFrame(Xtrain,
                         columns=df.columns.tolist())
# columns = ['chinese', 'english', 'math']
XtrainNew.insert(loc=df.columns.size,
                 column="groupID",
                 value=kmean.labels_)
print("The updated Xtrain:\n", XtrainNew)

# kmean.labels_ =
# array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
#        1, 0, 0, 0, 1, 1, 1, 0])
# ndarray, shape = (30,)

Partial output
(the distMatrix DataFrame is truncated):

distMatrix.shape = (30, 2)

distMatrix = kmean.fit_transform(Xtrain)

For example, student #16 sits 21.8 from the cluster-0 centroid and 15 from the cluster-1 centroid. The separation is not dramatic, but mathematically the smaller distance wins, so the student is assigned to the nearer cluster 1.
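That assignment rule ("pick the column with the smaller distance") is exactly what labels_ encodes; a self-contained sketch on toy data confirms it:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.0, 4.0], [10.0, 2.0], [10.0, 4.0]])
km = KMeans(n_clusters=2, random_state=42, n_init=10)
dist = km.fit_transform(X)       # (4, 2): each sample's distance to each centroid

nearest = dist.argmin(axis=1)    # column index of the smallest distance per row
print((nearest == km.labels_).all())  # True: labels_ is exactly that argmin
```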

Reference answer:

Recommended hahow online Python course: https://igrape.net/30afN

Continuing from the code above:

lisChiCent = [] ; lisEngCent = [] ; lisMathCent = []
for item in range(2):
    chi = kmean.cluster_centers_[item, 0]
    lisChiCent.append(chi)
    eng = kmean.cluster_centers_[item, 1]
    lisEngCent.append(eng)
    math = kmean.cluster_centers_[item, 2]
    lisMathCent.append(math)
"""
cluster_centers_ is a 2-D ndarray listing every cluster's centroid:
kmean.cluster_centers_ =
array([[76.26666667, 83.13333333, 75.13333333],
       [64.93333333, 59.26666667, 56.8       ]])
lisChiCent, lisEngCent, lisMathCent each have length 2.

Since cluster_centers_ is an ndarray, the for loop is not
actually needed: slicing can extract the two centroids'
coordinates (lisChiCent, lisEngCent, lisMathCent) directly.
"""

# lisChi_1 = [] ; lisChi_2 = []
# lisEng_1 = [] ; lisEng_2 = []
# lisMath_1 = [] ; lisMath_2 = []

groupIDary = XtrainNew["groupID"].values

lisTrueIdx = groupIDary.nonzero()[0].tolist()

"""
The method below produces the same split as lisTrueIdx,
and also comes in handy when groupID has three or more values:
"""
XtrainNew.index = range(len(XtrainNew))
XtrainNew1 = XtrainNew.iloc[lisTrueIdx, :]
XtrainNew0 = XtrainNew.drop(lisTrueIdx, axis=0)
# Split XtrainNew by groupID into
# XtrainNew1 (group 1) and XtrainNew0 (group 0)
lisChi0 = XtrainNew0["chinese"].tolist()
lisChi1 = XtrainNew1["chinese"].tolist()

lisEng0 = XtrainNew0["english"].tolist()
lisEng1 = XtrainNew1["english"].tolist()

lisMath0 = XtrainNew0["math"].tolist()
lisMath1 = XtrainNew1["math"].tolist()

fig = plt.figure()
ax = plt.axes(projection="3d")
ax.scatter(lisChi0, lisEng0, lisMath0,
           label="student cluster0", color="b", marker="^")
ax.scatter(lisChi1, lisEng1, lisMath1,
           label="student cluster1", color="g", marker="*")
ax.scatter(lisChiCent, lisEngCent, lisMathCent,
           label="cluster center", color="r", marker="o")
ax.legend()
plt.show()
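The index-based split above only handles two groups cleanly; a hypothetical groupby sketch (toy frame mimicking XtrainNew) scales to any number of clusters:

```python
import pandas as pd

# hypothetical frame: scores plus a groupID column, as in XtrainNew
df = pd.DataFrame({"chinese": [77, 60, 80, 55],
                   "english": [89, 61, 84, 55],
                   "math":    [63, 60, 83, 55],
                   "groupID": [0, 1, 0, 1]})

# groupby splits by cluster ID without building index lists by hand
parts = {gid: sub for gid, sub in df.groupby("groupID")}
print(parts[0]["chinese"].tolist())  # [77, 80]
print(parts[1]["chinese"].tolist())  # [60, 55]
```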

Reference answer:

3D scatter plot:

Visualization: the student clusters shown in color:

Predicting with the remaining validation samples:

Continuing from the code above:

label_test = kmean.predict(Xtest)

# array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0])
# shape = (15,)
# For the training samples we used kmean.labels_;
# for the test samples we use kmean.predict(Xtest).

print("kmean.predict(Xtest):\n", label_test)

dfXtest = pd.DataFrame(data=Xtest,
                       columns=df.columns.tolist())
dfXtest.insert(loc=dfXtest.columns.size,
               column="groupid",
               value=label_test)
print("The updated df Xtest:\n", dfXtest)
print("The score of the test sample =",
      kmean.score(Xtest))

"""
Although .predict() is available just as in the earlier
supervised-learning examples, an unsupervised clustering
algorithm is different: its predictions have no ground truth
to compare against, so there is no "correct answer".
Instead, .score() evaluates samples by the negative sum of
squared distances to their nearest centroids;
the closer to 0, the better.
"""
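A small self-contained check of that relation (toy data; on the training set, scikit-learn's KMeans.score equals the negative of inertia_):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.0, 4.0], [10.0, 2.0], [10.0, 4.0]])
km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# score(X) = negative sum of squared distances to the nearest centroid
print(km.score(X), -km.inertia_)  # -4.0 -4.0
```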

Output:

 

Other clustering algorithms

https://scikit-learn.org/stable/modules/clustering.html
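The linked page covers many alternatives; as a hedged taste, two of them (AgglomerativeClustering and DBSCAN, both in sklearn.cluster) follow the same fit-then-read-labels_ pattern on toy data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

X = np.array([[1.0, 2.0], [1.0, 4.0], [10.0, 2.0], [10.0, 4.0]])

# hierarchical clustering: merge nearest groups until n_clusters remain
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(agg.labels_)

# density-based clustering: no n_clusters; -1 would mark noise points
db = DBSCAN(eps=3, min_samples=2).fit(X)
print(db.labels_)
```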


儲蓄保險王 (Savings Insurance King)
