Week 4 - Machine Learning Data Analysis

humpark 2024. 8. 25. 23:33

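1. Machine Learning Overview

1-1 What is machine learning?

Machine learning is the process by which a machine learns from data on its own and works out the relationships between different variables.

Depending on the problem to be solved, the algorithms are grouped into prediction, classification, clustering, and so on.

1-2 Supervised vs. unsupervised learning

Supervised learning: the answer data (labels) is fed to the algorithm together with the rest of the data.

Unsupervised learning: the algorithm finds hidden patterns by itself, without any answer data.

1-3 Machine learning process

Before running machine learning, the data must first be converted into a form the algorithm can understand.

The observations collected about the subject of analysis are organized by feature.

The training data is then fed into the model to train it, and the test (validation) data is used to check whether the model learned properly.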
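As a minimal sketch of this process, the block below uses synthetic data rather than any dataset from this post. The seed, the 70/30 split, and the straight-line model are assumptions chosen only for illustration: rows are observations, a model is fitted on the training rows, and R^2 is computed on rows the model has not seen.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable

# Observations as rows: one feature x and a target y (synthetic, illustrative)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

# Shuffle the row indices and hold out 30% of the rows for validation
idx = rng.permutation(len(x))
n_train = int(len(x) * 0.7)
train_idx, test_idx = idx[:n_train], idx[n_train:]

# "Training": fit a straight line to the training rows only
a, b = np.polyfit(x[train_idx], y[train_idx], deg=1)

# "Validation": R^2 on the held-out rows
y_pred = a * x[test_idx] + b
ss_res = np.sum((y[test_idx] - y_pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r_square = 1 - ss_res / ss_tot
print(len(train_idx), len(test_idx), round(r_square, 3))
```

The sklearn functions used later in this post (train_test_split, fit, score) wrap the same three stages.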

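2. Regression Analysis

Regression is mainly used to predict continuous variables, that is, variables that take continuous values such as price, sales, stock prices, exchange rates, and quantities.

The target that the model tries to predict is called the dependent variable or the predicted variable.

The features that the model uses to make the prediction are called independent variables or explanatory variables.

2-1 Simple linear regression

A representative supervised learning algorithm that looks for a probabilistic, statistical relationship between two variables that correspond one to one.

Mathematically, the relationship between the dependent variable Y and the independent variable X is written as the linear function Y = aX + b.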
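To make Y = aX + b concrete, here is a small sketch that computes a and b by ordinary least squares by hand. The four data points are made up for illustration, and the closed-form formulas (slope = covariance / variance, intercept from the means) are the standard ones for a single independent variable.

```python
# Ordinary least squares for Y = aX + b, written out by hand (toy data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: sum of cross-deviations divided by sum of squared x-deviations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
a = s_xy / s_xx

# Intercept: the fitted line passes through the point (mean_x, mean_y)
b = mean_y - a * mean_x

print('a =', round(a, 4), 'b =', round(b, 4))
```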

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: prepare the data
df= pd.read_csv('C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/auto-mpg.csv', header=None)

# Assign column names
df.columns= ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'name']

# Step 2: explore the data

df.info()   # info() prints its summary itself
print(df.describe())

The horsepower column is reported with the object (string) dtype, so it has to be converted to a numeric type.

# Convert the horsepower column from string to number
print(df['horsepower'].unique())    # check the unique values in horsepower
print('\n')	# the column became a string column because it contains '?', so replace that value

df['horsepower']= df['horsepower'].replace('?', np.nan) # replace '?' with np.nan
df.dropna(subset=['horsepower'], axis= 0, inplace=True) # drop the rows with missing values
df['horsepower']= df['horsepower'].astype('float')  # convert strings to floats

print(df.describe())
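As an alternative sketch of the same cleanup, pandas can parse the strings and mark the unparseable ones in one call with pd.to_numeric(errors='coerce'). The toy Series below is made up and only mimics the horsepower column. It is not the actual data.

```python
import pandas as pd

# Toy column mimicking 'horsepower': numbers stored as strings plus a '?' placeholder
s = pd.Series(['130.0', '165.0', '?', '150.0'], name='horsepower')

# errors='coerce' turns anything that cannot be parsed as a number into NaN
hp = pd.to_numeric(s, errors='coerce')

# Drop the entries that became NaN
hp_clean = hp.dropna()
print(hp.dtype, len(hp_clean))
```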

# Step 3: select features
# Select the columns (features) to use in the analysis: mpg, cylinders, horsepower, weight
ndf= df[['mpg', 'cylinders', 'horsepower', 'weight']]

print(ndf.head())

Choose the independent variable for simple linear regression from the three candidates. Because the goal is a one-to-one relationship between the dependent variable Y and the independent variable X, first draw graphs to check whether the two variables are linearly related.

Start with a scatter plot by calling the DataFrame's plot() method (which draws with Matplotlib) with kind='scatter'. Put the weight column on the x-axis and the mpg column on the y-axis and look at how the two variables are related.

# Check the linear relationship between the dependent variable y (mpg) and the other variables with graphs
# Scatter plot with Matplotlib
ndf.plot(kind= 'scatter', x= 'weight', y='mpg', c='coral', s=10, figsize=(10,5))
plt.show()
plt.close()

# Scatter plots with seaborn
fig= plt.figure(figsize= (10,5))
ax1= fig.add_subplot(1, 2, 1)
ax2= fig.add_subplot(1, 2, 2)
sns.regplot(x='weight', y='mpg',data= ndf, ax= ax1)  # with regression line
sns.regplot(x='weight', y='mpg',data= ndf, ax= ax2, fit_reg= False)  # without regression line

plt.show()
plt.close()

# seaborn joint plots: scatter plot plus histograms
sns.jointplot(x='weight', y= 'mpg', data= ndf)  # no regression line
sns.jointplot(x='weight', y= 'mpg', kind= 'reg', data= ndf)  # with regression line

plt.show()
plt.close()

# Use seaborn pairplot to draw every pairwise combination of the variables
grid_ndf= sns.pairplot(ndf)
plt.show()
plt.close()

Scatter plot with histograms, drawn with seaborn jointplot

# Step 4: split the dataset
# Select variables: this example uses the 'weight' column as the independent variable X
X = ndf[['weight']] # DataFrame
print(type(X))
y = ndf['mpg']      # Series

# Split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, # independent variable
                                                    y, # dependent variable
                                                    test_size=0.3, # 30% for testing
                                                    random_state=10) # random seed
print(len(X_train))
print(len(X_test))

# Output
<class 'pandas.core.frame.DataFrame'>
274
118

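# Step 5: build the simple linear regression model with sklearn

# Import the linear regression class from sklearn
from sklearn.linear_model import LinearRegression

# Create the simple linear regression model object
lr= LinearRegression()

# Train the model on the training data
lr.fit(X_train, y_train)

# Apply the trained model to the test data and compute the coefficient of determination (R squared)
r_square= lr.score(X_test, y_test)
print(r_square)

# Output
0.6822458558299325

# Slope of the regression line
print('slope a: ', lr.coef_)
print('\n')

# y-intercept of the regression line
print('y-intercept b: ', lr.intercept_)

# Output
slope a:  [-0.00775343]


y-intercept b:  46.7103662572801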
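As a worked example, the fitted line can be evaluated by hand with the slope and intercept printed above. The three weights are hypothetical inputs chosen for illustration, and predict_mpg is a helper name introduced here, not something from sklearn.

```python
# Coefficients copied from the output above
a = -0.00775343        # slope: change in mpg per unit of weight
b = 46.7103662572801   # y-intercept

def predict_mpg(weight):
    """Evaluate the fitted line Y = aX + b for a given weight."""
    return a * weight + b

for w in (2000, 3000, 4000):   # hypothetical weights
    print(w, round(predict_mpg(w), 2))
```

The negative slope means that the predicted mpg falls as weight rises, which matches the downward trend in the scatter plots.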

# Feed the full X data to the model and compare the predictions y_hat with the actual values y
y_hat= lr.predict(X)

plt.figure(figsize= (10,5))
ax1= sns.kdeplot(y, label= "y")
ax2= sns.kdeplot(y_hat, label= "y_hat", ax=ax1)
plt.legend()
plt.show()

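2-2 Polynomial regression

Polynomial regression uses a polynomial of degree two or higher to describe a curved relationship between two variables that a straight line fits poorly. The model is still linear in its coefficients.

A second-degree model is written as Y = aX^2 + bX + c.

# Select the columns (features) to use in the analysis
ndf = df[['mpg','cylinders','horsepower','weight']]

# Select variables: this example uses the 'weight' column as the independent variable X
X = ndf[['weight']] # DataFrame
print(type(X))
y = ndf['mpg']      # Series

# Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, # independent variable (X as defined above)
                                                    y, # dependent variable
                                                    test_size=0.3, # 30% for testing
                                                    random_state=10) # random seed

Training and validating the model

Passing the x_train data to the fit_transform() method converts it into the form needed for second-degree polynomial regression.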
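To see what that transformation does, here is a small sketch on two made-up values. With degree=2 and the default settings, PolynomialFeatures expands a single column x into three columns: a constant 1, x, and x^2. The names x_small and poly_demo are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_small = np.array([[2.0], [3.0]])   # two samples, one feature (toy values)

poly_demo = PolynomialFeatures(degree=2)
x_small_poly = poly_demo.fit_transform(x_small)

# Each row becomes [1, x, x^2]: a bias column, the original value, and its square
print(x_small_poly)
print(x_small.shape, '->', x_small_poly.shape)
```

A linear regression fitted on these three columns learns c, b, and a of Y = aX^2 + bX + c.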

# Step 5: polynomial (nonlinear) regression model with sklearn

# Import the required classes from sklearn
from sklearn.linear_model import LinearRegression   # linear regression
from sklearn.preprocessing import PolynomialFeatures    # polynomial transformation

# Polynomial transformation
poly= PolynomialFeatures(degree=2)  # degree 2
x_train_poly= poly.fit_transform(x_train)   # transform x_train into degree-2 features

print('original data: ', x_train.shape)
print('degree-2 transformed data: ', x_train_poly.shape)

# Output
original data:  (274, 1)
degree-2 transformed data:  (274, 3)

# Train the model on the training data
pr= LinearRegression()
pr.fit(x_train_poly, y_train)

# Apply the trained model to the test data and compute the coefficient of determination (R^2)
x_test_poly= poly.transform(x_test) # reuse the transformer already fitted on x_train
r_square= pr.score(x_test_poly, y_test)
print(r_square)
# Output
0.708700926297548
Plot the distribution of the training data together with the regression curve learned by the model and compare them. Passing the degree-2 transformed test data (x_test_poly) to predict() gives the predictions y_hat_test. Drawn as red '+' markers, they trace the regression curve.

# Plot the scatter of the training data and the regression curve predicted from the test data
y_hat_test= pr.predict(x_test_poly)

fig= plt.figure(figsize=(10,5))
ax= fig.add_subplot(1,1,1)
ax.plot(x_train, y_train, 'o', label='Train Data')  # distribution of the data
ax.plot(x_test, y_hat_test, 'r+', label='Predicted Value')  # regression curve learned by the model
ax.legend(loc='best')
plt.xlabel('weight')
plt.ylabel('mpg')
plt.show()
plt.close()

# Feed the full X data to the model and compare the predictions y_hat with the actual values y
X_poly = poly.transform(X)
y_hat = pr.predict(X_poly)

plt.figure(figsize=(10, 5))
ax1 = sns.kdeplot(y, label="y")
ax2 = sns.kdeplot(y_hat, label="y_hat", ax=ax1)
plt.legend()
plt.show()

The fit is better than with simple linear regression: R^2 on the test data rises from about 0.68 to about 0.71.

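2-3 Multiple regression

Multiple regression is used when several independent variables affect the dependent variable. Since answer data is available, it is supervised learning.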
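With several independent variables the model takes the form Y = b + a1*X1 + a2*X2 + ... The sketch below, on noise-free synthetic data with made-up coefficients, shows that a least-squares fit recovers b, a1, and a2. It uses NumPy's lstsq rather than sklearn, which is a choice made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))                          # two independent variables X1, X2
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]     # noise-free target (made-up coefficients)

# Add a column of ones so the intercept b is estimated together with the slopes
A = np.column_stack([np.ones(len(X_demo)), X_demo])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
b, a1, a2 = coef

print(np.round(coef, 6))
```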

# Select variables
x = ndf[['cylinders', 'horsepower', 'weight']] # independent variables x1, x2, x3
y = ndf['mpg']      # dependent variable y

# Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, # independent variables
                                                    y, # dependent variable
                                                    test_size=0.3, # 30% for testing
                                                    random_state=10) # random seed
print('training data: ', x_train.shape)
print('test data: ', x_test.shape)

# Step 5: build the multiple regression model with sklearn

# Import the linear regression class from sklearn
from sklearn.linear_model import LinearRegression

# Create the regression model object
lr= LinearRegression()

# Train the model on the training data
lr.fit(x_train, y_train)

# Apply the trained model to the test data and compute the coefficient of determination (R squared)
r_square= lr.score(x_test, y_test)
print(r_square)

# Coefficients of the regression equation (one per independent variable)
print('coefficients a: ', lr.coef_)
print('\n')

# y-intercept of the regression equation
print('y-intercept b: ', lr.intercept_)

# Compare the distribution of the test-data predictions y_hat with the actual values y_test

y_hat = lr.predict(x_test)

plt.figure(figsize=(10, 5))
ax1 = sns.kdeplot(y_test, label="y_test")
ax2 = sns.kdeplot(y_hat, label="y_hat", ax=ax1)
plt.legend()
plt.show()

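3. Classification

A classification algorithm takes the features of the item to be predicted as input and assigns it to one of the category values of the target variable. For example: 'an oval fruit that weighs at least 5 kg and is 60 cm across is a watermelon.'

3-1 kNN

kNN stands for k-Nearest-Neighbors. It classifies an observation by finding its k nearest neighbors. Because it needs labeled training data, it is a supervised learning method.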
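The idea can be sketched in a few lines of NumPy: measure the distance from the new point to every training point, take the k closest, and let them vote. The toy points, the Euclidean distance, and the helper name knn_predict are assumptions made for this illustration. sklearn's KNeighborsClassifier is a separate, more complete implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Two small groups of labeled points (toy data)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))
```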

import pandas as pd
import seaborn as sns

# Step 1: prepare the data
# Load the dataset as a DataFrame with load_dataset
df= sns.load_dataset('titanic')

# Step 2: explore and preprocess the data
# print(df.info()) -> drop columns with overlapping meaning and columns with many missing values

# Drop the deck column (many NaN values) and the embark_town column (duplicates embarked)
rdf= df.drop(['deck', 'embark_town'], axis= 1)

# Drop every row with no age value (177 of the 891 age values are NaN)
rdf= rdf.dropna(subset=['age'], how= 'any', axis=0)

# Replace NaN in embarked with the most frequent port of embarkation
most_freq= rdf['embarked'].value_counts(dropna=True).idxmax()

rdf['embarked']= rdf['embarked'].fillna(most_freq)

# Step 3: select the features to use in the analysis
ndf= rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']]

# One-hot encoding: convert categorical data to numbers the model can use
onehot_sex= pd.get_dummies(ndf['sex'])
ndf= pd.concat([ndf, onehot_sex], axis=1)
onehot_embarked= pd.get_dummies(ndf['embarked'], prefix='town')
ndf= pd.concat([ndf, onehot_embarked], axis=1)

ndf.drop(['sex', 'embarked'], axis=1, inplace=True)


# Step 4: split the dataset into training and test sets

# Select variables
x=ndf[['pclass', 'age', 'sibsp', 'parch', 'female', 'male', 'town_C', 'town_Q', 'town_S']]  # explanatory variables x
y=ndf['survived']   # target variable y

# Standardize the explanatory variables (StandardScaler: zero mean, unit variance)
from sklearn import preprocessing
x= preprocessing.StandardScaler().fit(x).transform(x)

# Split into train data and test data (7:3)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.3, random_state=10)

# Step 5: kNN classification model with sklearn
from sklearn.neighbors import KNeighborsClassifier

# Create the model object (k=5)
knn= KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn.fit(x_train, y_train)

# Predict y_hat for the test data
y_hat= knn.predict(x_test)

# Evaluate the model: confusion matrix
from sklearn import metrics
knn_matrix= metrics.confusion_matrix(y_test, y_hat)

# Evaluate the model: evaluation metrics
knn_report = metrics.classification_report(y_test, y_hat)
print(knn_report)
print('\n')
print(knn_matrix)

# Output
              precision    recall  f1-score   support

           0       0.81      0.88      0.84       125
           1       0.81      0.71      0.76        90

    accuracy                           0.81       215
   macro avg       0.81      0.80      0.80       215
weighted avg       0.81      0.81      0.81       215



[[110  15]
 [ 26  64]]
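The figures in the report can be recomputed from the confusion matrix, which makes it easier to see what each metric means. This is a small worked example using the four counts printed above. In sklearn's confusion_matrix the rows are the actual classes and the columns are the predicted classes.

```python
# Counts copied from the confusion matrix above
tn, fp = 110, 15   # actual class 0: predicted 0, predicted 1
fn, tp = 26, 64    # actual class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)     # share of all predictions that are correct
precision_1 = tp / (tp + fp)                   # of those predicted 1, how many are really 1
recall_1 = tp / (tp + fn)                      # of those really 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

These values agree with the accuracy row and the class 1 row of the classification report.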

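3-2 SVM

Each column of a DataFrame is a column vector, and these column vectors span a vector space in which each one has its own axis. Each observation is placed in that space by using its value for every feature (column vector) as the coordinate on the corresponding axis.

An SVM (support vector machine) model learns from the coordinates of the training data in this vector space together with the correct class of each point.

# The steps up to data cleaning are the same as above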

# Step 5: import the SVM classification model
from sklearn import svm

# Create the model object (kernel='rbf')
svm_model= svm.SVC(kernel='rbf')

# Train the model on the training data
svm_model.fit(x_train, y_train)

# Predict y_hat for the test data
y_hat= svm_model.predict(x_test)

# Evaluate the model: confusion matrix
from sklearn import metrics
svm_matrix= metrics.confusion_matrix(y_test, y_hat)

# Evaluate the model: evaluation metrics (print the SVM results, not the kNN variables from 3-1)
svm_report = metrics.classification_report(y_test, y_hat)
print(svm_report)
print('\n')
print(svm_matrix)

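3-3 Decision Tree

A decision tree uses the tree structure common in computer algorithms, with the features (explanatory variables) of the data placed at the branch nodes. At each node, the feature that best separates the target values is chosen, and new branches are created from the values of that feature.

To pick the best feature at a node, the algorithm measures how well the classes are separated when the data is split on that feature. Entropy, which measures how mixed the different classes are, is commonly used for this. The lower the entropy, the better the separation.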
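A short sketch of the entropy measure, assuming the usual Shannon entropy with base-2 logarithms. The label lists are made up, with 2 and 4 used as class values only to echo the benign/malignant codes in the example that follows.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy([2, 2, 2, 2]))   # a single class: nothing is mixed
print(entropy([2, 2, 4, 4]))   # two classes in equal parts: maximally mixed
print(entropy([2, 2, 2, 4]))   # mostly one class: somewhere in between
```

A split is good when the groups it produces have lower entropy than the group that was split.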

import pandas as pd
import seaborn as sns
import numpy as np  # needed for np.nan below
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree

# Step 1: prepare the data
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
breast-cancer-wisconsin/breast-cancer-wisconsin.data'
df = pd.read_csv(uci_path, header=None)

# Assign column names
df.columns = ['id', 'clump', 'cell_size', 'cell_shape', 'adhesion', 'epithlial',
              'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses', 'class']

# Step 2: explore and preprocess the data
df['bare_nuclei'] = df['bare_nuclei'].replace('?', np.nan)   # '?' marks missing values
df.dropna(subset=['bare_nuclei'], axis=0, inplace=True)      # drop the rows with missing values
df['bare_nuclei'] = df['bare_nuclei'].astype('int')          # convert strings to integers

# Step 3: split the dataset into training and test sets
# x holds the explanatory variables, y the target variable
x = df[['clump', 'cell_size', 'cell_shape', 'adhesion', 'epithlial',
        'bare_nuclei', 'chromatin', 'normal_nucleoli', 'mitoses']]
y = df['class']

# Standardize the explanatory variables
x = preprocessing.StandardScaler().fit(x).transform(x)

# Split into train data and test data (7:3)
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.3, 
                                                    random_state=10)

# Step 4: Decision Tree classification model with sklearn
# Create the model object (criterion='entropy')
tree_model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5)

# Train the model on the training data
tree_model.fit(x_train, y_train)

# Predict y_hat for the test data
y_hat = tree_model.predict(x_test)  # 2: benign, 4: malignant

# Evaluate the model: evaluation metrics
tree_report = metrics.classification_report(y_test, y_hat)
tree_matrix = metrics.confusion_matrix(y_test, y_hat)

print(tree_report)
print('\n')
print(tree_matrix)

# Output
              precision    recall  f1-score   support

           2       0.98      0.97      0.98       131
           4       0.95      0.97      0.96        74

    accuracy                           0.97       205
   macro avg       0.97      0.97      0.97       205
weighted avg       0.97      0.97      0.97       205



[[127   4]
 [  2  72]]

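4. Clustering

Cluster analysis examines the features of the observations in a dataset and groups observations with similar characteristics together. Some observations may not belong to any cluster.

Clustering is unsupervised learning. By separating groups with similar characteristics, it can be used to predict things such as behavior.

Clustering algorithms are used for tasks such as credit card fraud detection and purchase pattern analysis, where consumer behavior is grouped.

4-1 k-Means

The k-Means algorithm measures similarity between data points by their distance to the center of each cluster. Given k clusters, a point in the vector space is assigned to the cluster whose center is nearest. Model performance depends on the value of k. In general, a larger k improves how closely the model fits the data, but if k is too large there are too many groups and the analysis loses its value.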
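One round of the algorithm can be sketched directly in NumPy: assign each point to its nearest center, then move each center to the mean of its points. The four points and the two starting centers are made up for illustration. sklearn's KMeans repeats these two steps until the assignments stop changing and also picks the starting centers for you.

```python
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])   # k = 2 starting centers (arbitrary)

# Assignment step: distance from every point to every center, then pick the nearest center
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: move each center to the mean of the points assigned to it
new_centers = np.array([points[labels == j].mean(axis=0) for j in range(len(centers))])

print(labels)
print(new_centers)
```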

import pandas as pd
import matplotlib.pyplot as plt

# Step 1: prepare the data
# Load the Wholesale customers dataset (source: UCI ML Repository)
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(uci_path, header=0)

# Step 2: explore the data
# Use print(df.head()), df.info(), and df.describe() to check that each column has the expected dtype (numeric, string, and so on).
# If a numeric column has turned into a string column because of '?' values, confirm it by checking the unique values with unique().
# Replace '?' with np.nan, then drop every row that contains it with dropna().

# Step 3: preprocess the data
# Select the features to use in the analysis (here, all columns)
x= df.iloc[:, :] 

# Standardize the explanatory variables
from sklearn import preprocessing
x= preprocessing.StandardScaler().fit(x).transform(x)

# Step 4: k-Means clustering model with sklearn
from sklearn import cluster

# Create the model object
kmeans= cluster.KMeans(init= 'k-means++', n_clusters=5, n_init=10)

# Train the model
kmeans.fit(x)

# Cluster assignments
cluster_label= kmeans.labels_

# Add the cluster labels to the DataFrame
df['Cluster']= cluster_label

# Visualize with a graph
df.plot(kind= 'scatter', x='Grocery', y='Frozen', c='Cluster', cmap='Set1', colorbar=True, figsize=(10,10))
plt.show()
plt.close()

Remove the outliers and plot again

# Exclude the clusters made up of large values (0 and 4 in this run) to look more closely at the dense region
# Cluster numbers can differ between runs because no random_state is set, so check the plot above first
mask= (df['Cluster']==0) | (df['Cluster']==4)
ndf= df[~mask]

ndf.plot(kind='scatter', x='Grocery', y='Frozen', c='Cluster', cmap='Set1', colorbar=False, figsize=(10,10))
ndf.plot(kind='scatter', x='Milk', y='Delicassen', c='Cluster', cmap='Set1', colorbar=True, figsize=(10,10))
plt.show()
plt.close()

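4-2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) separates clusters by how densely the data points are packed in space. A point that has at least M points within radius R of itself is called a core point.

A point that is not a core point but has a core point within radius R is called a border point.

A point that is neither a core point nor a border point is called noise (or an outlier).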
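The three kinds of points can be read off a fitted sklearn DBSCAN model, as in this sketch on five made-up points. Here eps plays the role of the radius R and min_samples the role of M. In sklearn the count includes the point itself, which is an implementation detail to keep in mind when comparing with the definition above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Four points in a row, 0.5 apart, plus one far-away point (toy data)
pts = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [1.5, 0.0], [10.0, 0.0]])

db = DBSCAN(eps=0.6, min_samples=3).fit(pts)

is_core = np.zeros(len(pts), dtype=bool)
is_core[db.core_sample_indices_] = True     # points with at least 3 points within 0.6
is_noise = db.labels_ == -1                 # label -1 marks noise
is_border = ~is_core & ~is_noise            # in a cluster, but not a core point

for p, c, b in zip(pts, is_core, is_border):
    kind = 'core' if c else 'border' if b else 'noise'
    print(p, kind)
```

The two middle points are core points, the two end points of the row are border points, and the distant point is noise.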

import pandas as pd
import folium

# Step 1: prepare the data

# Dataset of advancement rates for middle schools in Seoul
file_path = 'C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/middle_shcool_graduates_report.xlsx'
df = pd.read_excel(file_path)

# Step 2: explore the data
# Use print(df.head()), df.info(), and df.describe() to check that each column has the expected dtype (numeric, string, and so on).
# If a numeric column has turned into a string column because of '?' values, confirm it by checking the unique values with unique().
# Replace '?' with np.nan, then drop every row that contains it with dropna().

# Check for missing data
df.isnull().sum().sum()

# Check for duplicated data
df.duplicated().sum()

# Show the locations on a map. The Stamen Terrain tiles require separate authentication, so the OpenTopoMap tiles are used instead.
# https://leaflet-extras.github.io/leaflet-providers/preview/

attr = (
    'Map data: &copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors, <a href="http://viewfinderpanoramas.org">SRTM</a> | Map style: &copy; <a href="https://opentopomap.org">OpenTopoMap</a> (<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA</a>)'
)

tiles = 'https://{s}.tile.opentopomap.org/{z}/{x}/{y}.png'

mschool_map = folium.Map(location=[37.55,126.98], tiles=tiles, attr=attr, 
                         zoom_start=12)

# Mark each middle school's location with a CircleMarker
# Columns: 학교명 = school name, 위도 = latitude, 경도 = longitude
for name, lat, lng in zip(df['학교명'], df['위도'], df['경도']):
    folium.CircleMarker([lat, lng],
                        radius=5,              # circle radius
                        color='brown',         # outline color
                        fill=True,
                        fill_color='coral',    # fill color
                        fill_opacity=0.7,      # opacity
                        popup=name
    ).add_to(mschool_map)
    
mschool_map    

# Save the map as an HTML file
mschool_map.save('C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/seoul_mschool_location.html')

# Step 3: preprocess the data
# Apply one-hot encoding (지역 = region, 코드 = code, 유형 = school type, 주야 = day/night)
df_encoded = pd.get_dummies(df, columns=['지역', '코드', '유형', '주야'])

df_encoded.head()


# Step 4: DBSCAN clustering model with sklearn
from sklearn import cluster
from sklearn import preprocessing  

# Select the features to use in the analysis
# (과학고 = science high school, 외고_국제고 = foreign language/international high school,
#  자사고 = autonomous private high school, 자공고 = autonomous public high school,
#  유형_공립/국립/사립 = public/national/private school type)
train_features = ['과학고', '외고_국제고', '자사고', '자공고', 
                  '유형_공립', '유형_국립', '유형_사립',]
x = df_encoded.loc[:, train_features]

# Standardize the explanatory variables
x = preprocessing.StandardScaler().fit_transform(x)

# Create the DBSCAN model object
dbm = cluster.DBSCAN(eps=0.2, min_samples=5)

# Train the model
dbm.fit(x)   
 
# Cluster assignments
cluster_label = dbm.labels_   

# Add the cluster labels to the DataFrame
df_encoded['Cluster'] = cluster_label
df_encoded.head()

# Group by cluster label and print each group (first 5 rows only)
grouped_cols = ['학교명', '과학고', '외고_국제고', '자사고',] 
grouped = df_encoded.groupby('Cluster')
for key, group in grouped:
    print('* key :', key)
    print('* number :', len(group))    
    print(group.loc[:, grouped_cols].head())
    print('\n')

# Visualize on a map ('firebrick' replaces 'brick', which is not a valid color name)
colors = {-1:'gray', 0:'coral', 1:'blue', 2:'green', 3:'red', 4:'purple', 
          5:'orange', 6:'brown', 7:'firebrick', 8:'yellow', 9:'magenta', 10:'cyan', 11:'tan'}

cluster_map = folium.Map(location=[37.55,126.98], tiles=tiles, attr=attr, 
                         zoom_start=12)

for name, lat, lng, clus in zip(df_encoded['학교명'], df_encoded['위도'], 
                                df_encoded['경도'], df_encoded['Cluster']):   
    folium.CircleMarker([lat, lng],
                        radius=5,                   # circle radius
                        color=colors[clus],         # outline color
                        fill=True,
                        fill_color=colors[clus],    # fill color
                        fill_opacity=0.7,           # opacity
                        popup=name
    ).add_to(cluster_map)

cluster_map

# Save the map as an HTML file
cluster_map.save('C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/seoul_mschool_cluster.html')

# Repeat the process above for the X2 dataset (advancement rates to science, foreign language/international, and autonomous private high schools + school type)
columns_list2 = [9, 10, 13]
X2 = df.iloc[:, columns_list2]
print(X2[:5])
print('\n')

X2 = preprocessing.StandardScaler().fit(X2).transform(X2)
dbm2 = cluster.DBSCAN(eps=0.2, min_samples=5)
dbm2.fit(X2)  
df['Cluster2'] = dbm2.labels_   

grouped2_cols = [0, 1, 3] + columns_list2
grouped2 = df.groupby('Cluster2')
for key, group in grouped2:
    print('* key :', key)
    print('* number :', len(group))    
    print(group.iloc[:, grouped2_cols].head())
    print('\n')

# Reuse the OpenTopoMap tiles and attribution defined earlier (Stamen Terrain requires authentication)
cluster2_map = folium.Map(location=[37.55,126.98], tiles=tiles, attr=attr,
                        zoom_start=12)


for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster2):  
    folium.CircleMarker([lat, lng],
                        radius=5,                   # circle radius
                        color=colors[clus],         # outline color
                        fill=True,
                        fill_color=colors[clus],    # fill color
                        fill_opacity=0.7,           # opacity
                        popup=name
    ).add_to(cluster2_map)

# Save the map as an HTML file
cluster2_map.save('C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/seoul_mschool_cluster2.html')


# Repeat the process above for the X3 dataset (science high schools, foreign language/international high schools)
columns_list3 = [9, 10]
X3 = df.iloc[:, columns_list3]
print(X3[:5])
print('\n')

X3 = preprocessing.StandardScaler().fit(X3).transform(X3)
dbm3 = cluster.DBSCAN(eps=0.2, min_samples=5)
dbm3.fit(X3)  
df['Cluster3'] = dbm3.labels_   

grouped3_cols = [0, 1, 3] + columns_list3
grouped3 = df.groupby('Cluster3')
for key, group in grouped3:
    print('* key :', key)
    print('* number :', len(group))    
    print(group.iloc[:, grouped3_cols].head())
    print('\n')

# Reuse the OpenTopoMap tiles and attribution defined earlier
cluster3_map = folium.Map(location=[37.55,126.98], tiles=tiles, attr=attr,
                        zoom_start=12)

for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster3):  
    folium.CircleMarker([lat, lng],
                        radius=5,                   # circle radius
                        color=colors[clus],         # outline color
                        fill=True,
                        fill_color=colors[clus],    # fill color
                        fill_opacity=0.7,           # opacity
                        popup=name
    ).add_to(cluster3_map)

# Save the map as an HTML file
cluster3_map.save('C:/Users/sajog/Downloads/5674-980/pandas-data-analysis-main/part7/data/seoul_mschool_cluster3.html')
