Allen's Data Eatery (Allen's 데이터 맛집)

[Machine Learning] Classification: Pima Indians Diabetes Database


Allen93 2023. 9. 28. 22:55

About Dataset

Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Acknowledgements

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

Inspiration

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes?

 


Loading the data

# Exam environment setup (do not modify this code)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def exam_data_load(df, target, id_name="", null_name=""):
    if id_name == "":
        df = df.reset_index().rename(columns={"index": "id"})
        id_name = 'id'
    else:
        id_name = id_name
    
    if null_name != "":
        df[df == null_name] = np.nan
    
    X_train, X_test = train_test_split(df, test_size=0.2, random_state=2021)
    
    y_train = X_train[[id_name, target]]
    X_train = X_train.drop(columns=[target])

    
    y_test = X_test[[id_name, target]]
    X_test = X_test.drop(columns=[target])
    return X_train, X_test, y_train, y_test 
    
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
X_train, X_test, y_train, y_test = exam_data_load(df, target='Outcome')

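# Alias the splits; keep only the label column as the training target,
# since y_train also carries the id column
X = X_train
X_submission = X_test
Y = y_train['Outcome']

X_train.shape, X_test.shape, y_train.shape, y_test.shape

>>> ((614, 9), (154, 9), (614, 2), (154, 2))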

 

 

EDA

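# dfX is the combined feature frame (train + submission rows) used throughout the post
dfX = pd.concat([X, X_submission])

dfX.nunique()

dfX.describe()

dfX.isnull().sum()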
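Note that `isnull().sum()` is likely to report no missing values here, because this dataset is commonly understood to encode missing measurements as 0 rather than NaN (the preprocessing step below treats zeros in five columns that way). A minimal sketch of counting such zeros, using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for dfX; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a value of 0 is physiologically implausible.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column; isnull() alone reports nothing for these rows.
zero_counts = (toy[zero_as_missing] == 0).sum()
null_counts = toy.isnull().sum()

print(zero_counts)
print(null_counts)
```

Running the same zero count on the real feature frame shows how many rows each replacement in the preprocessing step will touch.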

 

Data preprocessing

# Combine train and submission features so both get the same preprocessing
dfX = pd.concat([X, X_submission])
dfX.shape

>>> (768, 9)

# Keep the submission ids for the result file, then drop the id column from
# the features (the 8-column shapes printed later imply this step)
test_id = X_submission.pop('id')
dfX = dfX.drop(columns=['id'])

# Handle outliers: replace zeros in Glucose, BloodPressure, SkinThickness,
# Insulin and BMI with the column mean
zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for col in zero_cols:
    dfX[col] = dfX[col].replace(0, dfX[col].mean())

dfX.describe()
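One caveat with the replacement above: each mean is computed while the zeros are still in the column, so the fill value is pulled toward zero. A hedged alternative, shown on a made-up column, is to mark zeros as NaN first so that `mean()` skips them. This is a variation for comparison, not what the original notebook does, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({"Insulin": [0.0, 100.0, 0.0, 200.0]})

# Mean with zeros included (the approach used in the post).
mean_with_zeros = toy["Insulin"].mean()

# Mark zeros as missing; pandas' mean() skips NaN by default.
as_nan = toy["Insulin"].replace(0, np.nan)
mean_without_zeros = as_nan.mean()

# Fill the missing entries with the mean of the observed values.
filled = as_nan.fillna(mean_without_zeros)

print(mean_with_zeros, mean_without_zeros)
print(filled.tolist())
```

Whether the zero-excluded mean (or a median) is the better fill value depends on the column; the point is only that the two means can differ noticeably when zeros are frequent.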

 

 

 

Building and evaluating the model

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

size = X.shape[0]
X_use = dfX.iloc[:size,:]
X_submission_use = dfX.iloc[size :,:]

print(X_use.shape, X_submission_use.shape)

>>> (614, 8) (154, 8)

xtrain, xtest, ytrain, ytest = train_test_split(
    X_use, Y, test_size=0.25, stratify=Y, random_state=7438
)

print([x.shape for x in [xtrain, xtest, ytrain, ytest]])

>>> [(460, 8), (154, 8), (460,), (154,)]
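The `stratify` argument in the split above asks `train_test_split` to keep the class proportions about the same in both parts, which matters when the positive class is the minority. A small sketch on synthetic labels (the 65/35 ratio is an illustrative assumption, not the dataset's actual ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 65 negatives and 35 positives (illustrative numbers).
y = np.array([0] * 65 + [1] * 35)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With stratify=y, both parts keep roughly the same positive rate as y.
print(y.mean(), y_tr.mean(), y_va.mean())
```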

 

model = RandomForestClassifier(max_depth = 5, random_state=7438).fit(xtrain, ytrain.values.ravel())
pred = model.predict(X_submission_use)

# Column names match this dataset (id and Outcome)
submission = pd.DataFrame({'id': test_id, 'Outcome': pred})

submission.to_csv('result.csv', index=False)
df = pd.read_csv('result.csv')
df

 

y_test['Outcome']

model.score(X_submission_use, y_test['Outcome'])

>>> 0.779220779220...
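The holdout split (`xtest`, `ytest`) and the imported `roc_auc_score` are never used above; `model.score` reports accuracy only, and only on the submission set. A minimal sketch of scoring a validation split with both accuracy and ROC AUC follows. It runs on synthetic data from `make_classification` rather than the diabetes file, so the names `X_demo`, `y_demo` and `clf` are hypothetical and the printed numbers say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features (not the real diabetes data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Accuracy uses hard labels; ROC AUC needs the positive-class probability.
val_acc = accuracy_score(y_va, clf.predict(X_va))
val_auc = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

print(f"validation accuracy={val_acc:.3f}, ROC AUC={val_auc:.3f}")
```

In the post's own variables, the same two calls would take `xtest` and `ytest` in place of `X_va` and `y_va`.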

 

 

Kaggle source: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database