Data Preprocessing Code Cheat Sheet

Data Science

HAN___ 2022. 11. 16. 20:35

import numpy as np

import pandas as pd

df = pd.read_csv("")

0. Check basic information
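A minimal sketch of what checking basic information might look like. The toy DataFrame and its column names are made up for illustration; in practice df would come from pd.read_csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the loaded CSV.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Seoul", "Busan", "Seoul", "Daegu"],
})

print(df.head())    # first few rows
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column
df.info()           # non-null counts and memory usage

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)
```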

1. Check for missing values (Missing Data)
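One possible way to count and handle missing values, sketched on made-up data. Whether to drop rows or fill them (and with what) depends on the dataset; the mean and the "Unknown" label used here are only assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0],
    "city": ["Seoul", "Busan", None, "Daegu"],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Option A: drop every row that contains a missing value.
dropped = df.dropna()

# Option B: fill missing values instead of dropping rows.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna("Unknown")
```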

2. Check for duplicate data
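A short sketch of checking for and removing duplicate rows, using a made-up DataFrame. It assumes a row counts as a duplicate only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical toy data; the third row repeats the second.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [90, 85, 85, 70],
})

# duplicated() marks rows that are identical to an earlier row.
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# Keep the first occurrence of each row and rebuild the index.
deduped = df.drop_duplicates().reset_index(drop=True)
```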

3. Check for outliers (Outlier)
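The post does not say which outlier rule it uses, so the sketch below assumes the common 1.5 × IQR rule (the same convention a box plot uses for its whiskers). The data and the column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme value.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 100]})

# Interquartile range of the column.
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (df["value"] < lower) | (df["value"] > upper)

outliers = df[is_outlier]
cleaned = df[~is_outlier]
print(outliers)
```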

4. Normalization
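"Normalization" can mean different things, so this sketch shows two common variants in plain pandas: min-max scaling to the range [0, 1] and z-score standardization. Which one the post intends is an assumption, and the column names are made up.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0]})
col = df["height"]

# Min-max scaling: smallest value becomes 0, largest becomes 1.
df["height_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score standardization: mean 0, standard deviation 1
# (ddof=0 uses the population standard deviation).
df["height_zscore"] = (col - col.mean()) / col.std(ddof=0)

print(df)
```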

5. One-Hot Encoding
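One way to one-hot encode a categorical column is pd.get_dummies. The sketch below uses made-up data; dtype=int is passed so the new columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [1, 2, 3, 4],
})

# Replace "color" with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```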

6. Binning
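A small sketch of binning a numeric column with pd.cut. The bin edges and labels here are arbitrary choices for the example, not values from the post.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"age": [5, 17, 25, 42, 68]})

# Bin edges are right-inclusive by default: (0, 18], (18, 40], (40, 100].
bins = [0, 18, 40, 100]
labels = ["0-18", "19-40", "41-100"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

print(df)
```

If equal-sized groups matter more than fixed edges, pd.qcut splits by quantiles instead.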

df.drop('column_to_drop', axis=1, inplace=True): drop a column

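df['column_name'] = df['column_name'].astype(new_type): change a column's type (astype returns a new Series, so assign the result back)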

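df['column_name'] = df['column_name'].replace(old_value, new_value): change values in a column (assigning back is safer than inplace=True on a selected column, which recent pandas versions warn about)

df['column_name'].value_counts(): check a column's distribution, i.e. the count of each value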
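The column operations above can be tried together on a small DataFrame. The data, the column names, and the replacement values below are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "id" is stored as strings on purpose.
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "gender": ["M", "F", "F", "M"],
    "memo": ["a", "b", "c", "d"],
})

# Drop a column in place.
df.drop("memo", axis=1, inplace=True)

# Change a column's type; astype returns a new Series, so assign it back.
df["id"] = df["id"].astype(int)

# Replace a value; assign the result back to the column.
df["gender"] = df["gender"].replace("M", "male")

# Count how often each value occurs.
counts = df["gender"].value_counts()
print(counts)
```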

Visualization

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.histplot(data=df, x="", hue=""): draw a histogram of x, split by hue
sns.kdeplot(data=df, x="", hue=""): draw a KDE (kernel density estimate) plot
sns.countplot(data=df, x="", hue=""): plot the count of each value in x (similar to a histogram)
sns.heatmap(df[['col1', 'col2', 'col3', ..., '']].corr(), annot=True): draw a correlation heatmap
sns.boxplot(data=df, x="", y=""): check for outliers

df.to_csv("", index=False): save the result; pass index=False so the existing index is not written to the file
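To see what index=False changes, the sketch below writes a made-up DataFrame to CSV text in memory (to_csv returns a string when no path is given) and reads it back.

```python
import io
import pandas as pd

# Hypothetical toy data with a non-default index.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}, index=[10, 20])

# With the default index=True, the index becomes an extra unnamed first column.
with_index = df.to_csv()

# With index=False, only the data columns are written.
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the index-free CSV back yields a fresh 0..n-1 index.
reloaded = pd.read_csv(io.StringIO(without_index))
```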