현장에서 만들어진 실제 데이터는 오류를 포함하고 있기 때문에 분석하기 전에 데이터 정제 과정이 필요하다.

결측치 정제하기¶

결측치 : 누락된 값, 비어 있는 값 ## 결측치 만들기

In [1]:

import pandas as pd
import numpy as np

In [2]:

df = pd.DataFrame({'sex' : ['M', 'F', np.nan, 'M', 'F'],
                   'score' : [5, 4, 3, 4, np.nan]})
df

Out[2]:

	sex	score
0	M	5.0
1	F	4.0
2	NaN	3.0
3	M	4.0
4	F	NaN

결측치가 있는 상태에서 연산을 하면 출력 결과도 결측치이다.

In [7]:

df['score'] + 1

Out[7]:

0    6.0
1    5.0
2    4.0
3    5.0
4    NaN
Name: score, dtype: float64

결측치 확인하기¶

In [8]:

pd.isna(df)

Out[8]:

	sex	score
0	False	False
1	False	False
2	True	False
3	False	False
4	False	True

In [10]:

pd.isna(df).sum() # 결측치 빈도 확인

Out[10]:

sex      1
score    1
dtype: int64

결측치 제거하기¶

In [11]:

df.dropna(subset=['score']) # score 결측치 제거

Out[11]:

	sex	score
0	M	5.0
1	F	4.0
2	NaN	3.0
3	M	4.0

In [18]:

df.dropna(subset=['sex', 'score']) # 여러 변수의 결측치 제거

Out[18]:

	sex	score
0	M	5.0
1	F	4.0
3	M	4.0

In [19]:

df.dropna() # 모든 변수의 결측치 제거

Out[19]:

	sex	score
0	M	5.0
1	F	4.0
3	M	4.0

dropna( )는 결측치가 하나라도 있으면 모두 제거하여 간편하지만, 분석에 필요한 행까지 손실되기 때문에 비추
분석에 사용할 변수를 직접 지정해 결측치를 제거하는 방법 권장

결측치 제거없이 분석하기¶

mean과 sum은 결측치가 있어도 자동으로 제거하고 연산한다.

In [20]:

df['score'].mean()

Out[20]:

4.0

In [21]:

df['score'].sum()

Out[21]:

16.0

In [22]:

df.groupby('sex').agg(mean_score = ('score' ,'mean'),
                      sum_score = ('score', 'sum'))

Out[22]:

	mean_score	sum_score
sex
F	4.0	4.0
M	4.5	9.0

자동으로 결측치를 제거하는 기능은 편리하지만, 결측치가 있는지 모른 채로 데이터를 다루게 된다는 위험이 있다.

결측치 대체하기¶

imputation, 결측치 대체법 : 데이터가 손실되어 분석 결과가 왜곡되는 문제를 보완할 수 있음

평균값으로 결측치 대체¶

In [23]:

exam = pd.read_csv('exam.csv')
exam

Out[23]:

	id	nclass	math	english	science
0	1	1	50	98	50
1	2	1	60	97	60
2	3	1	45	86	78
3	4	1	30	98	58
4	5	2	25	80	65
5	6	2	50	89	98
6	7	2	80	90	45
7	8	2	90	78	25
8	9	3	20	98	15
9	10	3	50	98	45
10	11	3	65	65	65
11	12	3	45	85	32
12	13	4	46	98	65
13	14	4	48	87	12
14	15	4	75	56	78
15	16	4	58	98	65
16	17	5	65	68	98
17	18	5	80	78	90
18	19	5	89	68	87
19	20	5	78	83	58

In [24]:

exam.loc[[2, 7, 14], ['math']] = np.nan # 2, 7, 14행의 math에 nan 할당

In [25]:

exam

Out[25]:

	id	nclass	math	english	science
0	1	1	50.0	98	50
1	2	1	60.0	97	60
2	3	1	NaN	86	78
3	4	1	30.0	98	58
4	5	2	25.0	80	65
5	6	2	50.0	89	98
6	7	2	80.0	90	45
7	8	2	NaN	78	25
8	9	3	20.0	98	15
9	10	3	50.0	98	45
10	11	3	65.0	65	65
11	12	3	45.0	85	32
12	13	4	46.0	98	65
13	14	4	48.0	87	12
14	15	4	NaN	56	78
15	16	4	58.0	98	65
16	17	5	65.0	68	98
17	18	5	80.0	78	90
18	19	5	89.0	68	87
19	20	5	78.0	83	58

In [26]:

exam['math'].mean()

Out[26]:

55.23529411764706

In [27]:

exam['math'] = exam['math'].fillna(55)
exam

Out[27]:

	id	nclass	math	english	science
0	1	1	50.0	98	50
1	2	1	60.0	97	60
2	3	1	55.0	86	78
3	4	1	30.0	98	58
4	5	2	25.0	80	65
5	6	2	50.0	89	98
6	7	2	80.0	90	45
7	8	2	55.0	78	25
8	9	3	20.0	98	15
9	10	3	50.0	98	45
10	11	3	65.0	65	65
11	12	3	45.0	85	32
12	13	4	46.0	98	65
13	14	4	48.0	87	12
14	15	4	55.0	56	78
15	16	4	58.0	98	65
16	17	5	65.0	68	98
17	18	5	80.0	78	90
18	19	5	89.0	68	87
19	20	5	78.0	83	58

In [28]:

exam['math'].isna().sum()

Out[28]:

예제¶

In [32]:

import pydataset

In [33]:

mpg = pydataset.data('mpg')
mpg

Out[33]:

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

In [34]:

mpg.loc[[64, 123, 130, 152, 211], 'hwy'] = np.nan

In [41]:

mpg[['drv', 'hwy']].isna().sum()

Out[41]:

drv    0
hwy    0
dtype: int64

In [37]:

mpg.groupby('drv')[['hwy']].mean()

Out[37]:

	hwy
drv
4	19.262626
f	28.152381
r	21.000000

In [43]:

mpg = mpg.dropna(subset = ['hwy'])\
         .groupby('drv')\
         .agg(mean_hwy = ('hwy', 'mean'))

In [44]:

mpg

Out[44]:

	mean_hwy
drv
4	19.262626
f	28.152381
r	21.000000

이상치 정제하기¶

이상치 : 정상 범위에서 크게 벗어난 값, anomaly

In [45]:

df = pd.DataFrame({'sex' : [1,2,1,3,2,1],
                   'score' : [5,4,3,4,2,6]})
df

Out[45]:

	sex	score
0	1	5
1	2	4
2	1	3
3	3	4
4	2	2
5	1	6

In [46]:

df['sex'].value_counts().sort_index()

Out[46]:

1    3
2    2
3    1
Name: sex, dtype: int64

In [47]:

df['score'].value_counts().sort_index()

Out[47]:

2    1
3    1
4    2
5    1
6    1
Name: score, dtype: int64

sort_index( ) : 빈도 기준으로 내림차순 정렬하지 않고 변수의 값 순서로 정렬

In [48]:

# sex가 3이면 nan 부여
df['sex'] = np.where(df['sex'] == 3, np.nan, df['sex'])
df

Out[48]:

	sex	score
0	1.0	5
1	2.0	4
2	1.0	3
3	NaN	4
4	2.0	2
5	1.0	6

In [49]:

# score가 5보다 크면 nan 부여
df['score'] = np.where(df['score'] > 5, np.nan, df['score'])
df

Out[49]:

	sex	score
0	1.0	5.0
1	2.0	4.0
2	1.0	3.0
3	NaN	4.0
4	2.0	2.0
5	1.0	NaN

In [50]:

df.dropna(subset=['sex', 'score'])\
  .groupby('sex')\
  .agg(mean_score = ('score', 'mean'))

Out[50]:

	mean_score
sex
1.0	4.0
2.0	3.0

이상치 제거하기¶

outlier, 극단치 : 극단적으로 크거나 작은 값

1. 상자 그림 살펴보기¶

In [55]:

mpg = pydataset.data('mpg')
mpg

Out[55]:

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

In [56]:

import seaborn as sns

In [57]:

sns.boxplot(data = mpg, y = 'hwy')

Out[57]:

<AxesSubplot:ylabel='hwy'>

2. 극단치 기준값 구하기¶

In [62]:

# 1사분위수 구하기
pct25 = mpg['hwy'].quantile(.25)
pct25

Out[62]:

18.0

In [63]:

# 3사분위수 구하기
pct75 = mpg['hwy'].quantile(.75)
pct75

Out[63]:

27.0

In [64]:

# IQR(inter quartile range, 사분위 범위) 구하기
iqr = pct75 - pct25
iqr

Out[64]:

9.0

In [65]:

# 극단치 경계(하한) 구하기
pct25 - 1.5 * iqr

Out[65]:

4.5

In [66]:

# 극단치 경계(상한) 구하기
pct75 + 1.5 * iqr

Out[66]:

40.5

3. 극단치를 결측 처리하기¶

In [72]:

# 4.5 ~ 40.5 벗어나면 NAN 부여
mpg['hwy'] = np.where((mpg['hwy'] < 4.5) | (mpg['hwy'] > 40.5), np.nan, mpg['hwy'])
mpg

Out[72]:

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29.0	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29.0	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31.0	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30.0	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26.0	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28.0	p	midsize
231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29.0	p	midsize
232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26.0	p	midsize
233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26.0	p	midsize
234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26.0	p	midsize

234 rows × 11 columns

In [73]:

# 결측치 빈도 확인
mpg['hwy'].isna().sum()

Out[73]:

4. 결측치 제거하고 분석하기¶

In [76]:

# drv(구동 방식)에 따라 hwy(고속도로 연비) 평균을 구해보자
mpg.dropna(subset=['hwy'])\
   .groupby('drv')\
   .agg(mean_hwy = ('hwy', 'mean'))

Out[76]:

	mean_hwy
drv
4	19.174757
f	27.728155
r	21.000000

예제¶

In [88]:

mpg = pydataset.data('mpg')
mpg

Out[88]:

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

In [89]:

# 이상치 할당하기
mpg.loc[[9, 13, 57, 92], 'drv'] = 'k'
mpg.loc[[28, 42, 128, 202], 'cty'] = [3, 4, 39, 42]

In [93]:

# drv에 이상치 있는지 확인하기
mpg['drv'].value_counts().sort_index()

Out[93]:

4    100
f    106
k      4
r     24
Name: drv, dtype: int64

In [96]:

# 이상치를 결측 처리하기
mpg['drv'] = np.where(mpg['drv'].isin(['f', 4, 'r']), mpg['drv'], np.nan)

In [97]:

# 결측치 빈도 확인
mpg['drv'].isna().sum()

Out[97]:

In [98]:

# 상자그림 이용하여 cty에 이상치 있는지 확인하기
sns.boxplot(data = mpg, y = 'cty')

Out[98]:

<AxesSubplot:ylabel='cty'>

In [101]:

pct25 = mpg['cty'].quantile(.25)
pct25

Out[101]:

14.0

In [103]:

pct75 = mpg['cty'].quantile(.75)
pct75

Out[103]:

19.0

In [104]:

iqr = pct75 - pct25
iqr

Out[104]:

5.0

In [106]:

# 하한
pct25 - 1.5 * iqr

Out[106]:

6.5

In [107]:

# 상한
pct75 + 1.5 * iqr

Out[107]:

26.5

In [108]:

# 이상치 결측 처리하기
mpg['cty'] = np.where((mpg['cty'] < 6.5) | (mpg['cty'] > 26.5), np.nan, mpg['cty'])
mpg['cty'][:5]

Out[108]:

1    18.0
2    21.0
3    20.0
4    21.0
5    16.0
Name: cty, dtype: float64

In [109]:

# 다시 상자그림 이용하여 이상치 제거 여부 확인하기
sns.boxplot(data = mpg, y = 'cty')

Out[109]:

<AxesSubplot:ylabel='cty'>

In [112]:

mpg.dropna(subset = ['drv', 'cty']).groupby('drv').agg(mean_cty = ('cty', 'mean'))

Out[112]:

	mean_cty
drv
f	19.470000
r	13.869565

py린이

07. 데이터 정제하기(결측치, 이상치)

결측치 정제하기¶

결측치 확인하기¶

결측치 제거하기¶

결측치 제거없이 분석하기¶

결측치 대체하기¶

평균값으로 결측치 대체¶

예제¶

이상치 정제하기¶

이상치 제거하기¶

1. 상자 그림 살펴보기¶

2. 극단치 기준값 구하기¶

3. 극단치를 결측 처리하기¶

4. 결측치 제거하고 분석하기¶

예제¶

'Do it 파이썬 데이터 분석'의 다른글

티스토리툴바

« 2025/09 »
일	월	화	수	목	금	토
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30