Handling Missing Data

Pradyumn Joshi
3 min read · Jun 10, 2021

(10 June 21) This post talks about the methods of filling null values.

Data pre-processing is one of the most important parts of data science, and handling missing values is one part of it. There are many tools for handling null values; the most common one is Pandas.

You can identify null values in your data set because Pandas represents them as NaN. To learn how to handle null values, we’ll be working with the Titanic data set: Titanic

import pandas as pd
df = pd.read_csv('train.csv')
df.isna().sum()
Output: the number of null values in each column.

So, as you can see, there are 177 null values in the ‘Age’ column, 687 in ‘Cabin’ and 2 in ‘Embarked’.
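To try the counting step without downloading train.csv, here is a minimal sketch on a small made-up frame. The name `toy` and all of its values are assumptions for illustration only, not rows from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
null_counts = toy.isna().sum()
print(null_counts)

# mean() of the same boolean mask gives the fraction of missing values per column.
null_share = toy.isna().mean()
print(null_share)
```

Looking at the share as well as the count can help when deciding whether a column is worth filling at all.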

The ‘Embarked’ column records the port where a passenger boarded. For this walkthrough we treat it as not needed for predicting survival, so we are going to drop it. We treat the ‘Cabin’ column the same way and drop it as well.

df = df.drop(['Embarked','Cabin'],axis=1)

After running this, you will see an updated data set with no ‘Embarked’ or ‘Cabin’ columns.
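The same drop can be checked on a small made-up frame. This is only a sketch: `toy` and `trimmed` are hypothetical names, and `columns=[...]` is used here as an equivalent spelling of passing the labels with `axis=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", None],
})

# drop() returns a new frame without the listed columns.
trimmed = toy.drop(columns=["Embarked", "Cabin"])

print(list(trimmed.columns))  # only 'Age' remains
print(list(toy.columns))      # the original frame still has all three columns
```

Because `drop` returns a new frame, the result has to be assigned back (as in `df = df.drop(...)`) for the change to stick.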

There are three basic methods to fill missing data:

  1. Mean: the arithmetic average of the data set, i.e. the sum of the values divided by how many there are.
Formula of mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
df['Age'] = df['Age'].fillna(df['Age'].mean())

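2. Median: the middle value of the data set when the values are listed in ascending or descending order. Below, $x_{(k)}$ denotes the $k$-th value in sorted order and $n$ is the number of observations.

For an odd number of observations: $\text{median} = x_{((n+1)/2)}$
For an even number of observations: $\text{median} = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$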
df['Age'] = df['Age'].fillna(df['Age'].median())

3. Mode: the value that occurs most frequently in the data set.

df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
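The three fills above can be compared side by side on a small made-up column. This is a sketch under assumptions: the `ages` values are invented so that mean, median and mode all differ. Note that `mode()` returns a Series (there can be ties), so the sketch takes its first entry before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries (values are made up).
ages = pd.Series([18.0, np.nan, 18.0, 24.0, 30.0, np.nan, 60.0])

# NaN is skipped when computing each statistic.
mean_filled = ages.fillna(ages.mean())            # mean of 18, 18, 24, 30, 60 is 30.0
median_filled = ages.fillna(ages.median())        # middle of the sorted values is 24.0
mode_filled = ages.fillna(ages.mode().iloc[0])    # most frequent value is 18.0

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Only the two missing positions change; every observed value is left as it was.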

Besides these three traditional methods, two more methods are commonly used:

  1. Backward Fill: fills each null value with the next valid value in the column.
df['Age'] = df['Age'].bfill()

2. Forward Fill: fills each null value with the previous valid value in the column.

df['Age'] = df['Age'].ffill()
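The direction of each fill is easy to mix up, so here is a minimal sketch on a made-up column (the Series `s` and its values are assumptions for illustration). It uses the `ffill()` and `bfill()` methods of a pandas Series: forward fill carries the last valid value down, and backward fill pulls the next valid value up.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps at the start, in the middle and at the end.
s = pd.Series([np.nan, 22.0, np.nan, np.nan, 35.0, np.nan])

forward = s.ffill()   # each gap takes the previous valid value
backward = s.bfill()  # each gap takes the next valid value

print(forward.tolist())
print(backward.tolist())
```

A leading gap has no previous value and a trailing gap has no next value, so forward fill leaves the first entry as NaN and backward fill leaves the last entry as NaN.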

Out of these five methods, you can choose whichever suits the case at hand. For this case we are going to use the median, because it is not pulled around by outliers in the data set the way the mean is, which helps in attaining good accuracy.
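To see why outliers matter for this choice, here is a small sketch with one deliberately extreme value. The numbers are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; 129.0 is an artificial outlier and one entry is missing.
ages = pd.Series([21.0, 23.0, np.nan, 25.0, 27.0, 129.0])

mean_value = ages.mean()      # pulled up by the outlier
median_value = ages.median()  # stays at the middle of the sorted values

print(mean_value, median_value)

# Filling with the median keeps the imputed age close to the typical values.
filled = ages.fillna(median_value)
print(filled.tolist())
```

With a single extreme value the mean moves well away from the bulk of the data, while the median stays with it.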

Thank you for reading! Have a good day!

Follow for more related content!
