Before implementing any algorithm on a dataset, it is best practice to explore it first so that you can get an idea about the data. Today, we will learn how to check for missing/NaN/NULL values in data.
1. Reading the data
Reading the CSV data and storing it in a pandas DataFrame.
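A minimal sketch of this step. The file path and column names here are assumptions for illustration; to keep the snippet self-contained it reads an inline string, but the same `pd.read_csv()` call works with an actual file path.

```python
import io
import pandas as pd

# Inline sample standing in for the real loan CSV file;
# in practice you would pass a path, e.g. pd.read_csv("loan.csv")
csv_text = """Loan_ID,Gender,LoanAmount
LP001002,Male,
LP001003,Male,128
LP001005,Female,66
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)  # rows x columns
```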
2. Exploring data
Checking out how the data looks by using the head() method, which fetches the top rows of the DataFrame.
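For example (using a toy stand-in for the loan dataset, with assumed column names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

# head() returns the first 5 rows by default; head(2) returns the first 2
print(data.head(2))
```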
3. Checking NULLs
Pandas provides two methods to check for NULLs - isnull() and notnull().
They return TRUE and FALSE respectively when a value is NULL. So let's check what they return for our data.
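A quick sketch of both checks on a toy frame (the column names and values are assumptions standing in for the loan data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

print(data.isnull())   # True wherever a value is NaN/NULL
print(data.notnull())  # True wherever a value is present
```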
isnull() test
notnull() test
Check the 0th row, LoanAmount column - in the isnull() test it is TRUE and in the notnull() test it is FALSE. This means that row/column is holding a null.
But we would not prefer this approach for a large dataset, as it returns a TRUE/FALSE matrix for every data point; instead, we are interested in the counts, or a simple check of whether the dataset holds any NULLs at all.
Use any()
Pandas also provides the any() method, which returns TRUE if there is at least one data point that satisfies the checked condition.
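For instance, on a toy frame (assumed columns), any() applied to the isnull() matrix tells us per column, and then overall, whether any NULL exists:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

print(data.isnull().any())        # per column: does the column hold any NULL?
print(data.isnull().any().any())  # whole frame: does any NULL exist at all?
```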
Use all()
Returns TRUE only if all the data points satisfy the condition.
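A sketch of all() on the same kind of toy frame (assumed columns) - useful, for example, to verify that a column is entirely free of NULLs:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

print(data.isnull().all())   # per column: is EVERY value NULL? (rarely True)
print(data.notnull().all())  # per column: is every value present?
```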
Now that we know there are some null/NaN values in our data frame, let's check them out -
data.isnull().sum() - this returns the count of NULL/NaN values in each column.
data.isnull().sum().sum() - this returns the total count of NULL/NaN values in the whole DataFrame.
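Both counts in one sketch, again on a toy frame with assumed columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

print(data.isnull().sum())        # NULL count per column
print(data.isnull().sum().sum())  # total NULL count in the whole frame
```

This works because True/False are summed as 1/0, so summing the boolean matrix column-wise gives per-column counts, and summing once more collapses those into a grand total.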
If you want to get the NaN count for any particular column -
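For a single column, select it first and then apply the same chain (column name assumed from the examples above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan dataset (column names are assumed)
data = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", np.nan],
    "LoanAmount": [np.nan, 128.0, 66.0],
})

print(data["LoanAmount"].isnull().sum())  # NULL count in one column
```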
Here, I have attached the complete Jupyter Notebook for you -
If you want to download the data, you can get it from HERE.