Skip to content

Basics of Pandas x Numpy – Useful Techniques

Below are some of the useful techniques that will save you time and energy whenever you’re using pandas! These are the useful techniques I have learned based on my consultation using the pandas library.

Import required libraries and data.












import pandas as pd
from pandas import DataFrame
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

pd.read_csv("data_1.csv")
Import required libraries and data.
Import required libraries and data.

2 useful “read_csv” parameters that are often overlooked:

  1. index_col: column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used
  2. encoding: encoding to use for UTF when reading/writing (ex. ‘utf-8’)

You can remove an “Unnamed: 0” column in your DataFrame.



pd.read_csv("data_1.csv", index_col=0, encoding='utf-8')
2 useful "read_csv" parameters that are often overlooked:
2 useful “read_csv” parameters that are often overlooked:

Explore your data using these:

  1. df.dtypes: data type of array elements
  2. df.shape: array dimensions
  3. len(df): length of array
  4. df.head(): return first DataFrame rows
  5. df.tail(): return last DataFrame rows




pd.read_csv("data_1.csv", index_col=0, encoding='utf-8').dtypes
pd.read_csv("data_1.csv", index_col=0, encoding='utf-8').shape
len(pd.read_csv("data_1.csv", index_col=0, encoding='utf-8'))
Data Exploration 1
Data Exploration 1


pd.read_csv("data_1.csv", index_col=0, encoding='utf-8').head(10)
df.head(): return first DataFrame rows
df.head(): return first DataFrame rows


pd.read_csv("data_1.csv", index_col=0, encoding='utf-8').tail(10)
df.tail(): return last DataFrame rows
df.tail(): return last DataFrame rows

Convert from the data frame into a list.





df1 = pd.read_csv("data_1.csv", index_col=0, encoding='utf-8')
model_list = df1['model'] 
.tolist()
model_list
Convert from the data frame into a list.

Select a specific column.




df2 = pd.read_csv("data_2.csv", index_col=0, encoding='utf-8')
df2['brand'] 
Select a specific column.
Select a specific column.

Data extraction using regex







df2 = pd.read_csv("data_2.csv", index_col=0, encoding='utf-8')
regex = f'{"|".join(model_list)}'
df2['model']  = df2['brand'] 
.str.lower().str.extract(f'(?P<model>{regex})')
df2['model'] .tail(100)
Data extraction using regex
Data extraction using regex

Delete column.


del df2['link'] 
Delete column.
Delete column.

Fill up missing data.




df2.dropna()
df2.fillna("Unknown",inplace=True)
Fill up missing data.
Fill up missing data.

Remove specific rows based on condition.


df2 = df2[df2.model != "Unknown"] 
Remove specific rows based on condition.
Remove specific rows based on condition.

A lambda function that left and right strip the string.


df3['built_up']  = df3['built_up'] .map(lambda x: x.lstrip('From  Up to ').rstrip(' sq. ft.'))
A lambda function that left and right strip the string
A lambda function that left and right strip the string
A lambda function that left and right strip the string
A lambda function that left and right strip the string

Find the average of the array’s elements.


df3['built_up']  = [np.mean(pd.to_numeric(x.strip('- ').split('-'), errors='coerce')) for x in df3.built_up] 
Find the average of the array's elements.
Find the average of the array’s elements.

Python histogram plotting



df3.hist(column='built_up')
Python histogram plotting
Python histogram plotting

Replace values with others


df3['bedrooms']  = df3['bedrooms'] .str.replace('Studio', '1')

Evaluate a string describing operations on DataFrame columns.


df3['bedrooms']  = df3['bedrooms'] .str.rstrip('+- ').apply(pd.eval)
Evaluate a string describing operations on DataFrame columns.
Evaluate a string describing operations on DataFrame columns.

Convert into a numeric data type.


df3['monthly_installment']  = pd.to_numeric(df3['monthly_installment'] )
Convert into a numeric data type.
Convert into a numeric data type.

Make a horizontal bar plot based on categorical data.



df3.type.value_counts().plot(kind='barh', figsize=(10, 10))
Make a horizontal bar plot based on categorical data.
Make a horizontal bar plot based on categorical data.

Get a statistic summary.



df3.describe()
Get a statistic summary.
Get a statistic summary.

Detect and remove outliers using IQR.










Q1 = df3.quantile(0.25)
Q3 = df3.quantile(0.75)
IQR = Q3 - Q1

df3 = df3[~((df3 < (Q1 - 1.5 * IQR)) |(df3 > (Q3 + 1.5 * IQR))).any(axis=1)] 
Detect and remove outliers using IQR.
Detect and remove outliers using IQR.

Convert categorical data into numerical data.























cleanup_nums = {"type":{"Bungalow/Villa": 1, 
                        "Apartment/Condo/Service Residence": 2,
                        "Semi-Detached House": 3,
                        "Terrace/Link House": 4,
                        "Residential Land": 5},
                "land_title":{"Industrial": 1, 
                              "Commercial": 2, 
                              "Unknown": 3, 
                              "Residential": 4},
                "tenure":{"Malay Reserved Land":1,
                          "Lease Tenancy":2,
                          "Leasehold":3,
                          "Unknown":4,
                          "Freehold":5},
                "furnishing":{"Fully Furnished":1,
                              "Unfurnished":2,
                              "Unknown":3,
                              "Partly Furnished":4}
               }

df3.replace(cleanup_nums, inplace=True)

Split into training and testing data frame based on percentage.





msk = np.random.rand(len(df3)) < 0.8
df_train = df3[msk] 

df_test = df3[~msk] 

Write to the CSV file.



cols_to_keep = ['link', 'name', 'type', 'price', 'address', 'built_up', 'bedrooms', 'bathrooms', 'monthly_installment', 'land_title','tenure','furnishing', 'state'] 

df3[cols_to_keep] .to_csv('data_4.csv')

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *