There are times when we know what to do but can’t execute it immediately. Suppose we are working on a task that requires transforming a categorical variable. It is easy to say that one-hot encoding or label encoding would be an appropriate technique for converting categorical variables to an equivalent numeric format. However, when we start writing the code, we get stuck, so we search the internet for snippets. That is time-consuming and repetitive. Working in Machine Learning and Artificial Intelligence (AI), we should streamline our own work before we automate the world. caret is an excellent package that has most of the functions we need while working in R. But sometimes that’s not enough, and occasionally we need things that are not available in caret.
So I have started writing a few R functions that come in handy when I work in R, and I thought I would share them with you.
Categorical Treatment
One-hot encoding and label encoding functions for transforming categorical data.
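one_hot_encode = function(outcome, vars, df){
  # Note: outcome is not used; designTreatmentsZ builds an outcome-free plan
  # Load the package vtreat
  library(vtreat)
  # Create the treatment plan for the columns named in vars
  treatplan <- designTreatmentsZ(df, vars, verbose = FALSE)
  # Apply the plan to the data
  temp.treat <- prepare(treatplan, df)
  # Join the treated columns with the untouched original columns
  # (drop = FALSE keeps a data frame even when only one column remains)
  temp.clean <- cbind(df[, !(names(df) %in% vars), drop = FALSE], temp.treat)
  temp.clean
}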
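label_encode = function(vars){
  # Convert to a factor; the factor levels act as the labels
  as.factor(vars)
}
label_encode_xgboost = function(vars){
  # xgboost needs numeric input; go through a factor first so that
  # character vectors become integer codes instead of NA
  as.numeric(as.factor(vars))
}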
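To make the two encodings concrete, here is a minimal base-R sketch on a toy data frame. It is not the vtreat-based function above: it uses model.matrix as a stand-in, and the column names (colour, size, colour_label) are made up for illustration.

```r
# Toy data; the column names are hypothetical.
df <- data.frame(
  colour = factor(c("red", "green", "blue", "green")),
  size   = c(1.5, 2.0, 0.7, 3.1)
)

# One-hot encoding: "- 1" drops the intercept so that every level
# of colour gets its own 0/1 indicator column.
one_hot <- model.matrix(~ colour - 1, data = df)

# Label encoding: factor levels are sorted (blue, green, red)
# and mapped to the integer codes 1, 2, 3.
labels <- as.integer(df$colour)

# Combine the untouched numeric column with both encodings.
encoded <- cbind(df["size"], one_hot, colour_label = labels)
print(encoded)
```

Each row of one_hot has exactly one 1, while labels holds a single integer per row. The integer codes imply an ordering that the categories may not have, which is one reason to prefer one-hot encoding for models that treat inputs as magnitudes.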
Temporal Data Treatment
It is essential to derive features from a temporal attribute before using it in a supervised learning model. The time_features function below creates 11 new attributes from a temporal variable.
library(lubridate)
time_features = function(time, col_name){
  numeric_time <- as.numeric(time)
  day_of_week <- wday(time)
  day_of_month <- mday(time)
  day_of_quarter <- qday(time)
  day_of_year <- yday(time)
  hr_of_day <- hour(time)
  min_of_day <- 60*hour(time) + minute(time)
  sec_of_day <- 3600*hour(time) + 60*minute(time) + second(time)
  week_of_year <- week(time)
  month_of_year <- month(time)
  year <- year(time)
  df_temp <- data.frame(numeric_time,
                        day_of_week,
                        day_of_month,
                        day_of_quarter,
                        day_of_year,
                        hr_of_day,
                        min_of_day,
                        sec_of_day,
                        week_of_year,
                        month_of_year,
                        year
  )
  time_df <- setNames(df_temp, paste(col_name, names(df_temp), sep="_"))
  return(time_df)
}
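To see what a few of these features look like for a single timestamp, here is a small base-R sketch that does not need lubridate. It covers only five of the eleven attributes, and the timestamp and the "pickup" prefix are made up for illustration.

```r
# A single example timestamp (hypothetical); UTC avoids local-time surprises.
ts <- as.POSIXct("2021-03-15 13:45:30", tz = "UTC")
lt <- as.POSIXlt(ts)  # broken-down time with hour, min, sec, wday, yday fields

features <- data.frame(
  hr_of_day   = lt$hour,
  min_of_day  = 60 * lt$hour + lt$min,
  sec_of_day  = 3600 * lt$hour + 60 * lt$min + lt$sec,
  day_of_year = lt$yday + 1,  # POSIXlt counts days of the year from 0
  day_of_week = lt$wday + 1   # POSIXlt uses 0 = Sunday; shift so Sunday = 1
)

# Prefix every column with the source column name, as time_features does.
names(features) <- paste("pickup", names(features), sep = "_")
print(features)
```

For this timestamp, a Monday afternoon, the minute of the day is 60 * 13 + 45 = 825 and the day of the week is 2, which matches lubridate's default convention of counting from Sunday.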
Numerical Binning
Sometimes continuous numerical data has to be converted to discrete data. For example, Apriori and the categorical form of Naive Bayes expect discrete values. The function below uses equal-width binning to convert continuous data to a discrete format.
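equi_width_binning = function(input, no_of_bins){
  # Equal-width binning: split [min, max] into no_of_bins intervals of equal width
  minimumVal <- min(input, na.rm = TRUE)
  maximumVal <- max(input, na.rm = TRUE)
  # length.out guarantees exactly no_of_bins + 1 break points, ending at the maximum
  breaks <- seq(minimumVal, maximumVal, length.out = no_of_bins + 1)
  # include.lowest = TRUE keeps the minimum value in the first bin instead of NA
  bins <- cut(input, breaks = breaks, include.lowest = TRUE)
  bins
}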
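A worked example may help here. The sketch below bins a small made-up vector into four equal-width intervals with base R, and also shows a detail that is easy to miss: by default cut() uses intervals that are open on the left, so the minimum value falls outside the first bin unless include.lowest = TRUE is set.

```r
# Made-up values ranging from 2 to 22.
x <- c(2, 4, 7, 10, 15, 18, 22)
n_bins <- 4

# Five equally spaced break points: 2, 7, 12, 17, 22 (width 5).
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Default intervals are (a, b], so the minimum (2) is left unbinned.
n_missing_default <- sum(is.na(cut(x, breaks = breaks)))

# include.lowest = TRUE closes the first interval on the left: [2, 7].
binned <- cut(x, breaks = breaks, include.lowest = TRUE)

print(n_missing_default)
print(table(binned))
```

With include.lowest = TRUE every value lands in a bin; without it, one value (the minimum) becomes NA.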
This is just the beginning. We will continue creating similar modules for repetitive tasks. You can download the code from my GitHub and start using it. If you need something in R to be modularized or want to contribute, feel free to add your code to the project and help us out.