The first article in this series covered the different methods of encoding character attributes for building useful machine learning models. In this piece, we focus on manipulating messy strings and extracting useful text from them using the stringr package in R.
To reiterate the foundation laid in the previous article: character or string data dominates enterprise datasets, which makes it hard to build an accurate machine learning model. We have to clean messy strings, pull strings apart, and extract the useful pieces embedded in text before the data can be used in a machine learning pipeline.
Below are some advantages of using stringr:
- Consistent function names and descriptive input parameters.
- Built-in pattern matching and regex functions.
- Deals with missing data by default.
- The data type of input and output strings is preserved.
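The missing-data point in the list above can be seen in a small sketch. The vector carriers is made up for illustration (two carrier names plus one NA); it is not the flight data used later in the article.

```r
library(stringr)

# Illustrative vector with one missing value (not the real dataset).
carriers <- c("Tradewind Aviation", NA, "GoJet Airlines LLC d/b/a United Express")

# stringr propagates NA instead of treating it as the text "NA".
lengths_out <- str_length(carriers)
detect_out  <- str_detect(carriers, "Air")
print(lengths_out)
print(detect_out)

# For comparison: base R's nchar() on a bare (logical) NA reports 2,
# while str_length() keeps the value missing.
print(nchar(NA))
print(str_length(NA))
```

The second element stays NA in both results, so missing carrier names do not silently turn into a string of length 2 or a FALSE match.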
Now, let’s explore the different functions available in the stringr package. We will use U.S. carrier flight data, which can be downloaded from the Bureau of Transportation Statistics website. Once the data is downloaded, load the stringr library and read the file into the R environment as shown below:
library(stringr)
flights <- read.csv("606231461_T_T100D_MARKET_US_CARRIER_ONLY")
The column UNIQUE_CARRIER_NAME holds the names of the carriers as strings. We will use this attribute to explore the stringr functionality.
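If the downloaded file is not at hand, a small stand-in can be used to follow along. The data frame flights_demo below is hypothetical: it contains only the UNIQUE_CARRIER_NAME column, filled with the two carrier names that the examples in this article mention, and it is not a substitute for the real data.

```r
library(stringr)

# Hypothetical stand-in for the downloaded file: only the column used in
# this article, with the two carrier names the examples refer to.
flights_demo <- data.frame(
  UNIQUE_CARRIER_NAME = c(
    "Tradewind Aviation",
    "GoJet Airlines LLC d/b/a United Express"
  ),
  stringsAsFactors = FALSE  # keep the names as character, not factor
)

# Quick look at the column before manipulating it.
str(flights_demo$UNIQUE_CARRIER_NAME)
unique(flights_demo$UNIQUE_CARRIER_NAME)
```

Any call written against flights$UNIQUE_CARRIER_NAME can be tried against flights_demo$UNIQUE_CARRIER_NAME in the same way.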
- str_detect checks whether a pattern occurs in a string. For instance, str_detect(flights$UNIQUE_CARRIER_NAME, "Tradewind") returns TRUE for every carrier name that contains Tradewind and FALSE for every name that does not.
- str_extract extracts the text that matches the pattern. For example, str_extract(flights$UNIQUE_CARRIER_NAME, "Tradewind") searches for Tradewind in every string and returns it wherever there is a match; strings without a match yield NA.
- str_length retrieves the length of each string in the attribute. str_length(flights$UNIQUE_CARRIER_NAME) returns the number of characters in each carrier name in the UNIQUE_CARRIER_NAME column.
- str_locate returns the position of the first match of the pattern in each string. For example, str_locate(flights$UNIQUE_CARRIER_NAME, "Trade") returns a start of 1 and an end of 5 for Tradewind Aviation, which means that the pattern Trade occupies the first through fifth characters of that name.
- str_replace is widely used. There are times when we need to replace a text pattern with another string, and this function comes in handy: it replaces the first occurrence of a matched pattern in each string. For instance, str_replace(flights$UNIQUE_CARRIER_NAME, "Tradewind", "Air") replaces Tradewind with Air, so the carrier Tradewind Aviation becomes Air Aviation. Cool makeover. Hope Tradewind Aviation likes the new branding.
- str_split breaks up a string at each match of the pattern provided. For example, str_split(flights$UNIQUE_CARRIER_NAME, "Air") splits “GoJet Airlines LLC d/b/a United Express” into “GoJet ” (with a trailing space) and “lines LLC d/b/a United Express”.
- str_sub is similar to the base R substr function; it returns a substring from each element of a character vector. For example, str_sub(flights$UNIQUE_CARRIER_NAME, 1, 3) returns “Tra” for Tradewind Aviation.
- str_trim is a useful function that trims the whitespace at the beginning and end of a string. The command str_trim(" Airlines ") returns just “Airlines”. Similarly, str_trim(" GoJet Airlines ") trims the leading and trailing whitespace and returns “GoJet Airlines”. Note that the space between “GoJet” and “Airlines” is not trimmed.
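The functions described above can be tried together in a short sketch. It assumes a two-element vector, carriers, built from the carrier names mentioned in the text rather than the full flights data, so the results are easy to check by eye.

```r
library(stringr)

# Illustrative vector using the two carrier names from the text.
carriers <- c("Tradewind Aviation", "GoJet Airlines LLC d/b/a United Express")

# Does each name contain the pattern? One logical value per string.
detected <- str_detect(carriers, "Tradewind")

# First match per string, or NA where the pattern is absent.
extracted <- str_extract(carriers, "Tradewind")

# Number of characters in each name.
lens <- str_length(carriers)

# Matrix with "start" and "end" columns for the first match in each string.
located <- str_locate(carriers, "Trade")

# Replace the first occurrence of the pattern in each string.
replaced <- str_replace(carriers, "Tradewind", "Air")

# Split at every match; the result is a list with one element per input.
pieces <- str_split(carriers[2], "Air")

# Characters 1 through 3 of each name.
prefixes <- str_sub(carriers, 1, 3)

# Strip leading and trailing whitespace; inner spaces are kept.
trimmed <- str_trim("  GoJet Airlines  ")

print(detected)
print(extracted)
print(lens)
print(located)
print(replaced)
print(pieces)
print(prefixes)
print(trimmed)
```

Two details are worth noticing in the output: str_locate returns NA for both start and end when a name has no match, and the first piece from str_split keeps the space that preceded Air.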
These are some of the handy stringr functions that are used often. The package has more functions that are less commonly used but good to know; you can refer to the R documentation to explore them. stringr is one of the essential packages in a data science toolbox, and if you have read this far, you are ready to manipulate strings in R with ease.