The first article in this series shed some light on the different methods of encoding character attributes for creating useful machine learning models. In this piece, we focus on manipulating messy strings and extracting useful text from them using R.
To reiterate a key point from the previous article, character or string data dominates enterprise datasets, which makes it hard to build an accurate machine learning model. We have to clean messy strings, pull strings apart, and extract useful strings embedded in a text to bring the data into a form that can be used in a machine learning pipeline.
The stringr package for R is built for this kind of work. Below are some advantages of using stringr:
- Consistent function names and descriptive input parameters.
- Built-in pattern matching and regex functions.
- Deals with missing data by default.
- The data type of input and output strings is preserved.
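As a minimal sketch of the missing-data point above, the snippet below runs two stringr functions over a vector that contains an NA. The carriers vector is a hypothetical sample, not taken from the flight data.

```r
library(stringr)

# Hypothetical sample values; NA marks a missing carrier name
carriers <- c("Tradewind Aviation", NA, "GoJet Airlines LLC")

# Missing values stay missing instead of raising an error
name_lengths <- str_length(carriers)         # 18 NA 18
has_air      <- str_detect(carriers, "Air")  # FALSE NA TRUE

print(name_lengths)
print(has_air)
```

Both calls return a vector of the same length as the input, so the results line up with the original rows.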
Now, let’s explore the different functions available in the stringr package. We will use U.S. Carriers flight data, which can be downloaded from the Bureau of Transportation Statistics website. Once the data is downloaded, load the stringr library and read the file into the R environment as shown below:
library(stringr)
flights <- read.csv("606231461_T_T100D_MARKET_US_CARRIER_ONLY")
The column UNIQUE_CARRIER_NAME has names of the carriers as strings. We will use this attribute to explore the stringr functionality.
- str_detect is used to find a pattern in a string. For instance, str_detect(flights$UNIQUE_CARRIER_NAME, "Tradewind") returns TRUE for each string that contains Tradewind and FALSE for each string that does not.
- str_extract extracts the text that matches the pattern. For example, str_extract(flights$UNIQUE_CARRIER_NAME, "Tradewind") searches for Tradewind in every string and extracts it whenever there is a match. Strings without a match return NA.
- str_length retrieves the length of each string in the attribute. str_length(flights$UNIQUE_CARRIER_NAME) returns the length of each carrier name in the UNIQUE_CARRIER_NAME column.
- str_locate returns the position of the input pattern. For example, for the flights dataset, str_locate(flights$UNIQUE_CARRIER_NAME, "Trade") returns a start of 1 and an end of 5 for Tradewind Aviation, which means that the pattern Trade runs from the first to the fifth character of that carrier name. Strings without a match return NA for both positions.
- str_replace is used widely. There are times when we need to replace a text pattern with another string. This function comes in handy here: it replaces the first occurrence of a matched pattern in a string. For instance, str_replace(flights$UNIQUE_CARRIER_NAME, "Tradewind", "Air") replaces Tradewind with Air. After this replacement, the carrier Tradewind Aviation is changed to Air Aviation. Cool makeover. Hope Tradewind Aviation likes this new branding.
- str_split breaks up a string based on the pattern provided. For example, str_split(flights$UNIQUE_CARRIER_NAME, "Air") splits “GoJet Airlines LLC d/b/a United Express” into “GoJet ” (with its trailing space) and “lines LLC d/b/a United Express”.
- str_sub is similar to the native substr function; it returns a substring from a character vector. For example, str_sub(flights$UNIQUE_CARRIER_NAME, 1, 3) returns “Tra” for Tradewind Aviation.
- str_trim is a useful function that trims the whitespace at the beginning and end of a string. The command str_trim(" Airlines ") trims the whitespace and returns just “Airlines”. Similarly, str_trim(" GoJet Airlines ") trims the leading and trailing whitespace and returns “GoJet Airlines”. Note that the space between “GoJet” and “Airlines” is not trimmed.
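To try these functions without downloading the full dataset, the sketch below runs each one on a small stand-in vector. The carriers vector is a hypothetical substitute for flights$UNIQUE_CARRIER_NAME and holds only the two carrier names mentioned above.

```r
library(stringr)

# Hypothetical stand-in for flights$UNIQUE_CARRIER_NAME
carriers <- c("Tradewind Aviation",
              "GoJet Airlines LLC d/b/a United Express")

# Does each name contain the pattern?
detected <- str_detect(carriers, "Tradewind")    # TRUE FALSE

# Pull out the matching text; NA where there is no match
extracted <- str_extract(carriers, "Tradewind")  # "Tradewind" NA

# Number of characters in each name
name_lengths <- str_length(carriers)             # 18 39

# Matrix with "start" and "end" columns; row 1 is 1 and 5, row 2 is NA
located <- str_locate(carriers, "Trade")

# Replace only the first match in each string
replaced <- str_replace(carriers, "Tradewind", "Air")  # "Air Aviation" ...

# List with one character vector of pieces per input string
pieces <- str_split(carriers, "Air")

# Characters 1 through 3 of each name
prefix <- str_sub(carriers, 1, 3)                # "Tra" "GoJ"

# Strip leading and trailing whitespace, keep inner spaces
trimmed <- str_trim("  GoJet Airlines  ")        # "GoJet Airlines"

print(located)
print(pieces)
```

Note that str_split returns a list rather than a vector, because each input string can produce a different number of pieces.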
These are some of the handy functions in stringr that are often used. The package has more functions that are less commonly used but good to know; you can refer to the R documentation to explore them. stringr is one of the necessary packages in a data science toolbox, and if you have read this far, you are ready to manipulate strings in R with ease.
