Process Strings in R Using stringr

Explore the different functions available in the stringr package in R using the interesting U.S. Carriers flight dataset.

The first article in this series covered the different methods of encoding character attributes to build useful machine learning models. In this piece, we focus on manipulating messy strings and extracting useful text from them using R.

To reiterate the foundation laid in our previous article: character or string data dominates enterprise datasets, which makes it hard to build an accurate machine learning model. We have to clean messy strings, pull strings apart, and extract useful substrings embedded in text to bring the data into a form that a machine learning pipeline can use.

Below are some advantages of using stringr:

  • Consistent function names and descriptive input parameters.
  • Built-in pattern matching and regex functions.
  • Deals with missing data (NA) by default: a missing input yields a missing output.
  • The data type of input and output strings is preserved.
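To make the consistent-naming and missing-data points concrete, here is a minimal sketch. The carriers vector is a made-up stand-in (an assumption for illustration, not data read from the flights file); the two function calls are standard stringr functions.

```r
library(stringr)

# Hypothetical carrier names, including one missing value
carriers <- c("Tradewind Aviation", NA, "GoJet Airlines LLC d/b/a United Express")

# Every function starts with str_ and takes the string vector as its first argument
name_lengths <- str_length(carriers)
has_air <- str_detect(carriers, "Air")

# The missing value is carried through as NA instead of raising an error
print(name_lengths)
print(has_air)
```

In both results the second element stays NA, so missing names remain visible downstream rather than being silently turned into a length or a TRUE/FALSE.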

Now, let’s explore the different functions available in the stringr package. We will use the U.S. Carriers flight data, which can be downloaded from the Bureau of Transportation Statistics website. Once the data is downloaded, load the stringr library and read the file into the R environment as shown below:

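library(stringr)
# Read the downloaded file, keeping text columns as character vectors rather than factors
flights <- read.csv("606231461_T_T100D_MARKET_US_CARRIER_ONLY", stringsAsFactors = FALSE)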

The column UNIQUE_CARRIER_NAME has names of the carriers as strings. We will use this attribute to explore the stringr functionality.

  • str_detect is used to find a pattern in a string. For instance, str_detect(flights$UNIQUE_CARRIER_NAME,"Tradewind") returns a logical vector with one value per carrier name: TRUE where the name contains Tradewind and FALSE where it does not.
  • str_extract extracts the text that matches the pattern. For example, str_extract(flights$UNIQUE_CARRIER_NAME,"Tradewind") searches for Tradewind in every string and returns the first match, or NA when there is no match.
  • str_length retrieves the length of each string that is present in the attribute. str_length(flights$UNIQUE_CARRIER_NAME) returns the length of carrier names present in the UNIQUE_CARRIER_NAME column.
  • str_locate returns the position of the first match of the pattern as a start and an end value. For example, for the flights dataset, str_locate(flights$UNIQUE_CARRIER_NAME,"Trade") returns a start of 1 and an end of 5 for Tradewind Aviation, which means that the pattern Trade occupies the first through fifth characters of that carrier name. Names that do not contain the pattern get NA for both values.
  • str_replace is used widely. There are times when we need to replace a text pattern with another string. This function comes in handy here: it replaces the first occurrence of a matched pattern in each string. For instance, str_replace(flights$UNIQUE_CARRIER_NAME,"Tradewind","Air") replaces Tradewind with Air, so the carrier Tradewind Aviation becomes Air Aviation. Cool makeover. Hope Tradewind Aviation likes this new branding.
  • str_split breaks up a string wherever the pattern matches and drops the matched text. For example, str_split(flights$UNIQUE_CARRIER_NAME,"Air") splits “GoJet Airlines LLC d/b/a United Express” into “GoJet ” (with its trailing space) and “lines LLC d/b/a United Express”.
  • str_sub is similar to the base R substr function; it returns a substring from each element of a character vector. For example, str_sub(flights$UNIQUE_CARRIER_NAME,1,3) returns “Tra” for Tradewind Aviation.
  • str_trim is a useful function that trims whitespace from the beginning and end of a string. The command str_trim(" Airlines ") returns just “Airlines”. Similarly, str_trim(" GoJet Airlines ") trims the leading and trailing whitespace and returns “GoJet Airlines”. Note that the space between “GoJet” and “Airlines” is not trimmed.
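The functions above can be tried without downloading the data. The sketch below runs each of them on a small hand-made vector; the carriers vector is a hypothetical stand-in for flights$UNIQUE_CARRIER_NAME (an assumption for illustration), holding the two carrier names mentioned in the examples.

```r
library(stringr)

# Hypothetical stand-in for flights$UNIQUE_CARRIER_NAME
carriers <- c("Tradewind Aviation", "GoJet Airlines LLC d/b/a United Express")

detected  <- str_detect(carriers, "Tradewind")          # one TRUE/FALSE per name
extracted <- str_extract(carriers, "Tradewind")         # first match, or NA if none
name_len  <- str_length(carriers)                       # number of characters per name
located   <- str_locate(carriers, "Trade")              # matrix with start and end columns
replaced  <- str_replace(carriers, "Tradewind", "Air")  # first match in each name only
pieces    <- str_split(carriers, "Air")                 # list with one character vector per name
prefixes  <- str_sub(carriers, 1, 3)                    # characters 1 through 3
trimmed   <- str_trim(" GoJet Airlines ")               # leading and trailing whitespace removed

print(detected)
print(located)
print(pieces[[2]])
print(trimmed)
```

Two details are easy to miss. str_locate returns a matrix, so located[1, "start"] and located[1, "end"] give the positions for the first name, and the row for a name without a match is NA. str_split returns a list, so the pieces of the second name are in pieces[[2]]. To replace every occurrence of a pattern rather than only the first, stringr also provides str_replace_all.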

These are some of the handy functions in stringr that are often used. The package has more functions that are less commonly used but good to know; you can refer to the R documentation to explore them. stringr is one of the necessary packages in a data science toolbox, and if you have read this far, you are ready to manipulate strings in R with ease.

Categories: Machine Learning