• Welcome !
  • Mail us: contact@analytickast.com
Analytickast.com Analytickast.com
  • Home
  • Blog
  • Coaching
    • Course Dashboard
    • Instructor Registration
    • Student Registration
  • Shop Now
  • Contact Us
  • My account
    • Cart
    • Checkout
  • Log In

Signup

Clustering DZone Articles Using R

Clustering DZone Articles Using R

Instead of using random open-source data, I used DZone articles for this experiment. I randomly picked articles from different Zones and checked whether the R code said they were sorted into the correct Zone.

In my last article, we focused on creating topic models for text data. Today, we will focus on clustering text documents. At first sight, topic modeling and document clustering seem to be the same thing. But there is a subtle difference between the two. Topic modeling helps find latent topics in all documents, which we can think of as the hidden treasures/most discussed elements in a set of documents. On the other hand, document clustering is about grouping documents into different homogeneous categories. This can help discover certain common patterns/traits among similar documents that are helpful for specific use cases like segmenting news articles, grouping comments/forum discussions, etc.

Instead of using some random open-source data from the internet, I chose to use the articles at DZone for this demonstration. I randomly picked up different articles from different Zones on DZone and checked whether the code correctly segmented them as per the Zones.

Documents Clustering Process

This is primarily a two-step process.

Step 1

Preprocess the documents to convert them into word vectors. The DTM (document term matrix) is a numerical representation of each document where the value of each term indicates the importance of that term in the document.

One of the most decisive judgments that arises while clustering is whether to use DTM or TDM for clustering. As a rule of thumb, in DTM, the documents are rows and terms are columns, while in TDM, the terms are rows and documents are columns.

If we want to cluster documents, DTM is more appropriate, as the document forms the instances and the terms are the attributes that contribute in grouping the documents. Sometimes, we might face a situation where the objective is term clustering. In this case, we can go with TDM — but topic modeling is really better-suited for this kind of use case.

Step 2

The matrix is then fed to clustering algorithms to group similar instances (i.e. documents) together into clusters.

Easy, right? Well, it is. But the only thing to watch out for is noise in the text. Preparing text and cleaning it to create a model is a Herculean task.

We will continue our example from my previous article. We have added three more DZone articles from different zones to illustrate the clustering process better.

  • Documents 1 and 2:These are two Big Data Zone articles written by me.
  • Document 3:A Cloud Zone spotlight article from the week of December 3, 2017.
  • Documents 4 and 5: Two database zone articles that were spotlighted for the same week.

Image title

The texts are unlabelled, and we will use clustering algorithms to group the documents based on their similarity. We will follow the document clustering process and preprocess the documents into a DTM. The below code is reused from our previous article. In short, the code will tokenize the articles and reduce feature space by removing numbers, punctuation, stop words, etc. The corpus will then be converted to a DTM.

library(tm)
library(SnowballC)
#set working directory (modify path as needed)
setwd("//Users//sibanjan//datascience//studies//textmining//documentclustering")
#load files into corpus
#get listing of .txt files in directory
filenames <- list.files(getwd(),pattern="*.txt")
#read files into a character vector
files <- lapply(filenames,readLines)
#create corpus from vector
articles.corpus <- Corpus(VectorSource(files))

articleDtm <- DocumentTermMatrix(articles.corpus, control = list(minWordLength = 3));
articleDtm2 <- removeSparseTerms(articleDtm, sparse=0.98)
articles_mat <- as.matrix(articleDtm2)

Once the DTM matrix is prepared, we can use any of the standard ML clustering algorithms to create clusters. The values are numbers and so that distance-based algorithms like hierarchical or k-means work well. The decision to choose which algorithm is better for a use case follows the same rules that we used for clustering structured data. Here, we use the hclust algorithm. We set the number of clusters to 3, as we had articles picked up from three different Zones.

distMatrix <- dist(articles_mat, method="euclidean")
groups <- hclust(distMatrix,method="ward.D")
plot(groups, cex=0.9)
rect.hclust(groups, k=3)

Image title

The cluster plot, known as a dendrogram, correctly groups Documents 1 and 2 to one group, Documents 4 and 5 as another group, and Document 3 in a separate group. The code does a fair job of grouping the articles based on the words used in those pieces. The code and data used for this illustration are on my GitHub.

This also proves that DZone editors are doing a fabulous job of assigning the right piece to the correct Zone for us to quickly skim through the knowledge nuggets. Thank you, Sarah Davis (DZone AI and Big Data Zone Editor who spends the best part of her day to read through, assign zone, schedule, and give the final touch to each article submitted for us to learn something new daily).

Categories: Machine Learning
Prev Post
Next Post

Add your Comment

Recent Posts

  • Insights on Data Science Automation for Big Data and IoT Environments
  • The Changing Landscape: Data Science Trends
  • Streamline the Machine Learning Process Using Apache Spark ML Pipelines
  • Dive Deep Into Deep Learning
  • CEP Patterns for Stream Analytics

Recent Comments

    Archives

    • June 2020

    Post Categories

    • Business Analytics
    • Machine Learning
    • Popular Content

    Meta

    • Log in
    • Entries feed
    • Comments feed
    • WordPress.org

    About AnalyticKast

    Author

    Our goal is to provide easy access to people on data technologies related information to thrive in this digital economy.

    Latest Posts

    Insights on Data Science Automation for Big Data and IoT Environments

    Insights on Data Science Automation for Big Data and IoT Environments

    June 30, 2020

    The Changing Landscape: Data Science Trends

    June 30, 2020

    Streamline the Machine Learning Process Using Apache Spark ML Pipelines

    June 30, 2020

    About Analytickast

    One-stop knowledge services platform that supports individuals connect the dots between technologies and management to build data products. Our goal is to provide easy access to people on data technologies related information to thrive in this digital economy.

    Blogs

    • Business Analytics
    • Machine Learning
    • Popular Content

    Quick Links

    • Home
    • Blog
    • Coaching
    • Shop Now
    • Contact Us
    • My account
    • Log In

    Our Videos

    All Rights Reserved © 2020. - www.analytickast.com .

    • Privacy Policy
    • Legal Disclaimer
    • Terms of Use

    Insert/edit link

    Enter the destination URL

    Or link to existing content

      No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.