By some estimates, around 80% of the world's data is unstructured. This includes audio, video, and text data. In this piece, we will focus our discussion on text data only. Later in the series, we will shift to other unstructured data.
The books, blogs, news articles, web pages, e-mail messages, etc. that we read are all text data. All these texts provide us with masses of information, and the volume keeps growing. However, not all of this data is useful. We filter out the noise and keep only the information that is important. This is a tedious process, but as humans we need to do it to stay informed, and reading is an essential tool for that. As the world moves towards smart machines, the ability to process information from unstructured data becomes a must for them as well. Mining information from enormous volumes of text is therefore necessary for both humans and smart machines. Text mining provides methods to extract, summarize, and analyze useful information from unstructured data to derive new insights.
Text mining can be used for various tasks. Below are a few topics that will be covered further in our series:
- Topic modeling
- Document clustering
- Document categorization
- Text summarization
This post focuses on topic modeling. In subsequent posts, we will dive into other tasks.
A text file can come in various formats like PDF, DOC, HTML, etc. The first step is to convert these documents into a readable text format. Next, a corpus must be created. A corpus is simply a collection of one or more documents. When we create a corpus in R with the tm package, the text is held in a structure that is available for further processing.
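# load the tm package, which provides Corpus() and VectorSource()
library(tm)
# set working directory (modify path as needed)
# load files into corpus
# get listing of .txt files in directory (pattern is a regular expression)
filenames <- list.files(getwd(), pattern = "\\.txt$")
# read each file into one string, giving a character vector with one element per file
files <- sapply(filenames, function(f) paste(readLines(f), collapse = " "))
# create corpus from vector
articles.corpus <- Corpus(VectorSource(files))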
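If you want to try the corpus step without any files on disk, the following is a minimal sketch that builds a corpus from strings held in memory. The two sentences and the name `toy.corpus` are made up for illustration and are not part of the original dataset; `VCorpus` is used here as one of the corpus constructors the tm package offers.

```r
library(tm)

# Two short made-up documents standing in for the contents of .txt files
docs <- c("Text mining extracts information from unstructured text.",
          "Topic modeling finds the themes in a collection of documents.")

# Build a corpus with one document per element of the vector
toy.corpus <- VCorpus(VectorSource(docs))

# Number of documents in the corpus
print(length(toy.corpus))

# Text of the first document
print(as.character(toy.corpus[[1]]))
```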
Next, we need to preprocess the text to convert it into a format that can be processed for extracting information. It is essential to reduce the size of the feature space before analyzing the text. There are various preprocessing methods that we can use here, such as stop word removal, case folding, stemming, lemmatization, and contraction simplification. However, it is not necessary to apply all of the normalization methods to the text. It depends on the data we retrieve and the kind of analysis to be performed.
# make each letter lowercase (content_transformer wraps base functions for use with tm_map)
articles.corpus <- tm_map(articles.corpus, content_transformer(tolower))
# remove punctuation
articles.corpus <- tm_map(articles.corpus, removePunctuation)
# remove numbers
articles.corpus <- tm_map(articles.corpus, removeNumbers)
# remove generic and custom stopwords
stopword <- c(stopwords('english'), "best")
articles.corpus <- tm_map(articles.corpus, removeWords, stopword)
# reduce words to their stems (requires the SnowballC package)
articles.corpus <- tm_map(articles.corpus, stemDocument)
Below is a short description of preprocessing methods we applied to reduce the feature space of our dataset:
- Punctuation removal: Various punctuation marks such as +, –, and ~ were removed.
- Stop word removal: Stop words, such as common and short function words, are filtered out for the effective analysis of the data. The standard English stop word list provided by the tm package (stopwords('english')) was used along with a custom list of words, which here contains only "best". We can also provide other words from our text that we feel are not relevant to our analysis.
- Case folding: Case folding converts all upper-case letters to lower-case letters.
- Stemming: Stemming is the process of reducing the modified or derived words to their root form. For example, working and worked are stemmed to work.
- Number removal: For some text mining activities, numbers are not essential. For example, in the case of topic modeling, we are concerned with finding the essential words that describe our corpus. In such cases, we can remove numbers. However, in other cases, for example topic modeling on a corpus of financial statements, numbers might add substance and should be kept.
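To see what these steps do to the text, here is a minimal base R sketch applied to a single made-up sentence. The sentence, the tiny stop word list, and the variable names are assumptions for illustration only; the tm functions used earlier do the same kind of work on a whole corpus. Stemming is left out because it needs a stemmer such as the one behind stemDocument.

```r
# A made-up sentence, used only for illustration
text <- "The 3 Best models were working well, and they worked in 2019!"

# Case folding: convert every letter to lowercase
text <- tolower(text)

# Punctuation removal: drop every punctuation character
text <- gsub("[[:punct:]]", "", text)

# Number removal: drop every digit
text <- gsub("[[:digit:]]", "", text)

# Split on whitespace to get individual tokens, discarding empty strings
tokens <- strsplit(text, "\\s+")[[1]]
tokens <- tokens[tokens != ""]

# Stop word removal with a tiny hypothetical list (tm's list is much longer)
stop_words <- c("the", "were", "and", "they", "in", "best")
tokens <- tokens[!tokens %in% stop_words]

print(tokens)
```

After these steps only the tokens models, working, well, and worked remain. A stemmer would then map working and worked to the same stem, which shrinks the feature space further.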
The next step is to create a document-term matrix (DTM). This step matters because the text must ultimately be converted into a numeric form before it can be analyzed. The DTM contains the number of term occurrences per document. The rows in a DTM represent the documents, and each term that appears in the corpus is represented as a column. After converting the corpus into the document-term matrix, we also remove sparse terms, that is, terms that appear in very few documents.
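# build the DTM, keeping only terms with at least 3 characters
articleDtm <- DocumentTermMatrix(articles.corpus, control = list(wordLengths = c(3, Inf)))
# drop terms that are absent from nearly all documents (sparsity threshold of 0.98)
articleDtm2 <- removeSparseTerms(articleDtm, sparse = 0.98)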
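To make the shape of a DTM concrete, the sketch below builds one by hand in base R for two toy documents. The documents and variable names are invented for illustration; in practice DocumentTermMatrix does this work and stores the result in a sparse format. The last step computes, for each term, the fraction of documents that do not contain it, which is the kind of sparsity that removeSparseTerms compares against its threshold.

```r
# Two toy documents, already preprocessed (illustrative only)
docs <- c(doc1 = "data science machine learning data",
          doc2 = "machine learning models")

# Tokenize each document on whitespace
tokens <- strsplit(docs, "\\s+")

# Vocabulary: every distinct term, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document,
# then transpose so that rows are documents and columns are terms
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)

# Fraction of documents in which each term is absent (its sparsity)
sparsity <- colMeans(dtm == 0)
print(sparsity)
```

Here "data" appears twice in the first document and not at all in the second, so its row entries are 2 and 0 and its sparsity is 0.5.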
Topic modeling is about finding essential words/terms in a collection of documents that best represent the collection. Latent Dirichlet allocation (LDA) is a widely used topic modeling technique. You can learn more about LDA here and here.
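library(topicmodels)  # provides LDA(), topics(), and terms()
k <- 5        # number of topics
SEED <- 1234  # seed so the Gibbs sampler gives repeatable results
article.lda <- LDA(articleDtm2, k, method = "Gibbs", control = list(seed = SEED))
# most likely topic for each document
lda.topics <- as.matrix(topics(article.lda))
# most likely term for each topic
lda.terms <- terms(article.lda)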
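If you would like to see what a fitted model looks like before running it on your own files, the sketch below fits a small LDA model on the AssociatedPress document-term matrix that ships with the topicmodels package. The choice of 20 documents, 2 topics, and 5 terms per topic is arbitrary and only keeps the example fast; the names `ap_small`, `ap_lda`, `top_terms`, `doc_topics`, and `topic_probs` are hypothetical.

```r
library(topicmodels)

# Example document-term matrix shipped with the topicmodels package
data("AssociatedPress", package = "topicmodels")

# Use a small subset so the example runs quickly
ap_small <- AssociatedPress[1:20, ]

# Fit a 2-topic model with Gibbs sampling; the seed makes the run repeatable
ap_lda <- LDA(ap_small, k = 2, method = "Gibbs", control = list(seed = 1234))

# Top 5 terms for each topic (a 5 x 2 character matrix)
top_terms <- terms(ap_lda, 5)
print(top_terms)

# Most likely topic for each document
doc_topics <- topics(ap_lda)

# Per-document topic proportions; each row sums to 1
topic_probs <- posterior(ap_lda)$topics
print(round(head(topic_probs), 3))
```

Looking at several terms per topic, rather than only the single most likely one, usually makes it easier to judge what a topic is about.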
The results show that the topics for these two documents are concentrated in the area of machine learning and data science. This matches our expectation, since I used my two previous articles on AI and data science for this exercise.
You can find the dataset and code on my GitHub.