Deciphering K-Means Algorithm

This article is an excerpt from a book and looks at how to decipher the k-means algorithm.

K-means is a simple algorithm that divides a data set of n objects into k partitions, where k ≤ n. The data set is partitioned into homogeneous groups whose members share similar characteristics. Similarity or dissimilarity is measured by the distance between each data point and the cluster centroids. The clusters are formed when the algorithm's optimization function achieves its objective: small intracluster distances and large intercluster distances.
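
The article does not write the optimization function out. As a point of reference, the usual textbook formulation is the within-cluster sum of squares, where C_j is the set of points in cluster j and \mu_j is its centroid (both symbols are notation introduced here, not taken from the book):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Because the total sum of squares of a data set is fixed, making J (the intracluster part) smaller makes the between-cluster part larger by the same amount, which matches the objective described above.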

The following steps are involved in traditional k-means clustering:

  1. Specify the number of clusters and the metric used to calculate distance.
  2. Randomly pick one initial centroid per cluster.
  3. For each data point:
    • Calculate the distance between the data point and each centroid.
    • Assign the data point to the cluster with the nearest centroid.
  4. Calculate the new centroid for each cluster.
  5. Repeat steps 3 and 4 until the centroids stop changing.
  6. Terminate when the centroids remain unchanged across successive cluster reassignments.
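
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation from the book. The function name `kmeans` and the `scores` data are hypothetical (the article's actual scores appear only in its figures), and the first k records are used as initial centroids to mirror the manual walkthrough that follows.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of (x, y) tuples; returns (labels, centroids)."""
    # Step 2: take the first k records as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Steps 5 and 6: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical (history, geography) scores for five students.
scores = [(30, 35), (80, 85), (35, 30), (85, 90), (25, 40)]
labels, centroids = kmeans(scores, k=2)
print(labels)     # [0, 1, 0, 1, 0]
print(centroids)  # [(30.0, 35.0), (82.5, 87.5)]
```

On this made-up data the loop stops on the second pass, because the assignments computed from the updated centroids match the previous ones.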

Consider the example in the figure below, where we want to cluster five students
into two groups based on their history and geography scores.

As this is a two-dimensional data set, we plot the history and geography scores for
each student on the x- and y-axes, respectively. Two prominent clusters are visible
in the graph. Let’s run this data through the k-means algorithm and
see which clusters it finds.

STEP 1: Specify the number of clusters and the metric used to calculate distance.
As we intend to find two clusters in our sample data set, we set the number of clusters (k) to 2. Assuming the marks are in the same unit and scale, we choose Euclidean distance as the metric. The Euclidean distance between a point (x1, y1) and a centroid (cx, cy) is given by the following:

$$\mathrm{dist}\big((x_1, y_1), (c_x, c_y)\big) = \sqrt{(x_1 - c_x)^2 + (y_1 - c_y)^2}$$
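
As a quick check of the distance formula, here is a small Python sketch. The function name `euclidean` is chosen here for illustration, and the 3-4-5 triangle is just a convenient test input, not data from the article.

```python
import math

def euclidean(point, centroid):
    """Euclidean distance between a data point and a centroid, both 2-D."""
    (x1, y1), (cx, cy) = point, centroid
    return math.sqrt((x1 - cx) ** 2 + (y1 - cy) ** 2)

print(euclidean((0, 0), (3, 4)))  # 5.0
```

In Python 3.8 and later, the standard library's `math.dist` computes the same quantity for points of any dimension.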

STEP 2: The algorithm finds the initial centroids.

As we are doing the k-means calculation manually, we arbitrarily choose the first two
records as the initial centroids.
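
For comparison, the sketch below shows both ways of seeding the algorithm: taking the first k records, as in this manual walkthrough, and drawing k distinct records at random, as in step 2 of the general procedure. The helper names and the `scores` list are hypothetical; the article's real data appears only in its figures.

```python
import random

def first_k(points, k):
    """Use the first k records as initial centroids (the manual walkthrough)."""
    return list(points[:k])

def random_k(points, k, seed=None):
    """Draw k distinct records at random; a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Hypothetical (history, geography) scores for five students.
scores = [(30, 35), (80, 85), (35, 30), (85, 90), (25, 40)]
print(first_k(scores, 2))  # [(30, 35), (80, 85)]
picked = random_k(scores, 2, seed=0)
```

The choice matters in practice: different starting centroids can lead k-means to different final clusters, so a fixed seed is useful when results need to be reproducible.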

STEP 3: Calculate the distance between each centroid and each data point.

Using the Euclidean formula from Step 1, we calculate the distance of each
data point from each cluster centroid. Each data point is then assigned to the
cluster whose centroid is closest.
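
The assignment step might look like the following in Python. This is a sketch under assumptions: the `assign` function, the `scores` list, and the two starting centroids are all hypothetical stand-ins for the values shown in the article's figures.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        # On a tie, index() picks the lowest-numbered cluster.
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical scores; the first two records act as the initial centroids.
scores = [(30, 35), (80, 85), (35, 30), (85, 90), (25, 40)]
initial_centroids = scores[:2]
labels = assign(scores, initial_centroids)
print(labels)  # [0, 1, 0, 1, 0]
```

Each record that serves as a centroid is at distance 0 from itself, so it always lands in its own cluster on this first pass.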

STEP 4: The new centroid/mean for each cluster is recalculated.
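
Recalculating a centroid amounts to averaging the coordinates of the points currently assigned to that cluster. The sketch below illustrates this with hypothetical data and a hypothetical helper name, `update_centroids`. Raising an error on an empty cluster is one possible design choice made for this example, not something the article prescribes.

```python
def update_centroids(points, labels, k):
    """Return the mean (x, y) of the points assigned to each of the k clusters."""
    centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            raise ValueError(f"cluster {j} has no members")
        n = len(members)
        mean_x = sum(x for x, _ in members) / n
        mean_y = sum(y for _, y in members) / n
        centroids.append((mean_x, mean_y))
    return centroids

# Hypothetical scores and the cluster labels from a first assignment pass.
scores = [(30, 35), (80, 85), (35, 30), (85, 90), (25, 40)]
labels = [0, 1, 0, 1, 0]
new_centroids = update_centroids(scores, labels, k=2)
print(new_centroids)  # [(30.0, 35.0), (82.5, 87.5)]
```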

STEP 5: The distances between the data points and the cluster centroids are calculated again. As the cluster assignments don’t change, we have our two final clusters. If we compare the clusters in the earlier plot with those in the figure below, we find that the k-means algorithm has discovered the same clusters.

This post, which explains the concepts of k-means clustering, is an excerpt from my book Data Science Using Oracle Data Miner and Oracle R Enterprise.

Categories: Machine Learning