For those interested in analytics, data clustering is an important concept that will almost certainly play a significant role in a potential career path.
Clustering in data mining groups data points into clusters based on shared characteristics. Placing similar data points in the same group helps users better understand the structure of a data set.
Data clustering is considered one of the key strategies in data mining. For example, in marketing, researchers can cluster a company’s client base into different subgroups based on similarities such as age, location, and frequency of purchases. This allows for more focused targeting of marketing messages.
Types of Clustering
There are a variety of approaches to clustering in data mining. Typically, they fall into one of these major categories.
K-Means Clustering- This is a popular method because it can be learned quickly and works well with large datasets. It starts by placing a chosen number of cluster centers (centroids) at random within the data set, assigns each data point to its nearest centroid, and then moves each centroid to the average of the points assigned to it. These two steps repeat until the centroids barely move between rounds. The drawbacks of this method include having to know in advance how many clusters there are in the data. Also, results can vary depending on where the initial centroids are placed.
Mean Shift Clustering- This method works out the number of clusters on its own by shifting candidate centers toward the densest regions of the data. Unlike K-Means, it can handle clusters of different shapes. However, it is a far slower method.
Expectation-Maximization- As with K-Means, you must set the number of clusters beforehand. Unlike K-Means, this method models each cluster as a Gaussian distribution, so it can produce either hard clustering (assigning each data point to one cluster) or soft clustering (giving each data point a probability of belonging to more than one cluster).
Agglomerative Hierarchical Clustering- This is a “bottom-up” method that starts with every data point as its own cluster and gradually merges the most similar clusters. Eventually, all data points reside in a single cluster, and the analyst picks the level of merging that gives useful groupings. The drawback is that this method is slow and scales poorly to large datasets.
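To make the K-Means description above more concrete, here is a minimal sketch in plain Python. It is an illustration only, not a production implementation: the function names (squared_distance, mean_point, k_means), the max_iters and seed parameters, and the six sample points are all assumptions made up for this example. It follows the usual assign-then-update loop and stops when the centroids no longer move.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples.

    k must be chosen in advance, which is the drawback noted above.
    """
    rng = random.Random(seed)
    # Start from k randomly chosen data points; a different seed can
    # lead to a different final grouping.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Changing the seed changes which points serve as the starting centroids. On data less cleanly separated than this sample, that can change the final clusters, which is why K-Means is often run several times with different starting positions.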
Why Are Clusters Important to Healthcare?
Data clusters are important as they can uncover hidden trends or patterns within large data sets. However, it is an approach that is “relatively underutilized” at this point in healthcare, according to an editorial from the Journal of Mental Health.
The editorial argues that in clinical populations, clustering can help uncover the heterogeneity that exists in patient characteristics, illness severity and treatment responses. Understanding these differences among patients can lead to efficient, effective healthcare that personalizes treatment to match a patient’s profile.
Others have looked at ways to use clustering in healthcare data mining. One study, written by researchers with Novartis, focused on healthcare claims, an area where clustering in data mining has not been widely used because the “distribution of expenditure data is commonly severely skewed,” according to the report.
Researchers focused specifically on cost change patterns for patients with end-stage renal disease who initiated hemodialysis. They were able to cluster and identify cost patterns among similar patients, such as those with increasing comorbidity scores (a measure of how many chronic conditions a patient has at the same time).
How Can Clustering Improve Treatment?
As the Journal of Mental Health editorial argued, clustering can identify characteristics that allow researchers to group patients with similar conditions, diseases, or patient profiles.
The editorial’s authors used depression as an example. Mental health professionals already know that there is heterogeneity among those with depression based on age at the onset of depression, exposure to stress, and the severity of the depression (mild, moderate, or severe).
Identifying subgroups within the patient population could bring benefits that include the development of diagnostic criteria, explanations of heterogeneous outcomes, and better tailoring of treatment for patients within the various subgroups.
Researchers from the Bangladesh University of Engineering and Technology also wrote that clustering could help identify the likelihood of diseases among certain patient populations. By using K-Means clustering and relevant medical background information, they argue it’s possible to anticipate the development of disease or medical conditions in certain patient subgroups.
Clustering in data mining, if used properly, may provide those working in healthcare analytics with another method for personalizing treatment and possibly anticipating medical problems in specific patient populations.