Deciphering Clusters With a Deterministic Measure of Clustering Tendency

Alec F. Diallo, Paul Patras

Research output: Contribution to journal › Article › peer-review

Abstract

Clustering, a key aspect of exploratory data analysis, plays a crucial role in various fields such as information retrieval. Yet, the sheer volume and variety of available clustering algorithms hinder their application to specific tasks, especially given their propensity to enforce partitions, even when no clear clusters exist, often leading to fruitless efforts and erroneous conclusions. This issue highlights the importance of accurately assessing clustering tendencies prior to clustering. However, existing methods either rely on subjective visual assessment, which hinders automation of downstream tasks, or on correlations between subsets of target datasets and random distributions, limiting their practical use. Therefore, we introduce the Proximal Homogeneity Index (PHI), a novel and deterministic statistic that reliably assesses the clustering tendencies of datasets by analyzing their internal structures via knowledge graphs. Leveraging PHI and the boundaries between clusters, we establish the Partitioning Sensitivity Index (PSI), a new statistic designed for cluster quality assessment and optimal clustering identification. Comparative studies using twelve synthetic and real-world datasets demonstrate PHI and PSI's superiority over existing metrics for clustering tendency assessment and cluster validation. Furthermore, we demonstrate the scalability of PHI to large and high-dimensional datasets, and PSI's broad effectiveness across diverse cluster analysis tasks.
Original language: English
Pages (from-to): 1489-1501
Number of pages: 13
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 36
Issue number: 4
Early online date: 23 Aug 2023
Publication status: Published - 8 Mar 2024

Keywords / Materials (for Non-textual outputs)

  • data homogeneity
  • clustering tendency assessment
  • cluster analysis
  • knowledge graphs
  • knowledge representation
  • dimensionality reduction
  • exploratory data analysis
