CORS / Optimization Days
HEC Montréal, May 29-31, 2023
CLG Clustering
May 31, 2023 01:30 PM – 03:10 PM
Location: Xerox Canada (yellow)
Chaired by John Chinneck
4 Presentations
-
01:30 PM - 01:55 PM
A Branch-and-Cut Method for Accurate Clustering of Networked Data Through Maximizing Modularity
Community detection, or clustering networked data, is a classic problem in network science with extensive applications in various fields. Common methods are heuristics designed to maximize modularity over the different ways in which the nodes of a network can be partitioned into communities. Most heuristics return partitions with no guarantee of proximity to an optimal solution. Moreover, their sub-optimal partitions are disproportionately dissimilar to any optimal partition.
Given these methodological issues, we propose the Bayan algorithm: a community detection method which, unlike the existing methods, returns network partitions with a guarantee of either optimality or proximity to an optimal solution. At the core of this algorithm is a branch-and-cut scheme that solves an Integer Programming (IP) formulation of the modularity maximization problem faster than commercial IP solvers.
We assess Bayan and other algorithms and measure their accuracy in returning planted ground-truth partitions on synthetic data. Our results on accuracy ranks of 23 algorithms on 500 benchmark graphs show Bayan's unique capabilities not only in maximizing modularity, but more importantly in accurate retrieval of ground-truth communities. This points to Bayan as a suitable choice for a methodologically sound usage of modularity for discovering communities in networks with up to 3000 edges.
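For context, the modularity objective being maximized, and a standard integer-programming formulation of the problem via pairwise same-community variables and triangle inequalities, can be sketched as follows; this is the generic clique-partitioning model, not necessarily the exact formulation and cuts used by Bayan.

    % Modularity of a partition: A is the adjacency matrix, k_i the degree of
    % node i, m the number of edges, and c_i the community of node i.
    Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij}-\frac{k_i k_j}{2m}\right)\delta(c_i, c_j)

    % Generic IP for maximizing Q (up to an additive constant), with binary
    % x_{ij} = 1 iff nodes i and j are placed in the same community:
    \max \ \sum_{i<j} \frac{1}{m}\left(A_{ij}-\frac{k_i k_j}{2m}\right) x_{ij}
    \text{s.t.}\ \ x_{ij} + x_{jk} - x_{ik} \le 1 \quad \text{for all distinct } i, j, k, \qquad x_{ij} \in \{0,1\}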
-
01:55 PM - 02:20 PM
Biogeography-Based Optimization (BBO) for Dimensionality Reduction
K-means is the most popular data clustering algorithm due to its efficiency and simplicity of implementation. It has succeeded in many applications, including bioinformatics, pattern recognition, image processing, document clustering, and recommendation systems. However, K-means has some limitations that may affect its effectiveness, such as treating all features as equally important. To address these limitations and improve K-means accuracy, we adopt a methodology that relies on the Biogeography-Based Optimization (BBO) algorithm to select the most relevant features of a dataset. Our primary idea is to reduce the intra-cluster distance while increasing the distance between clusters. Using several datasets, we compare the proposed evolutionary clustering algorithm to baseline K-means with the Davies-Bouldin index. Moreover, we conduct additional experiments comparing our proposed approach to other methods based on PCA and nature-inspired techniques, using the elbow and silhouette metrics in addition to Davies-Bouldin. The results demonstrate the effectiveness of our proposed algorithm in selecting the most relevant features.
Keywords: Feature Selection, Data Clustering, K-means, Evolutionary Algorithms
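As a rough illustration of the evaluation step such a wrapper approach relies on, the sketch below clusters a candidate feature subset with K-means and scores it with the Davies-Bouldin index (lower is better); the BBO search itself is omitted, and the dataset and masks are illustrative rather than taken from the talk.

    # Sketch: score a candidate feature subset by clustering it with K-means
    # and computing the Davies-Bouldin index (lower is better). A wrapper
    # metaheuristic such as BBO would search over binary masks like `mask`.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris
    from sklearn.metrics import davies_bouldin_score

    def subset_score(X, mask, k):
        """Davies-Bouldin score of K-means restricted to the selected features."""
        if not mask.any():
            return np.inf  # an empty feature subset is invalid
        Xs = X[:, mask]
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
        return davies_bouldin_score(Xs, labels)

    X = load_iris().data                              # illustrative dataset
    all_features = np.ones(X.shape[1], dtype=bool)
    candidate = np.array([True, False, True, True])   # one candidate subset
    print("all features:", subset_score(X, all_features, k=3))
    print("candidate   :", subset_score(X, candidate, k=3))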
-
02:20 PM - 02:45 PM
Heuristic Modularity Maximization Algorithms for Community Detection Rarely Return an Optimal Partition or Anything Similar
Community detection is a classic graph problem with many applications. It is usually tackled using heuristics designed to maximize an objective function, modularity, over the different partitions of the graph nodes into communities. Using 80 real and random networks from a wide range of contexts, we investigate the extent to which current heuristic modularity maximization algorithms succeed in returning modularity-maximum (optimal) partitions. We evaluate (1) the ratio of their output modularity to the maximum modularity of each input graph and (2) the maximum similarity between their output partition and any optimal partition of that graph. Our computational experiments involve eight existing heuristic algorithms, which we compare against an exact integer programming method that globally maximizes modularity. On average, the heuristic algorithms return optimal partitions for only 16.9% of the 80 graphs considered. Results on adjusted mutual information show considerable dissimilarity between the sub-optimal partitions and any optimal partition of the graphs in our experiments. More importantly, our results show that near-optimal partitions tend to be disproportionately dissimilar to any optimal partition. Taken together, our analysis points to a crucial limitation of commonly used modularity-based algorithms for discovering communities: they rarely return an optimal partition or a partition resembling an optimal partition. Given this finding, developing an exact or approximate algorithm for modularity maximization is recommended for a more methodologically sound use of modularity in community detection.
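To make the two evaluation measures concrete, the sketch below computes a modularity ratio and the adjusted mutual information between a heuristic partition and a reference partition, using off-the-shelf networkx heuristics; the reference here is only a stand-in for the globally optimal partitions obtained from the exact IP method in the study.

    # Sketch: compare a heuristic partition against a reference partition by
    # (1) the ratio of their modularities and (2) adjusted mutual information.
    # The "reference" below is just another heuristic partition; in the study
    # it would be a globally optimal partition from an exact IP solver.
    import networkx as nx
    from networkx.algorithms.community import (
        greedy_modularity_communities, louvain_communities, modularity)
    from sklearn.metrics import adjusted_mutual_info_score

    def labels(G, communities):
        """Flatten a list of communities into per-node community labels."""
        lab = {v: cid for cid, com in enumerate(communities) for v in com}
        return [lab[v] for v in G.nodes()]

    G = nx.karate_club_graph()
    heuristic = greedy_modularity_communities(G)       # partition under test
    reference = louvain_communities(G, seed=0)          # stand-in "optimum"

    ratio = modularity(G, heuristic) / modularity(G, reference)
    ami = adjusted_mutual_info_score(labels(G, reference), labels(G, heuristic))
    print(f"modularity ratio: {ratio:.3f}  AMI: {ami:.3f}")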
-
02:45 PM - 03:10 PM
Fitting Hyperplanes to Minimize the Quantile Error in Noisy Data Sets
Given a noisy set of data points (general fit), or a set of data points and an outcome variable (regression fit), contaminated with outliers, one can try to fit a hyperplane that minimizes the maximum error over some quantile of the data points, often 50% of the points. This is a difficult problem in statistics. Mixed-Integer Optimization (MIO) algorithms have been developed for it in recent years; these do not scale well but can provide the optimum hyperplane for small data sets. We develop (i) heuristic methods that are fast and especially effective for general fits, and (ii) an improved MIO formulation that is more scalable than the existing MIO formulation and more effective, especially in combination with initialization heuristics. Since MIO solvers often reach the optimum solution early and spend a great deal of additional time proving that the optimum has been reached, we examine the effectiveness of halting early. And since it is sometimes assumed that a quantile-error-minimizing fit is a good fit to the complete set of inliers in a data set, we examine whether this is actually true.
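As a rough illustration of the kind of heuristic such a problem invites, the sketch below repeatedly refits a least-squares hyperplane on the best-fitting fraction q of the points and reports the q-quantile of the absolute residuals; this trimmed-refitting loop is only a generic stand-in, not the heuristics or MIO formulation presented in the talk.

    # Sketch: a generic trimmed-refitting heuristic for a regression fit whose
    # q-quantile absolute residual is small. Illustrative only; it is not the
    # talk's heuristic or MIO formulation.
    import numpy as np

    def quantile_fit(X, y, q=0.5, iters=20, seed=0):
        """Refit least squares on the fraction q of points with the smallest
        absolute residuals until the retained subset stabilizes."""
        rng = np.random.default_rng(seed)
        n = len(y)
        A = np.column_stack([X, np.ones(n)])          # add an intercept column
        m = max(int(q * n), A.shape[1])               # size of retained subset
        keep = rng.choice(n, size=m, replace=False)   # random starting subset
        for _ in range(iters):
            beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
            resid = np.abs(y - A @ beta)
            new_keep = np.argsort(resid)[:m]
            if set(new_keep) == set(keep):
                break
            keep = new_keep
        return beta, np.quantile(np.abs(y - A @ beta), q)

    # Synthetic data: points near a plane, with 30% gross outliers mixed in.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([2.0, -1.0]) + 0.5 + 0.05 * rng.normal(size=200)
    y[:60] += rng.normal(10.0, 5.0, size=60)
    beta, qerr = quantile_fit(X, y, q=0.5)
    print("coefficients:", beta, " 50% quantile error:", qerr)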