SCRO / Journées de l'optimisation
HEC Montréal, 2931 mai 2023
CORSJOPT2023
HEC Montréal, 29 — 31 mai 2023
CLG Clustering
31 mai 2023 13h30 – 15h10
Salle: Xerox Canada (jaune)
Présidée par John Chinneck
4 présentations

13h30  13h55
A BranchandCut Method for Accurate Clustering of Networked Data Through Maximizing Modularity
Community detection, or clustering networked data, is a classic problem in network science with extensive applications in various fields. Common methods are heuristics designed to maximize modularity across different ways that the nodes of a network can be partitioned into communities. Most heuristics return partitions with no guarantee of proximity to an optimal solution. Moreover, their suboptimal partitions are disproportionally dissimilar to any optimal partition.
Given these methodological issues, we propose the Bayan algorithm: a community detection method which, unlike the existing methods, returns network partitions with a guarantee of either optimality or proximity to an optimal solution. At the core of this algorithm is a branchandcut scheme that solves an Integer Programming (IP) formulation of the modularity maximization problem faster than commercial IP solvers.
We assess Bayan and other algorithms and measure their accuracy in returning planted groundtruth partitions on synthetic data. Our results on accuracy ranks of 23 algorithms on 500 benchmark graphs show Bayan's unique capabilities not only in maximizing modularity, but more importantly in accurate retrieval of groundtruth communities. This points to Bayan as a suitable choice for a methodologically sound usage of modularity for discovering communities in networks with up to 3000 edges.

13h55  14h20
BiogeographyBased Optimization (BBO) for Dimensionality Reduction
Kmeans is the most popular data clustering algorithm due to its efficiency and simplicity of implementation. Kmeans has succeeded in many applications, including bioinformatics, pattern recognition, image processing, document clustering, and recommendation systems. However, Kmeans has some limitations, which may affect its effectiveness, such as all the features having the same degree of importance. To address these limitations and improve Kmeans accuracy, we adopt a methodology that relies on the BiogeographyBased Optimization (BBO) algorithm to select the most relevant features of datasets. Our primary idea is to reduce the intracluster distance while increasing the distance between clusters. We adopt the DaviesBouldin index using several datasets to compare the proposed evolutionarybased clustering algorithm to the baseline Kmean. Moreover, we conducted additional experiments comparing our proposed approach to other methods based on PCA and natureinspired techniques. We use elbow and silhouette metrics in addition to DaviesBouldin. The results demonstrate the effectiveness of our proposed algorithm in selecting the most relevant features..
Keywords: Feature Selection, Data Clustering, Kmeans, Evolutionary Algorithms

14h20  14h45
Heuristic Modularity Maximization Algorithms for Community Detection Rarely Return an Optimal Partition or Anything Similar
Community detection is a classic graph problem with many applications. It is usually tackled using heuristics which are designed to maximize an objective function, modularity, over different partitions of the graph nodes into communities. Using 80 real and random networks from a wide range of contexts, we investigate the extent to which current heuristic modularity maximization algorithms succeed in returning modularitymaximum (optimal) partitions. We evaluate (1) the ratio of their output modularity to the maximum modularity for each input graph and (2) the maximum similarity between their output partition and any optimal partition of that graph. Our computational experiments involve eight existing heuristic algorithms which we compare against an exact integer programming method that globally maximizes modularity. The average modularitybased heuristic algorithm returns optimal partitions for only 16.9% of the 80 graphs considered. Results on adjusted mutual information show considerable dissimilarity between the suboptimal partitions and any optimal partitions of the graphs in our experiments. More importantly, our results show that nearoptimal partitions tend to be disproportionally dissimilar to any optimal partition. Taken together, our analysis points to a crucial limitation of commonly used modularitybased algorithms for discovering communities: they rarely return an optimal partition or a partition resembling an optimal partition. Given this finding, developing an exact or approximate algorithm for modularity maximization is recommendable for a more methodologically sound usage of modularity in community detection.

14h45  15h10
Fitting Hyperplanes to Minimize the Quantile Error in Noisy Data Sets
Given a noisy set of data points (general fit) or a set of data points and an outcome variable (regression fit), contaminated with outliers, one can try to fit a hyperplane that minimizes the maximum error for some quantile of the data points, often 50% of the points. This is a difficult problem in statistics. MixedInteger Optimization (MIO) algorithms have been developed in recent years. These do not scale well but can provide the optimum hyperplane for small data sets. We develop (i) heuristic methods that are fast, and especially effective for general fits, and (ii) an improved MIO formulation that is more scalable than the existing MIO formulation, and more effective, especially in combination with initialization heuristics. Since MIO solvers often reach the optimum solution early and spend a great deal of additional time proving that the optimum has been reached, we examine the effective of halting early. And since it is sometimes assumed that a quantileerror minimizing fit is a good fit to the complete set of inliers in a data set, we examine whether this is actually true.