New algorithmic frameworks for geometric optimization: Clustering, matching, and fitting
Geometric optimization problems ubiquitously exist in many real-world applications. Driven by the fast growing data size and other realistic constraints, geometric optimization problems are now far more challenging and complicated than their basic forms. This makes existing techniques no longer applicable. In this thesis, we study three fundamental geometric optimization problems, clustering , matching , and fitting . Despite extensive previous studies on these problems, significant gaps still exist between the current state-of-the-arts techniques and their desired solutions. This issue could be further worsened by a number of new requirements arising in various applications, such as robustness, uncertainty, privacy preserving, etc. To overcome these new challenges, we present several algorithmic frameworks to solve many variants of the three geometric optimization problems. Clustering We consider a class of constrained clustering problems of points in R d space, where d could be rather high. A common feature of these problems is that their optimal clusterings no long have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present a unified framework, called Peeling-and-Enclosing , to iteratively solve two variants of the constrained clustering problems, constrained k-means clustering ( k- CMeans) and constrained k-median clustering ( k- CMedian). Our framework is based on two standalone geometric techniques, called Simplex Lemma and Weaker Simplex Lemma , for k- CMeans and k- CMedian, respectively. Simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. With these techniques, our framework generates, in nearly linear time ( i.e., O ( n (log n ) k +1 d )), O ((log n ) k ) k- tuple candidates for the k mean or median points, and one of them induces a 1 + ε)-approximation for k- CMeans or k- CMedian, where n is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best k- tuple candidate), we obtain a (1 + ε)-approximation for each of the constrained clustering problems. Our framework improves considerably the best known results for these problems. We expect that our technique will be applicable to other constrained clustering problems without locality. Matching: We consider the problem (denoted as EMDRT) of minimizing the earth mover's distance between two sets of weighted points A and B in a fixed dimensional R d space under rigid transformation. EMDRT is an important problem in both theory and applications and has received considerable attentions in recent years. Previous research on this problem has resulted in only constant approximations and it has been an open problem for a long time to achieve PTAS solution. In this thesis, we present the first FPTAS algorithm for EMDRT. Our algorithm runs roughly in O (( nm ) d +2 (log nm ) 2 d ) time (which is close to a lower bound on any PTAS for this problem), where n and m are the sizes of A and B , respectively. Our result is based on several new general techniques, such as the Sequential Orthogonal Decomposition ( SOD ) and Optimum Guided Base ( OGB ), and can be extended to several related problems, such as the problem of earth mover's distance under similarity transformation and the alignment problem, to achieve FPTAS for each of them. Fitting: Least Trimmed Squares ( LTS ) estimator is a statistical tool for estimating how well a set of points fits a hyperplane. As a robust alternative to the classical least squares estimator, LTS takes as input a set P of n points in R d and a fitting parameter m ≤ n , and computes a non-vertical hyperplane H so that the sum of the m smallest squared vertical distances from P to H is minimized. Previous research has indicated that although solving LTS (exactly or approximately) could be quite costly ( i.e. , it may take Ω( n d -1 ) time to even approximate it when m/n is a positive constant c < 1), a hybrid version of approximation, which is a bi-criteria on residual approximation and quantile approximation , can be obtained in linear time in any fixed dimensional space. In this thesis, we further show that an (ε r , ε q )-hybrid approximation of LTS can be computed in sub-linear time, where r > 0 is the residual approximation ratio and 0 < q < 1 is the quantile approximation ratio. The running time is independent of the input size n , when m = Θ( n ). Comparing to existing result, our approach has quite a few advantages, e.g. , is much simpler, has better robustness, takes only constant additional space, and can deal with big data ( e.g. , streaming data). Our result is based on new insights to the problem and several novel techniques, such as recursive slab partition , sequential orthogonal rotation , and symmetric sampling . Our technique can also be extended to achieve sub-linear time hybrid approximations for several related problems, such as data-oblivious computation for LTS in Secure Multi-party Computation (SMC) protocol, LTS on uncertain and range data, and the Orthogonal Least Trimmed Squares ( OLTS ) problem. It is likely that our technique will be applicable to other shape fitting problems.