TY - JOUR
TI - Advances in relationship clustering and outlier detection
DO - https://doi.org/doi:10.7282/t3-fgbh-9h26
PY - 2021
AB - Generalized linear models (GLMs) are widely used for response modeling problems. However, GLM users often encounter over-dispersion when the data contain unobserved heterogeneity. The first topic of my dissertation addresses this problem by introducing a clustering method, the HSA (Heterogenous Sample Auto-grouping) method, which reveals the hidden structure and accounts for the unobserved heterogeneity in GLMs. Furthermore, we develop a modeling framework that applies HSA to recover the decision boundary controlled by a structural variable in GLMs. My second dissertation topic is a directed neighborhood-based approach to local outlier detection. With the prevalence of techniques such as local outlier factor (LOF), local outlier detection is drawing more and more attention. Many outlier detection methods based on this concept produce an outlying score representing how likely the corresponding data object is to be an outlier, but the interpretation of the score is not consistent across data sets. To resolve this problem, we propose a local outlier detection approach, the LoCO (Local COnnectivity) method. Compared with existing local outlier detection techniques, it has stable performance in some challenging scenarios. The subsequent chapters are outlined as follows. Chapter 2 introduces a novel clustering method, the HSA method. We formulate the problem with a convex objective function. Since solving the optimization problem is not trivial because of the nonlinear loss and the many penalty terms, we introduce IOSA (Iterative Operator-splitting for Samples Algorithm) to solve it. The convergence of the algorithm is proved theoretically.
For the theoretical analysis, we analyze the minimax lower bound and the prediction upper bound for this type of problem. We also provide numerical examples to validate the model performance, and we apply the HSA method to a tourism data set and a bank marketing data set. The resulting groups can be reasonably justified. In Chapter 3, we introduce another application of HSA. HSA can be used to uncover the hidden structure within a data set. In many applications, the hidden structure of the data is determined by a structural variable that controls the general structure of the model instead of affecting the model as a standard covariate. We propose a three-stage modeling procedure, SD-HSA (Structural variable Driven-HSA), to solve this type of problem. In the first stage, we narrow down the pool of candidate structural variables. In the second stage, we apply HSA, incorporating the structural variable’s information. Finally, we select the best model using a model selection criterion such as AIC or BIC. We also provide numerical and real data examples to explore the performance of the modeling framework. Chapter 4 introduces a local outlier detection method, the LoCO method. It quantifies the degree of outlyingness of each data object by constructing a local asymmetric network (LAN). The LoCO score is easy to interpret and is more robust to density changes than existing local outlier detection methods such as local outlier factor (LOF). Furthermore, we calculate a "p-value" for each data object from the LoCO scores using conformal prediction. We compare the performance of the LoCO method and LOF through a series of simulation examples. Finally, we apply the new method to real data.
KW - Heterogeneity
KW - Outliers (Statistics)
KW - Cluster analysis
KW - Statistics and Biostatistics
LA - English
ER -