Description

This dissertation consists of three chapters. It develops new methodologies to address two specific problems in recent statistical research:
• How to incorporate hierarchical structure in high-dimensional regression model selection.
• How to achieve semiparametric efficiency in the presence of missing data.

For the first problem, we provide a new approach that explicitly incorporates a given hierarchical structure among the predictors into high-dimensional regression model selection. The proposed estimation approach has a hierarchical grouping property: a pair of variables that are “close” in the hierarchy is more likely to be grouped together in the estimated model than a pair that is “far away”. We also prove that the proposed method consistently selects the true model. These properties are demonstrated numerically in simulations and in a real-data analysis of a peripheral-blood mononuclear cell (PBMC) study.

For the second problem, two frameworks are considered: the generalized partially linear model (GPLM) and causal inference in observational studies. Under the GPLM framework, we consider a broad range of missing-data patterns that subsumes those treated in most publications on the topic. We use the concept of the least favorable curve and extend the generalized profile likelihood approach [Severini and Wong (1992)] to estimate the parametric component of the model, and we prove that the proposed estimator is consistent and semiparametrically efficient. Under the causal inference framework, we propose to estimate the mean treatment effect with non-randomized treatment exposures in the presence of missing data. An appealing aspect of this development is that we incorporate post-baseline covariates, which are often excluded from causal effect inference because of their inherent confounding with treatment. We derive the semiparametric efficiency bound for regular asymptotically linear (RAL) estimators and propose an estimator that achieves this bound.
Moreover, we prove that the proposed estimator is robust against four types of model misspecification. The performance of the proposed approaches is illustrated numerically through simulations and through real-data analyses of a group testing dataset from the Nebraska Infertility Prevention Project and a burden-of-illness dataset from Duke University Medical Center.