Comparisons of statistical methods for determining gene expression signatures to predict binary cancer response
Simple citation
Dong, Qian.
Comparisons of statistical methods for determining gene expression signatures to predict binary cancer response. Retrieved from
https://doi.org/doi:10.7282/T3V69M6G
Description
Title: Comparisons of statistical methods for determining gene expression signatures to predict binary cancer response
Date Created: 2014
Other Date: 2014-10 (degree)
Extent: 1 online resource (xxvi, 234 p. : ill.)
Description: Cancer is a major public health problem with high mortality and morbidity. Over the past few decades, advances in high-throughput molecular technologies have been applied to diagnosing cancers and managing their treatment. Cancer classification using gene expression data poses many challenges to classical supervised learning methods. The main objective of this dissertation is to evaluate and compare the performance of six classification methods for predicting binary cancer outcomes from gene expression data: Logit (logistic regression), Lasso (least absolute shrinkage and selection operator), CART (classification and regression tree), RF (random forest), GBM (gradient boosted models), and SVM (support vector machine). We compare performance using both real-life datasets (prostate cancer data and breast cancer data) and extensive simulation experiments. Consistent with findings from previous comparisons of classifiers, the best classifier for predicting a binary outcome varies with the dataset and the evaluation measure. No classifier is identified as universally best across all empirical datasets and all simulation scenarios. When comparing classification methods, especially classifiers for predicting cancer outcomes, accuracy should not be the only consideration; other factors, such as simplicity of implementation, ease of interpretation for clinicians or biologists, and the biological insights that can be gained from a classifier's results, should also be taken into account. In addition, we provide clear and easy-to-follow procedures for predictive model building and performance assessment, for clinical researchers who need to compare classification results from different classifiers.
We have addressed the binary classification problem in this thesis, but the approach should be readily applicable to multi-category classification problems or to survival analysis problems. Based on results from the real-life datasets and extensive simulation experiments, we find that for classification problems with high-dimensional data, a simple but widely used method such as logistic regression has its limitations and may not achieve the desired performance. Classifiers designed to handle large numbers of predictors, such as Lasso, GBM, SVM, and RF, are better choices in such situations.
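The abstract's point that the best classifier depends on the evaluation measure can be illustrated with a minimal sketch. This is not the dissertation's procedure or data: the labels, the two prediction vectors, and the names `classifier_a`, `classifier_b`, `evaluate`, and `best_by` are all hypothetical, chosen only to show how a ranking by accuracy can differ from a ranking by sensitivity.

```python
from typing import Dict, List


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
    }


def best_by(measure: str, results: Dict[str, Dict[str, float]]) -> str:
    """Return the name of the classifier with the highest value of `measure`."""
    return max(results, key=lambda name: results[name][measure])


# Hypothetical toy data: 1 = responder, 0 = non-responder (not from the thesis).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = {
    "classifier_a": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # conservative: misses responders
    "classifier_b": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # liberal: more false positives
}

results = {name: evaluate(y_true, pred) for name, pred in predictions.items()}

for measure in ("accuracy", "sensitivity", "specificity"):
    print(measure, "->", best_by(measure, results))
```

On this toy data the conservative classifier ranks first on accuracy and specificity while the liberal one ranks first on sensitivity, which is why a comparison should state its evaluation measure up front.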
Note: Dr.P.H.
Note: Includes bibliographical references
Note: Qian Dong
Genre: theses, ETD doctoral
Language: eng
Collection: School of Public Health ETD Collection
Organization Name: Rutgers, The State University of New Jersey
Rights: The author owns the copyright to this work.