TY - THES
TI - Robust models and evaluation for systems security research
DO - https://doi.org/doi:10.7282/t3-qeqq-wf05
PY - 2020
AB - Machine learning is common in modern systems security research. Researchers regularly use machine learning-based models for tasks such as authentication and user identification. The practices followed for developing and evaluating the machine learning model that forms the decision logic of these systems are often misleading. For example, the maximum accuracy (ACC) can be inflated when the data used to train a model is class-skewed. Additionally, models built on data from small user groups may achieve high performance values but fail to generalize to a larger population. These inflated performance values lead to unexpected system-level failures. There are several metrics used to evaluate how well a system performs at the task of distinguishing users. Existing metrics are often inadequate because they fail to capture the range of possible contingencies that arise when the measurements that decisions are based on have inherent ambiguities. These ambiguities can result in mistaking one user for another. For authentication or user identification, the consequences of such mistakes are dictated by the target application. Mistakenly granting access to a bank account has significantly different consequences than loading the wrong set of user preferences. Many of the common metrics hide underlying problems within the machine learning models. Models that are not tested with an adequate number of users can fail in surprising ways. In this PhD thesis, we explore the underlying reasons why these metrics are misleading and why models fail to generalize. We identify the flaws in the metrics and show how some metrics can degrade in performance when assumptions about the number of users are violated. We present surveys of proposals for new authentication or user identification systems from top-tier publication venues.
We found that 94% (33/35) of the authentication systems surveyed had reporting flaws and that 77% of user identification systems used fewer than 20 participants to validate their system. Finally, we present solutions to these issues in the form of metrics that can be visually checked for flaws and testing methods that can be used to determine when assumptions about population size break down.
KW - Machine learning
KW - Electrical and Computer Engineering
LA - English
ER -