TY - JOUR
TI - Data mining methodologies with uncertain data
DO - https://doi.org/doi:10.7282/T3BZ69G4
PY - 2018
AB - Uncertain objects arise in many applications, such as sensor networks, moving object databases, and medical and biological databases, where each feature is represented by multiple observations or by a given or fitted probability density function (PDF). In this dissertation, we present a methodology to classify uncertain objects based on a probabilistic distance measure between an uncertain object and a group of uncertain objects. We call this newly proposed measure the object-to-group probabilistic distance measure (OGPDM), noting that dozens of probabilistic distance measures (PDMs) for the distance between two PDFs exist in the literature. To assess the accuracy of OGPDM, we compare it with existing classifiers, namely a K-Nearest Neighbor (KNN) classifier on object means (certain KNN) and an uncertain naïve Bayesian classifier. In addition, we compare OGPDM with an uncertain KNN classifier, also proposed here, that uses existing PDMs to measure object-to-object distances and then classifies using KNN. We illustrate the advantages of the proposed OGPDM classifier with both simulated and real data. OGPDM captures the correlation among features within a class. It also accounts for the correlation among features within objects, which most other uncertain data classification approaches do not take into account. Because of the uncertainty attached to uncertain data objects, their scatter can be very different from the scatter of certain data objects. Measures of scatter for uncertain objects have not been defined before. In this dissertation, we define measures of scatter for uncertain data objects, such as the covariance matrix, the within scatter matrix, and the between scatter matrix. We also extend Fisher Linear Discriminant Analysis (LDA) to uncertain objects.
We also develop a Kernel Fisher Discriminant for uncertain objects. The developed Uncertain Fisher LDA produces linear decision boundaries for separating classes of uncertain data objects, while the developed Uncertain Kernel Fisher Discriminants produce nonlinear decision boundaries. The Uncertain Kernel Fisher Discriminants cover two cases: uncertain objects given as PDFs and uncertain objects given as multiple points. We show through examples that the decision boundaries obtained from our uncertain Fisher Discriminants appear reasonable for separating classes of uncertain objects. We also compare their classification performance with that of many existing classifiers, both on simulated scenarios in which uncertain objects are modeled with skew-normal distributions and on a real-world data set. Clustering validity indices can be used to evaluate the quality of the clusters formed and to determine the correct number of clusters. Applied to the results of clustering algorithms, they validate the performance of those algorithms. In this dissertation, two clustering validity indices for uncertain data are developed: the uncertain Silhouette index and the Order Statistic index. To the best of our knowledge, no clustering validity index in the literature is designed for uncertain objects and can be used to validate the performance of uncertain clustering algorithms. Our proposed validity indices use probabilistic distance measures to capture the distance between uncertain objects. They outperform existing validity indices for certain data in validating clusters of uncertain data objects and are robust to outliers.
In particular, the Order Statistic index, a general form of the uncertain Dunn validity index (also developed here), handles well the instances in which a single cluster is relatively scattered (not compact) compared with the other clusters, or in which two clusters are close (not well separated) compared with the other clusters. Such instances can cause existing clustering validity indices to fail to detect the correct number of clusters.
KW - Industrial and Systems Engineering
KW - Data mining
LA - eng
ER -