TY - JOUR
TI - Boundless data analytics through progressive mining
DO - https://doi.org/doi:10.7282/t3-jse5-ng63
PY - 2018
AB - Multidimensional distributions in data mining are often represented as plots: scatter plots between two numerical variables, heat maps, bar graphs, histograms, and box plots. Each either relates two variables or shows the frequency distribution of one variable. What makes one distribution more interesting than another? What if we could generate all possible relationships and rank the most interesting ones at the top, all automatically, saving days of repetitive human work? We define an attribute-value pair from a dimension as a descriptor, and use a conjunction of k descriptors to slice a dataset. The problem of generating all possible large data slices is formalized as the frequent itemset mining problem. Because the dimensions may also include derived dimensions, we do not know ahead of time how long the process will take; it may even take an unbounded amount of time. We explore solutions that can answer the following questions: 1) Can we provide a progress indicator during this process? 2) Is the best-so-far partial solution available at any time? To this end, we investigate anytime algorithms and propose a dynamic approach called ALPINE that allows flexible trade-offs between efficiency and completeness. ALPINE is, to our knowledge, the first algorithm to progressively mine frequent itemsets and closed itemsets support-wise. It guarantees that all itemsets with support exceeding the current checkpoint’s support have been found before it proceeds further. ALPINE needs no minimum support value decided a priori and can, in principle, run forever. The ALPINE approach is also generalized to multiple tables based on entity-relationship modeling, without joining the tables into a single big table.
Finally, we build a boundless analytics system that can slice a given dataset in all possible ways and generate a very large (unbounded) number of plots. The generated plot objects are organized and indexed in a plot base to support user queries. A search interface with a user-friendly query language is designed for exploring the plots, and query responses are sorted by an interestingness measure. The system is used to analyze extensive historical NBA player statistics, with promising results.
KW - Computer Science
KW - Data mining
KW - ALPINE
KW - Entity-relationship modeling
KW - Computer algorithms
LA - eng
ER -