TY - JOUR
TI - Distributed frameworks for approximate data analytics
DO - https://doi.org/doi:10.7282/t3-k81a-gp04
PY - 2020
AB - Data-driven discovery has become critical to the mission of many enterprises and to scientific research. At the same time, the rate of data production and collection is outpacing technology scaling, suggesting that significant future investment, time, and energy will be needed for data processing. Simply increasing hardware resources can address the extra processing needs, either by adding more CPU cores and memory (scale-up) or by adding more worker nodes (scale-out). However, doing so raises computing costs, which may not be feasible when budgets are limited. One powerful tool for addressing this challenge is approximate computing, which trades computational accuracy for reduced computational time and resources by shrinking the amount of data that must be processed. Fortunately, many data analytics applications, such as data mining, log processing, and video/image processing, are amenable to approximation. In this thesis, we describe the design and implementation of approximation frameworks that accelerate distributed data analytics. We present frameworks targeting a variety of tasks and datasets, including log aggregation, text analytics, and video querying and aggregation.
Our first work targets approximating aggregation jobs with error estimation. Aggregation is central to many decision support queries; it is also an important component of OLAP (Online Analytical Processing) systems and is frequently used for summarizing data patterns in business intelligence. Aggregation jobs often involve multiple transformation steps in a data processing pipeline. We design and implement a sampling-based approximation framework called ApproxSpark that can rigorously derive estimators with error bounds for approximate aggregation.
Our second work targets approximate text analytics tasks. We propose and evaluate a framework called EmApprox that uses sampling-based approximation to speed up the processing of a wide range of queries over large text datasets. EmApprox builds an index for a dataset by learning a natural language processing model, producing vectors that represent words and subcollections of documents. Our approximation index can significantly improve approximation quality while processing only a small fraction of the data: each sampling unit is assigned a sampling rate proportional to its similarity to the query. We have implemented a prototype of EmApprox as a Python library and used it to approximate aggregation, information retrieval, and recommendation tasks.
Finally, we target approximate video analytics. Video data embed rich, high-quality information, yet video analytics is particularly compute intensive because it often involves invoking a deep convolutional neural network (CNN) for object detection. We design and implement an approximate video analytics framework called VidApprox for accelerating video queries that involve object detection. VidApprox first leverages cheap CNNs to learn vector representations of video segments, then organizes the vectors into a persistent index structure. At query processing time, an index lookup serves as auxiliary information for retrieving only the subset of video segments most similar to the query. This makes downstream processing such as object detection or aggregation more efficient by performing expensive operations such as CNN inference only on the relevant video data.
We show that approximation is a promising technique for reducing processing time on large datasets. However, approximation poses multifaceted challenges when applied to data processing tasks across different domains. In particular, applying approximation presents a complicated trade-off space involving processing time reduction, quality of computation results, and preprocessing complexity. Our work demonstrates not only that it is possible to balance computational accuracy against processing time reduction, but also that a compact, machine-learned representation of the data can serve as an index structure for improving approximation quality across different domains and datasets.
KW - Data analytics
KW - Computer Science
LA - English
ER -