Gibert Renart, Eduard. Programming and managing data-driven applications between the edge and the cloud. Retrieved from https://doi.org/doi:10.7282/t3-g4vd-km15
Due to the proliferation of the Internet of Things (IoT), the number of devices connected to the Internet is growing. These devices generate large volumes of data at the edge of the infrastructure. According to International Data Corporation (IDC) predictions, worldwide data will reach 180 zettabytes (ZB) by 2025, and more than half of that data will come from IoT sensors. Although the generated data holds great potential for science and society, identifying and processing the relevant data points hidden in streams of unimportant data, and doing so in near real time, remains a significant challenge. The prevalent model of moving data from the edge of the network to the cloud is becoming unsustainable, with adverse effects on latency, network congestion, storage cost, and privacy.
These observations motivate hybrid architectures that use both edge and cloud resources to process data in a timely manner. While the cloud is better suited to heavier (resource-intensive) analysis, such as processing historical events and very large datasets, edge devices can support real-time analytics that consider the temporal and spatial characteristics of IoT data. Although edge processing can benefit IoT applications, edge resources are typically constrained in their capabilities. In addition, integrating edge computing can add complexity to applications, especially when they need to include policies that govern what kind of data is processed and analyzed at the edge and what is sent to the cloud.
To address these challenges, this dissertation presents an IoT Edge Framework, called R-Pulsar, that extends cloud capabilities to local devices and provides a programming model for deciding what, when, where, and how data gets collected and processed. This thesis makes the following contributions: (1) A content- and location-based programming abstraction for specifying what data gets collected and where the data gets analyzed. (2) A rule-based programming abstraction for specifying when to trigger data-processing tasks based on data observations. (3) A programming abstraction for specifying how to split a given dataflow and place operators across edge and cloud resources. (4) An operator placement strategy that aims to minimize an aggregate cost covering the end-to-end latency (time for an event to traverse the entire dataflow), the data transfer rate (amount of data transferred between the edge and the cloud), and the messaging cost (number of messages transferred between the edge and the cloud). (5) Performance optimizations on the data-processing pipeline to achieve real-time performance on constrained devices. The applicability of this work to real-world IoT applications is validated through a series of experiments, which show that R-Pulsar can reduce the bandwidth consumption between the edge and the cloud by up to 82% and obtain results 40% faster than the traditional approach of moving all the data to the cloud.
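To make the aggregate cost in contribution (4) concrete, the following is a minimal illustrative sketch and not R-Pulsar's actual placement algorithm. It assumes a linear dataflow with a single split point, a weighted sum as the aggregate, and final results that are always delivered to the cloud. The names `Operator`, `aggregate_cost`, and `best_split`, the weights, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    edge_ms: float    # assumed per-event processing time on an edge device
    cloud_ms: float   # assumed per-event processing time in the cloud
    out_bytes: float  # assumed bytes emitted per source event
    out_msgs: float   # assumed messages emitted per source event


def aggregate_cost(ops, split, src_bytes, src_msgs, link_ms,
                   w_lat=1.0, w_data=1.0, w_msg=1.0):
    """Weighted cost of running ops[:split] at the edge and ops[split:] in the cloud."""
    # End-to-end latency: edge stages, one edge-to-cloud hop, then cloud stages.
    latency = (sum(o.edge_ms for o in ops[:split])
               + link_ms
               + sum(o.cloud_ms for o in ops[split:]))
    # Whatever leaves the last edge operator (or the raw source) crosses the link.
    if split == 0:
        crossing_bytes, crossing_msgs = src_bytes, src_msgs
    else:
        crossing_bytes = ops[split - 1].out_bytes
        crossing_msgs = ops[split - 1].out_msgs
    return w_lat * latency + w_data * crossing_bytes + w_msg * crossing_msgs


def best_split(ops, **link):
    """Exhaustively try every split point and return the cheapest one."""
    return min(range(len(ops) + 1),
               key=lambda s: aggregate_cost(ops, s, **link))


# Hypothetical three-stage pipeline; every number is made up for illustration.
PIPELINE = [
    Operator("filter", edge_ms=5, cloud_ms=1, out_bytes=100, out_msgs=0.1),
    Operator("aggregate", edge_ms=20, cloud_ms=2, out_bytes=10, out_msgs=0.01),
    Operator("model", edge_ms=500, cloud_ms=20, out_bytes=5, out_msgs=0.01),
]
LINK = dict(src_bytes=1000, src_msgs=1, link_ms=50)

if __name__ == "__main__":
    split = best_split(PIPELINE, **LINK)
    print("edge: ", [o.name for o in PIPELINE[:split]])
    print("cloud:", [o.name for o in PIPELINE[split:]])
```

With these made-up numbers the cheapest split keeps the filter and aggregate stages at the edge and sends only the reduced stream to the cloud for the expensive model stage. A general dataflow graph or heterogeneous edge devices would call for a richer search than this exhaustive scan.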