Enhancing stream data processing: system optimizations and learned indexes
Author(s)
Liang, Liang
Type
Thesis or dissertation
Abstract
This thesis aims to optimize stream data processing, which is important for real-time data analysis and decision-making. Stream data’s inherent properties, including unbounded size, high volume, and variable velocities, impose significant challenges on processing systems. These systems must continuously evolve to meet the requirements of modern stream data processing. This thesis presents optimizations to enhance the functionality, scalability, and performance of stream data processing at both the system and algorithmic levels.
At the system level, we focus on dispel4py, a stream processing system designed for scientific workloads. We enhance the scalability and state management of dispel4py by developing dynamic allocation, dynamic auto-scaling, and hybrid optimizations. Specifically, dynamic allocation allows dispel4py to scale each task according to workload demand, and dynamic auto-scaling lets the entire workload scale to fewer or more resources, maintaining performance while achieving cost efficiency. Furthermore, the hybrid optimization enables dispel4py to support stateful tasks and scaling simultaneously. Comprehensive experiments validate the scalability, portability, and performance of these three optimizations.
At the algorithm level, our focus shifts to Index-Based Window Processing (IBWP). Recently, learned indexes, which integrate machine learning models to enhance query performance, have emerged as a promising alternative to traditional index structures. Motivated by this trend, we explore how learned indexes can effectively support search while sustaining updates for high-velocity data streams. The challenge lies in the inherent limitations of current updatable learned indexes. These limitations are often inherited from their traditional tree-based structures, which are cumbersome and impede update performance. To overcome these limitations, we pioneer the use of queue-style flat structures, which significantly enhance update efficiency and reduce the index footprint. Based on these flat structures, we propose FLIRT and SWIX, designed for sequential IBWP and generic IBWP, respectively. Our experiments demonstrate that they effectively manage their respective IBWPs, outperforming all baselines.
Version
Open Access
Date Issued
2024-09-20
Date Awarded
01/02/2025
Advisor
Heinis, Thomas
Publisher Department
Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)