Data centre resource scheduling for dataflow applications
Author(s)
Fertakis, Konstantinos
Type
Thesis or dissertation
Abstract
Modern data centres host distributed, data-intensive applications such as machine learning and data analytics. Recent hardware advancements and growing data scale are making the network increasingly important in determining application performance. Consequently, the synergy between compute and network resource allocation becomes crucial. Yet, existing resource and workload scheduling practices remain siloed, leaving optimisation opportunities unexploited. At the same time, new application domains introduce different degrees of freedom in how their dataflows are mapped onto cluster resources.
In this dissertation, we argue for holistic compute and network resource scheduling with dataflow planning for distributed data-intensive workloads in modern data centres. We make the following contributions in mapping dataflow operations onto cluster resources by exploiting the newfound flexibilities.
First, the advent of programmable fabric devices enables applications to accelerate their performance by offloading computations inside the fabric. These deployments, however, introduce new challenges as existing independent resource scheduling practices are ill-equipped to handle them. We design a new cluster scheduler that jointly optimises compute and fabric resource allocations in modern data centres. We introduce a novel task placement algorithm that optimises application and fabric bandwidth distribution.
Second, distributed ML training deployments can use different topologies to fulfil their communication requirements. However, current training systems employ static communication strategies, which hurt performance under varying network conditions. We develop a distributed ML training library that enables the dynamic adaptation of communication topologies. Our work employs an efficient performance monitoring and deployment reconfiguration mechanism to enable runtime adaptations.
Third, emerging multi-model ML training deployments scale by distributing computation across devices. However, existing distribution strategies statically allocate resources to models, leading to poor hardware utilisation. We propose a novel training scheduling approach that dynamically allocates resources to models and adapts the deployment's resource requirements. Our approach enables independent model scheduling and dynamic offloading between hosts and devices.
Version
Open Access
Date Issued
2024-12-20
Date Awarded
2025-10-01
Copyright Statement
Attribution-NonCommercial 4.0 International Licence (CC BY-NC)
Advisor
Pietzuch, Peter
Publisher Department
Department of Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)