Data centre resource scheduling for dataflow applications
File(s)
Fertakis-K-2024-PhD-Thesis.pdf (6.93 MB)
Thesis
Author(s)
Fertakis, Konstantinos
Type
Thesis or dissertation
Abstract
Modern data centres host distributed, data-intensive applications such as machine learning and data analytics. Recent hardware advancements and growing data scale are making the network increasingly important in determining application performance. Consequently, allocating compute and network resources in concert becomes crucial. Yet existing resource and workload scheduling practices remain siloed, leaving optimisation opportunities untapped. At the same time, new application domains introduce different degrees of freedom in how their dataflows are mapped onto cluster resources.

In this dissertation, we argue for holistic compute and network resource scheduling with dataflow planning for distributed data-intensive workloads in modern data centres. We make the following contributions in mapping dataflow operations onto cluster resources by exploiting the newfound flexibilities.

First, the advent of programmable fabric devices enables applications to improve their performance by offloading computations into the fabric. These deployments, however, introduce new challenges, as existing independent resource scheduling practices are ill-equipped to handle them. We design a new cluster scheduler that jointly optimises compute and fabric resource allocations in modern data centres. We introduce a novel task placement algorithm that optimises the distribution of application and fabric bandwidth.
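The idea of jointly scoring compute and fabric resources can be illustrated with a minimal greedy-placement sketch. This is not the thesis's actual algorithm; the host fields (`free_slots`, `fabric_bw`) and the scoring rule are illustrative assumptions.

```python
# Illustrative sketch (hypothetical, not the scheduler described in the
# abstract): place a task on the host that leaves the most headroom on BOTH
# its compute slots and the residual bandwidth of its fabric uplink, instead
# of scheduling the two resources independently.

def place_task(task, hosts):
    """task: {'slots': int, 'bw': float}; hosts: list of
    {'name', 'free_slots', 'fabric_bw'} dicts (fields are assumptions).
    Returns the chosen host name, or None if no host fits."""
    best, best_score = None, float("-inf")
    for h in hosts:
        if h["free_slots"] < task["slots"]:
            continue  # not enough compute capacity
        if h["fabric_bw"] < task["bw"]:
            continue  # fabric path would be oversubscribed
        # Joint score: the tighter of the two residuals after placement.
        score = min(h["free_slots"] - task["slots"],
                    h["fabric_bw"] - task["bw"])
        if score > best_score:
            best, best_score = h, score
    if best is None:
        return None
    best["free_slots"] -= task["slots"]  # commit the allocation
    best["fabric_bw"] -= task["bw"]
    return best["name"]
```

A compute-only scheduler would ignore the `fabric_bw` check entirely, which is the siloed behaviour the abstract argues against.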

Second, distributed ML training deployments can use different topologies to fulfil their communication requirements. However, current training systems employ static communication strategies, which hurt performance under varying network conditions. We develop a distributed ML training library that enables the dynamic adaptation of communication topologies. Our work employs an efficient performance monitoring and deployment reconfiguration mechanism to enable runtime adaptations.
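The monitor-and-reconfigure loop for communication topologies can be sketched as follows. All names and the bandwidth thresholds are hypothetical; the point is only that the topology choice becomes a function of measured network conditions rather than a static configuration.

```python
# Illustrative sketch (hypothetical, not the training library from the
# abstract): periodically re-evaluate measured per-link bandwidth and switch
# the collective-communication topology at runtime when the current choice
# becomes a bottleneck.

def choose_topology(link_bw_gbps, n_workers):
    """Pick an all-reduce topology from measured link bandwidths (Gb/s).
    Ring all-reduce is bandwidth-efficient on slow links; a tree cuts the
    number of steps when links are fast. Thresholds are arbitrary here."""
    slowest = min(link_bw_gbps)
    if slowest < 10 or n_workers <= 4:
        return "ring"
    return "tree"

class AdaptiveTrainer:
    """Holds the current topology and reconfigures it from fresh metrics."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.topology = "ring"  # static default a non-adaptive system keeps

    def monitor_and_reconfigure(self, measured_bw):
        new = choose_topology(measured_bw, self.n_workers)
        if new != self.topology:
            self.topology = new  # runtime reconfiguration point
        return self.topology
```

A static system corresponds to never calling `monitor_and_reconfigure`, which is exactly the behaviour the abstract identifies as hurting performance under varying network conditions.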

Third, emerging multi-model ML training deployments scale by distributing computation across devices. However, existing distribution strategies statically allocate resources to models, leading to poor hardware utilisation. We propose a novel training scheduling approach that dynamically allocates resources to models and adapts the deployment's resource requirements. Our approach enables independent model scheduling and dynamic offloading between hosts and devices.
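Dynamic per-model resource allocation with host offloading can be sketched in a few lines. This is an illustrative toy, not the thesis's scheduler: the proportional-share rule and all names are assumptions.

```python
# Illustrative sketch (hypothetical): share a fixed pool of accelerator
# devices across several models, reallocating each scheduling round in
# proportion to each model's pending work, and offloading any remainder to
# the host. A static scheme would fix the allocation once, up front.

def allocate_devices(demands, total_devices):
    """demands: {model: pending batches}. Returns (alloc, offload), where
    alloc maps models to device counts and offload maps models to the
    batches that must run on the host this round."""
    total = sum(demands.values())
    alloc, offload = {}, {}
    remaining = total_devices
    for model, d in sorted(demands.items()):
        # Proportional share, at least one device while any remain.
        share = min(remaining, max(1, round(total_devices * d / total)))
        alloc[model] = share
        remaining -= share
        offload[model] = max(0, d - share)  # excess work goes to the host
    return alloc, offload
```

Calling `allocate_devices` every round lets a model whose demand drops release devices to its neighbours, which is the utilisation gain over a static split.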
Version
Open Access
Date Issued
2024-12-20
Date Awarded
2025-10-01
URI
https://hdl.handle.net/10044/1/123897
DOI
https://doi.org/10.25560/123897
Copyright Statement
Attribution-NonCommercial 4.0 International Licence (CC BY-NC)
License URL
https://creativecommons.org/licenses/by-nc/4.0/
Advisor
Pietzuch, Peter
Publisher Department
Department of Computing
Publisher Institution
Imperial College London
Qualification Level
Doctoral
Qualification Name
Doctor of Philosophy (PhD)