f-CNN\textsuperscript{x}: A Toolflow for Mapping Multiple Convolutional Neural Networks on FPGAs

Stylianos I. Venieris
Department of Electrical and Electronic Engineering
Imperial College London
Email: stylianos.venieris10@imperial.ac.uk

Christos-Savvas Bouganis
Department of Electrical and Electronic Engineering
Imperial College London
Email: christos-savvas.bouganis@imperial.ac.uk

Abstract—The predictive power of Convolutional Neural Networks (CNNs) has been an integral factor for emerging latency-sensitive applications, such as autonomous drones and vehicles. Such systems employ multiple CNNs, each one trained for a particular task. The efficient mapping of multiple CNNs on a single FPGA device is a challenging task as the allocation of compute resources and external memory bandwidth needs to be optimised at design time. This paper proposes f-CNN\textsuperscript{x}, an automated toolflow for the optimised mapping of multiple CNNs on FPGAs, comprising a novel multi-CNN hardware architecture together with an automated design space exploration method that considers the user-specified performance requirements for each model to allocate compute resources and generate a synthesisable accelerator. Moreover, f-CNN\textsuperscript{x} employs a novel scheduling algorithm that alleviates the limitations of the memory bandwidth contention between CNNs and sustains the high utilisation of the architecture. Experimental evaluation shows that f-CNN\textsuperscript{x}’s designs outperform contention-unaware FPGA mappings by up to 50\% and deliver up to 6.8x higher performance-per-Watt over highly optimised GPU designs for multi-CNN systems.

I. INTRODUCTION

Over the last decade, Convolutional Neural Network (CNN) models have substantially improved the state-of-the-art performance in several Artificial Intelligence (AI) tasks. This property has made CNNs an enabling technology for novel systems in both embedded and cloud applications. On the one side of the spectrum, autonomous robots and vehicles is an emerging field that has gathered wide interest from both the academic and industrial communities due to its potential societal and economic effects. On the other end, data centre-based analytics that employ CNNs to serve a large pool of clients is becoming a widespread operational model.

Both embedded and data centre-based AI systems rely on multiple CNNs for their operation. In latency-critical, vision-centric autonomous systems, perception is largely based on highly accurate and reliable computer vision tasks, such as object detection [3] and semantic segmentation [4]. Similarly, cloud-based systems have to cope with servicing a wide range of concurrent CNN-based applications, from bioinformatics to visual search [5], with stringent response-time demands. In such scenarios, a dedicated model is trained for each particular task, leading to the parallel execution of several CNNs on the same target platform. Moreover, the latency-sensitive nature of modern applications prohibits the use of batch processing. As a result, in both emerging embedded and cloud applications there is a requirement for the latency-driven mapping of multiple CNNs on the computing platform of the target system.

Currently, the conventional computing infrastructure of complex autonomous systems and data centres comprises CPUs and GPUs, which are able to provide high processing speed at the expense of high power consumption. A potential alternative platform that can offer both the flexibility and the performance required by modern CNNs within a lower power envelope is the FPGA. In the space of multi-CNN systems, FPGAs offer unique optimisation opportunities due to the possibility of fine-grained allocation of resources, which is not offered by other platforms. However, until now, CNN implementations, including FPGA-based accelerators [6]–[8], have typically been designed and optimised for scenarios where a single model is running for an extensive period of time, while the multiple-CNN setting has remained unexplored.

In this paper, we propose f-CNN\textsuperscript{x}, an automated framework that maps multiple CNNs on a target FPGA, by taking into account the application-level required performance for each model and the available hardware resources, in order to generate an optimised multi-CNN architecture. The proposed framework exploits the structure of CNN workloads and the fine-grained control over resource allocation of FPGAs to yield latency-optimised designs that overcome the limitations of other parallel platforms targeting multiple CNNs. This paper makes the following key contributions:

- A novel architecture for the parallel execution of multiple CNNs on a single FPGA. The proposed architecture is parametrised to allow the fine-grained allocation of resources among CNNs and the deterministic scheduling of external memory transfers to minimise memory contention. This parametrisation enables us to explore the design space of a wide range of resource and bandwidth allocations.

- A novel design space exploration algorithm for efficiently traversing the large design space. The proposed algorithm co-optimises the mapping of multiple CNNs on the target FPGA and incorporates the application-level importance of each model by means of multiobjective cost functions in order to guide the design space exploration to the optimum design points. Moreover, a scheduling algorithm is proposed for the optimised sharing of the external memory bandwidth.

- The f-CNN\textsuperscript{x} automated toolflow for mapping multiple CNNs on a particular FPGA-based platform, taking as input a target set of CNNs in a high-level description, performing fast design space exploration and generating a synthesisable Vivado HLS hardware design.

To the best of our knowledge, this work addresses for the first time in the literature the mapping of multiple CNNs.

II. MULTIPLE CNNs ON RECONFIGURABLE LOGIC

A. Background on Multi-CNN Systems

Multi-CNN systems employ a number of models, with each one trained for a different task. In the embedded space, drones and self-driving cars run a variety of concurrent tasks, such as navigation and obstacle avoidance [9]. In the cloud, services are increasingly heterogeneous, with diverse workloads executed concurrently for a large number of users [5]. Nevertheless, mapping multiple CNNs on a computing platform poses a challenge. With each model targeting a different task, the performance constraints, such as minimum
throughput and maximum latency, vary accordingly. Moreover, in resource-constrained setups, multiple CNNs compete for the same pool of computational and memory resources. As a result, the mapping of multiple CNNs is a high-dimensional design problem that encompasses both the performance needs of each model and the resource constraints of the target platform.

B. Opportunities and Challenges in Mapping Multiple CNNs

CNNs comprise a sequence of layers, organised as a feature extractor and a classifier stage. With the feature extractor dominating the computational cost and fully-connected layers limited in recent state-of-the-art models [10–12], this work focuses on the feature extractor. In the context of multiple CNNs, their characteristic structure presents opportunities for performance optimisation. The dataflow of a CNN consists of a feed-forward topology which can be modelled as a directed acyclic graph with one node per layer. Under this model, the dependencies between nodes and the workload of each node, including the ops/input, storage and memory bandwidth for weights and feature maps, are known a priori based on each layer’s type and configuration. This prior knowledge of compute and memory requirements enables (1) optimising at compile time the on-chip resource allocation between multiple CNNs and (2) generating an optimised static schedule for sharing the bandwidth to sustain high hardware utilisation.
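To make the "known a priori" point concrete, the following is a minimal sketch of how the compute and weight-storage workload of a chain of convolutional layers can be derived purely from layer configurations. The `ConvLayer` type, its field names and the two example layers are illustrative assumptions, not part of the toolflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical layer descriptor; field names are illustrative."""
    n_in: int    # input feature maps
    n_out: int   # output feature maps
    k: int       # square filter size K
    h_out: int   # output feature map height
    w_out: int   # output feature map width


def conv_macs(layer: ConvLayer) -> int:
    """Multiply-accumulate operations needed for one input sample."""
    return layer.n_in * layer.n_out * layer.k ** 2 * layer.h_out * layer.w_out


def conv_weights(layer: ConvLayer) -> int:
    """Number of trained weights (biases ignored)."""
    return layer.n_in * layer.n_out * layer.k ** 2


def network_workload(layers):
    """Return (total MACs, total weights) for a chain of conv layers."""
    return (sum(conv_macs(l) for l in layers),
            sum(conv_weights(l) for l in layers))


# Two made-up layers, used only to exercise the functions.
layers = [ConvLayer(3, 16, 3, 32, 32), ConvLayer(16, 32, 3, 16, 16)]
total_macs, total_weights = network_workload(layers)
```

Because both quantities depend only on the layer configuration, they can be evaluated for every candidate mapping before any hardware is generated.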

To exploit effectively these CNN-specific opportunities, a fine-grained control over the customisation of the hardware is required. Fine-grained parametrisation would allow tailoring the allocation of on-chip resources to the potentially different performance needs of the CNNs. At the same time, control over the shared off-chip memory bandwidth would enable deriving a schedule that sustains a high utilisation of the architecture. Nevertheless, such a fine granularity leads to a large number of design parameters even for a single CNN. By scaling the problem to multiple models, the space of possible designs becomes combinatorially large. Thus, the complexity of mapping multiple CNNs on FPGAs necessitates a principled methodology in order to generate optimised designs.

III. PROPOSED FRAMEWORK

A high-level description of f-CNN\textsuperscript{x}’s flow is as follows. The deep learning specialist provides the set of CNNs in Caffe\(^3\) format, together with a target performance for each model, and the resources of the target FPGA platform. The Caffe descriptions are translated to a dataflow representation with one node per layer and passed to the design space exploration (DSE). The DSE employs a Synchronous Dataflow model of the multi-CNN hardware architecture and a memory scheduling policy to traverse the design space and optimise a multi-objective criterion that captures the user-specified performance for each CNN. After the highest performing design point is selected, f-CNN\textsuperscript{x} generates synthesisable Vivado HLS code, which is compiled by the vendor’s toolchain.

A. Architecture

Fig. 1 shows the proposed multi-CNN architecture consisting of two components: a number of heterogeneous CNN engines and a multi-CNN hardware scheduler (MCNN-HS). Instead of scheduling the target set of CNNs sequentially over a fixed accelerator, the strategy of our framework is to generate one dedicated engine per CNN, customised to its workload and performance needs, allowing the concurrent execution of all models in an efficient way. The MCNN-HS module allocates the off-chip memory bandwidth to the CNN engines, with a static schedule as determined during the design space exploration. The scheduling of off-chip memory transactions and the design of MCNN-HS are detailed in Sec. III-C and III-D respectively.

CNN Engine. The hardware structure for each CNN engine can be either a core that processes each layer sequentially in a tiled manner (e.g. a matrix multiplication unit or a systolic array) or a streaming architecture. In the first case, the engines would have a fixed hardware template with customisable tile sizes. In the latter case, a streaming design would be parametrised with respect to the instantiated stages, their interconnections and the resource allocation among them. In this work, the streaming paradigm is adopted, to obtain a finer grain of control over the structure of each individual CNN engine. Each engine consists of a coarse pipeline of heterogeneous hardware stages, with each stage parametrised with respect to its parallelism-resource trade-off. The pipeline for each CNN can have a different structure, with a customisable sequence of stages based on the topology and the computational needs of the corresponding CNN (Fig. 1). Overall, the CNN engines operate under a data-driven scheme so that each stage computes whenever data arrive at its input.

The hardware stages are composed of modules for the convolutional, pooling and nonlinear layers. In the convolutional layer, we exploit the parallelism with respect to its outputs by tunably unrolling and instantiating one convolution processing element (C-PE) per output feature map, with the input feature maps processed in a pipelined manner. The output feature maps are parametrised to be folded, as shown in Fig. 1 so that C-PEs can be time-shared within a layer. Moreover, the dot-product circuit inside each C-PE can be tunably scaled (Fig. 1), from a single multiply-accumulate operator up to a fully parallel multiplier array with an adder tree. Pooling and nonlinear stages also have a tunable number of PEs, while operator-level folding can be applied on max and average units of pooling PEs. Under this parametrisation, each hardware stage has a tunable number of PEs, \(N_{PE} \in [1, N_{out}]\), where \(N_{out}\) is the maximum number of output feature maps it has to process, and a tunable number of operators, \(N_{op} \in [1, K^2]\), where \(K\) is the filter or pooling size depending on the type of layer, and can be optimised as dictated by the workload and the application-level performance requirements of the particular CNN.

With modern CNNs requiring an excessive amount of memory for their trained weights even for a single layer [10], we allow for the further folding of convolutional layers with respect to their inputs. Layers that exceed the on-chip storage of the target FPGA are tunably folded with respect to their input feature maps and the associated weights with a factor of \( f_{in} \in [1, N_{in}] \) which determines the tile size, where \( N_{in} \) is the number of input feature maps. This approach enables the on-chip compute and memory resources allocated for a convolutional layer to be time-multiplexed and the on-chip storage requirements to be accommodated by the target device.

\(^3\)http://caffe.berkeleyvision.org/
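As a rough illustration of how quickly the per-stage parameter space grows, the sketch below enumerates candidate \( (N_{PE}, N_{op}, f_{in}) \) tuples for one convolutional stage. Restricting each parameter to divisors of its upper bound is an assumption made here to keep the folding even; the function names are hypothetical.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def stage_configs(n_out, k, n_in):
    """Enumerate (N_PE, N_op, f_in) candidates for one convolutional stage.

    Assumption: only divisors of N_out, K^2 and N_in are considered.
    """
    configs = []
    for n_pe in divisors(n_out):
        for n_op in divisors(k * k):
            for f_in in divisors(n_in):
                configs.append((n_pe, n_op, f_in))
    return configs


def multipliers(config):
    """Multipliers instantiated by a stage: one per operator per PE."""
    n_pe, n_op, _ = config
    return n_pe * n_op


# Example stage: 64 output maps, 3x3 filters, 32 input maps (made-up sizes).
configs = stage_configs(n_out=64, k=3, n_in=32)
```

Even with this restriction a single stage has over a hundred candidates, and the count multiplies across stages and across CNNs.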

**CNN Partitioning and Subgraphs.** The large depth and amount of weights often prohibit the direct mapping of each individual CNN to hardware. To sustain the utilisation of the architecture, we partition each CNN into subgraphs. The adopted partitioning scheme allows the partitioning along (1) the depth of the model and (2) the input feature maps of each convolutional layer, and requires each subgraph to contain at least one convolutional layer. With this formulation, the structure of each CNN engine is derived so that its datapath can execute all the subgraphs of the corresponding CNN. The partition points and the datapath for each engine are selected during the proposed design space exploration, described in Sec. III-B. Given a set of partitioned CNNs, the compute and memory requirements of each subgraph are known at compile time, based on the subgraph’s layers. As a result, the scheduling of the subgraphs on the corresponding engine as well as the memory transactions of the overall multi-CNN architecture can be statically optimised at compile time.
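The depth-wise part of this partitioning scheme can be sketched as follows: every way of cutting a layer sequence is enumerated and only the partitions in which each subgraph holds at least one convolutional layer are kept. The layer-type strings and the function name are illustrative assumptions; partitioning along input feature maps is not modelled here.

```python
from itertools import combinations


def valid_partitions(layer_types):
    """Yield depth-wise partitions as lists of subgraphs (lists of layer indices).

    A partition is kept only if every subgraph contains a 'conv' layer.
    """
    n = len(layer_types)
    for r in range(n):                          # r = number of cut points
        for cuts in combinations(range(1, n), r):
            bounds = [0, *cuts, n]
            parts = [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]
            if all(any(layer_types[i] == "conv" for i in p) for p in parts):
                yield parts


# Made-up five-layer feature extractor.
example = ["conv", "relu", "pool", "conv", "relu"]
partitions = list(valid_partitions(example))
```

For this example only four of the sixteen possible cuts survive the constraint, since the second convolution must open its own subgraph or stay with the first.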

**B. Design Space Exploration**

Given a set of CNNs, the design space of possible mappings is formed by the free parameters of the architecture. These include (1) the partition points of each CNN, (2) the structure of each CNN engine, including the number and type of hardware stages and the connections between stages, (3) the compile-time configurable folding parameters of each stage \( (N_{PE}, N_{op}, f_{in}) \), and (4) the external memory bandwidth schedule. By defining such a large parameter space, our proposed framework gains the capability of very fine-grained customisation, which enables exploring a wide range of optimisations, at the cost of a combinatorial space of possible mappings. To capture each design point analytically and navigate the design space efficiently, we employ a Synchronous Dataflow (SDF) model [13] which considers the configuration of each design point to estimate performance, on-chip resource consumption and external memory bandwidth requirements.

**Performance Model.** Using the methodology described in [14], we develop an SDF model for the multi-CNN architecture. We model each CNN engine as an SDF graph \( G_{CE}=(V, E) \), with each node \( v \in V \) representing a hardware stage. The configuration of each stage in the CNN engine is captured with a tuple of the form \( \langle N_{PE}, N_{op}, f_{in}, T \rangle \), with \( N_{PE}, N_{op} \) and \( f_{in} \) as defined in Sec. III-A and \( T \) the type of module. In this setting, each stage has a consumption rate \( R_{PE,N_{op}} \) elements/cycle and the CNN engine is equivalently represented with a topology matrix \( \Gamma \in \mathbb{R}^{|E|\times|V|} \) with \( \Gamma(e, v) \) holding the processing rate of node \( v \) on arc \( e \).

The workload of a CNN subgraph is captured with a workload matrix \( W \in \mathbb{Z}^{|E|\times|V|} \) with \( W(e, v) \) holding the elements to be produced or consumed by node \( v \) on arc \( e \). A partitioned CNN with \( N_{W} \) subgraphs is associated with a workload tuple \( W = (W_{i}, i \in [1, N_{W}]) \), with one matrix per subgraph. At each stage, the workload is \( f_{in}N_{out}K^{2}h_{out}w_{out} \) elements for convolutional and \( N_{out}K^{2}h_{out}w_{out} \) elements for pooling layers with \( N_{out} (h_{out} \times w_{out}) \)-sized output feature maps. In the case of \( N \) CNNs, the multi-CNN architecture is represented as \( G_{multiCE}=(G_{1CE}, ..., G_{NCE}) \) with multi-CNN topology and workload tuples \( \Gamma = (\Gamma_{i} \in \mathbb{R}^{|E_{i}|\times|V_{i}|}, i \in [1, N]) \) and \( W = (W_{i,j} \in \mathbb{Z}^{|E_{i}|\times|V_{i}|}, i \in [1, N], j \in [1, N_{W_{i}}]) \). The initiation interval matrix for the j-th subgraph of the i-th CNN is constructed by the element-wise division \( I_{i,j} = W_{i,j} \oslash \Gamma_{i} \), and the execution time of a single (j-th) subgraph and all subgraphs of the i-th CNN on the i-th engine are given by Eq. (1) and (2) respectively:

\[
t_{i,j}(B, \Gamma_{i}, W_{i,j}) = \frac{1}{\text{clock rate}} \left( D_{i} + I_{i,j}^{\text{max}} \cdot (B - 1) \right) \tag{1}
\]

\[
t_{\text{total}}(B, \Gamma_{i}, W_{i,j}) = \sum_{j=1}^{N_{W_{i}}} t_{i,j}(B, \Gamma_{i}, W_{i,j}) + \sum_{j=1}^{N_{W_{i}}} t_{i,j,\text{weights}} \tag{2}
\]

where \( I_{i,j}^{\text{max}} \) is the maximum element of \( I_{i,j} \), \( B \) the batch size, \( D_{i} \) the pipeline depth of the i-th CNN engine and \( t_{i,j,\text{weights}} \) the time to load the weights of the j-th subgraph of the i-th CNN. Moreover, the latency of the j-th subgraph on the i-th engine is given by \( L_{i}(B=1, \Gamma_{i}, W_{i,j}) = t_{i,j}(1, \Gamma_{i}, W_{i,j}) \).
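The following sketch evaluates the performance model numerically. It rests on three assumptions that should be checked against [14]: the initiation interval matrix is taken as the element-wise ratio of workload to rate, signs in the topology matrix are ignored by taking absolute values, and Eq. (1) is read as the usual pipeline expression in which the first input costs \( D_i \) cycles and each further input in the batch costs \( I_{i,j}^{\text{max}} \) cycles. All function names and numbers are illustrative.

```python
def initiation_intervals(W, Gamma):
    """Element-wise |W / Gamma|; entries with a zero rate are left at 0."""
    return [[abs(w / g) if g != 0 else 0.0 for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, Gamma)]


def subgraph_time(B, W, Gamma, depth, clock_hz):
    """Execution time in seconds of one subgraph for a batch of B inputs."""
    ii_max = max(max(row) for row in initiation_intervals(W, Gamma))
    return (depth + ii_max * (B - 1)) / clock_hz


def cnn_time(B, Ws, Gamma, depth, clock_hz, t_weights):
    """Total time over all subgraphs of one CNN, plus weight-loading times."""
    compute = sum(subgraph_time(B, W, Gamma, depth, clock_hz) for W in Ws)
    return compute + sum(t_weights)


# One arc, two stages: 4096 elements each, at 4 and 2 elements/cycle.
W = [[4096, 4096]]
Gamma = [[4, 2]]
latency = subgraph_time(1, W, Gamma, depth=100, clock_hz=1e8)
batch3 = subgraph_time(3, W, Gamma, depth=100, clock_hz=1e8)
total = cnn_time(1, [W, W], Gamma, 100, 1e8, t_weights=[1e-6, 2e-6])
```

The slowest stage (2 elements/cycle here) sets the initiation interval, which is why balancing the rates of the stages in a pipeline matters more than speeding up any single one.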

**Search Method.** Fig. 2 shows the proposed DSE method. First, by exploring the design space of each individual CNN on the resource budget of the target FPGA, the design points on the latency-resource Pareto front of each CNN are found, without accounting for the shared bandwidth to the external memory. Each individual design point corresponds to different (1) partitioning of the CNN, (2) structure of the pipeline and (3) folding factors for each hardware stage, and is characterised by its performance, on-chip resource consumption and its workload, including the computational and off-chip memory bandwidth requirements of its subgraphs.

As a next step, f-CNN\textsuperscript{x} performs an enumeration of all the combinations of design points that belong to the Pareto fronts of individual CNNs to obtain joint design points, denoted by \( \sigma \). The combinations that do not lie in the feasible space of the target FPGA are discarded based on their aggregate on-chip resource consumption, i.e. only the joint design points that satisfy \( \sum_{i=1}^{N} r_{sc}(\sigma_{i}) \leq r_{sc,\text{Avail.}} \) are retained, where \( \sigma_{i} \) denotes the hardware design for the i-th CNN, \( N \) the number of CNNs and \( r_{sc}(\sigma_{i}) \) the resource consumption vector, including LUTs, Flip-Flops, DSPs and BRAMs. Next, the scheduler module (Fig. 2) takes into account the sharing of the bandwidth and traverses the feasible space to search for the (joint design point, memory transfers schedule) pair that optimises a user-defined objective function. After the highest performing joint design point has been selected, the corresponding multi-CNN architecture is implemented using an automated code generation mechanism.
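The two steps above, per-CNN Pareto filtering followed by enumeration and resource-based pruning of joint design points, can be sketched as below. Design points are represented as hypothetical `(latency, resource_vector)` pairs and the numbers are made up; the real toolflow characterises each point with more attributes.

```python
from itertools import product


def dominates(a, b):
    """True if point a is no worse than b in latency and every resource."""
    (lat_a, rsc_a), (lat_b, rsc_b) = a, b
    return a != b and lat_a <= lat_b and all(x <= y for x, y in zip(rsc_a, rsc_b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def feasible_joint_points(fronts, rsc_avail):
    """Combine one point per CNN and keep combinations that fit the device."""
    joint = []
    for combo in product(*fronts):
        total = [sum(col) for col in zip(*(rsc for _, rsc in combo))]
        if all(t <= a for t, a in zip(total, rsc_avail)):
            joint.append(combo)
    return joint


# (latency in ms, (LUT %, DSP %)) for two CNNs; values are illustrative.
cnn_a = [(10.0, (50, 20)), (5.0, (80, 40)), (12.0, (60, 30))]
cnn_b = [(8.0, (40, 30)), (4.0, (70, 50))]
front_a, front_b = pareto_front(cnn_a), pareto_front(cnn_b)
joint = feasible_joint_points([front_a, front_b], rsc_avail=(120, 80))
```

Filtering each CNN to its Pareto front first keeps the later Cartesian product small, which is what makes the exhaustive enumeration of joint points tractable.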

**C. Scheduler**

The scheduler is responsible for taking into account the effect of the shared memory bandwidth and identifying the
highest performing design for the multi-CNN architecture based on a user-defined objective function. This module takes as input the joint design points of the Pareto front and predicts the actual performance of each point after scheduling the memory transfers. In this respect, the quality of the memory transfers schedule affects substantially the utilisation of the architecture, especially in cases with high bandwidth contention.

To this end, we cast the time-sharing of the external memory bandwidth as a cyclic scheduling problem due to the constant stream of new inputs to the CNNs. Based on this formulation, a set of tasks, in this case CNN inferences, have to be performed repeatedly. The solution of the cyclic scheduling problem would yield a schedule for all tasks in the presence of precedence and resource sharing constraints. In our formulation, the precedence constraints include the dependencies between the subgraphs of each CNN and resource sharing focuses on the off-chip memory bandwidth. Moreover, we require our solution to be periodic with a fixed period, named cycle time, and hence allow each CNN to repeat multiple times during one cycle time. Formally, we pose the following cyclic scheduling problem.

**Inputs:**
- \( N \): the number of CNNs,
- \( N_{W_i} \), \( i \in [1, N] \): the number of subgraphs of each CNN,
- \( S = \{ s_{i,j} \mid i \in [1, N], j \in [1, N_{W_i}] \} \): the set of subgraphs,
- \( L(s) \): the latency of each subgraph,
- \( b(s) \): the memory bandwidth usage for each subgraph,
- \( s_{i,j} < s_{i,j+1},... \): the set of precedence constraints on subgraphs,
- \( K \): the cycle time (or schedule period),
- \( rep(i), i \in [1, N] \): the repetitions of each CNN inference in a cycle time,
- \( B_{mem} \): the available memory bandwidth.

By allowing multiple repetitions of each CNN within a cycle time, the augmented set of subgraphs becomes:

\[
S_{aug} = \{ s_{i,j} \mid i \in [1, N], j \in [1, rep(i)N_{W_i}] \}
\]

**Decision variables:**
- \( st(s) \in [0, K) \), \( s \in S_{aug} \): start time of each subgraph.

In addition, we define the following constraints:

1) All subgraphs must be scheduled and the start time of each subgraph must lie within the cycle time:
\[
0 \leq st(s) < K, \ s \in S_{aug}
\]

2) If subgraph \( s_i \) precedes \( s_j \), then start time of \( s_j \) must occur after the end time of \( s_i \) within the cycle time:
\[
s_i < s_j \Rightarrow st(s_i) + L(s_i) < st(s_j)
\]

3) The memory bandwidth utilisation of subgraphs that are scheduled during the same slot must not exceed the available bandwidth, to minimise contention.
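A candidate schedule can be checked against the three constraints with a few lines of code. This is a sketch under simplifying assumptions: wrap-around across the cycle boundary is not modelled, a successor is allowed to start exactly when its predecessor ends, and the dictionaries keyed by subgraph name are a hypothetical data layout.

```python
def check_schedule(start, latency, bandwidth, precedences, cycle_time, b_mem):
    """Return True if the schedule satisfies constraints 1 to 3."""
    # 1) Every start time lies within the cycle time.
    if any(not (0 <= start[s] < cycle_time) for s in start):
        return False
    # 2) A successor starts no earlier than its predecessor ends
    #    (equality is allowed here, which is an assumption).
    if any(start[a] + latency[a] > start[b] for a, b in precedences):
        return False
    # 3) Aggregate bandwidth only rises when a subgraph starts,
    #    so checking at the start instants is sufficient.
    for t in sorted(set(start.values())):
        active = [s for s in start if start[s] <= t < start[s] + latency[s]]
        if sum(bandwidth[s] for s in active) > b_mem:
            return False
    return True


latency = {"a1": 4, "a2": 3, "b1": 5}
bandwidth = {"a1": 1.0, "a2": 1.5, "b1": 1.0}   # GB/s, made-up values
precedences = [("a1", "a2")]

overlapping = {"a1": 0, "a2": 4, "b1": 0}       # a2 overlaps b1: 2.5 GB/s
serialised = {"a1": 0, "a2": 5, "b1": 0}        # a2 waits for b1 to finish
ok_overlap = check_schedule(overlapping, latency, bandwidth, precedences, 10, 2.0)
ok_serial = check_schedule(serialised, latency, bandwidth, precedences, 10, 2.0)
```

The first schedule is rejected only because of the bandwidth constraint, which is the situation that a rate-controlling mechanism is intended to relax.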

**Slow-down Scheduler.** As described in Sec. II-B, due to the structure of CNNs, the scheduling of memory transfers offers an opportunity for optimisation. Although the on-chip resources constitute a hard constraint which cannot be violated by the aggregate consumption of the CNN engines, memory bandwidth is a soft constraint and can be violated by a design that requires more bandwidth than is available. Nevertheless, bandwidth violations lead to memory contention between the CNN engines, and therefore, if allowed, the performance estimated by the performance model would differ from the actual measured performance, making the DSE irrelevant. Conversely, if we impose bandwidth as a hard constraint and schedule the subgraphs to ensure no violations, the bandwidth will be underutilised, due to the conservative scheduling and the discrete nature of the subgraphs. To alleviate this, we introduce a control mechanism over the processing rate of each CNN engine at any time instant, which is optimised to remove memory violations while maximising bandwidth utilisation.

Classic scheduling algorithms, such as Integer Linear Programming (ILP) and heuristic schedulers, treat each schedulable unit in a faithful manner, without modifying its execution time and bandwidth requirements. Due to this property, such schedulers do not exhibit the flexibility and expressive power that can exploit the per-cycle deterministic control offered by FPGAs over memory transfers. To this end, we propose a rate-controlling scheduler which controls the processing rate of each CNN engine at any instant. We model this by introducing an additional set of decision variables to our cyclic scheduling problem, under the name slow-downs, defined as in Eq. (3). 

\[
sl_{i,j} \in (0, 1], \ i \in [1, N], \ j \in [1, rep(i)N_{W_i}] \tag{3}
\]

We interpret slow-downs as a control factor over the bandwidth allocated to each CNN engine at each time instant. With the pipelines of our architecture operating under a data-driven paradigm, a slower input data rate would slow down the processing speed of an engine and, at the same time, reduce the bandwidth requirements imposed on the off-chip memory by a particular subgraph (Eq. (4)). As a result, with this formulation, a subgraph with bandwidth violations can be slowed down and potentially be scheduled more efficiently to better reflect the actual attainable performance upon deployment.
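The effect of a slow-down on a single subgraph can be illustrated as follows. The sketch assumes proportional scaling, i.e. the required bandwidth is multiplied by the slow-down factor and the latency is divided by it, so that the amount of data transferred stays the same; the function name is hypothetical.

```python
def apply_slow_down(latency, bandwidth, sl):
    """Return (latency', bandwidth') of a subgraph under slow-down sl in (0, 1].

    Assumption: bandwidth scales with sl and latency with 1/sl.
    """
    if not 0 < sl <= 1:
        raise ValueError("slow-down factor must lie in (0, 1]")
    return latency / sl, bandwidth * sl


# A subgraph needing 1.0 GB/s for 8 ms, slowed down to 80% of its rate.
new_latency, new_bandwidth = apply_slow_down(8.0, 1.0, 0.8)
```

Under this assumption the product of latency and bandwidth is unchanged, so a slow-down spreads the same memory traffic over a longer interval rather than reducing it.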

Fig. 3 illustrates the potential benefits of slow-downs in the case of three CNNs. The bottom left image shows the predicted performance if no slow-downs were introduced and no bandwidth violations were allowed. In this scenario, the aggregate required bandwidth of the three subgraphs exceeds the available budget by 1.25× and the subgraphs cannot be scheduled in parallel without causing contention, leading to the schedule depicted on the bottom left of Fig. 3. By applying slow-down factors of 0.8, 0.8 and 0.75 respectively, 80% of the required bandwidth is supplied to the first two subgraphs and 75% to the third and, in this way, the processing rate of each CNN engine is decreased proportionally. This approach decreases the aggregate required bandwidth to the feasible 1.96 GB/s, leading to a shorter schedule.

The extension of the multi-CNN cyclic scheduling problem to include slow-downs expands further the number of design parameters that we have to optimise, leading to a more complex design space. To solve the scheduling problem, we treat it as multiobjective optimisation (MOO) with an objective function that assesses the quality of a joint design point after scheduling. The objective function is user-defined and can be selected to capture the application-level importance of each CNN. Two characteristic objective functions are shown below.

**Objective Function 1. FPSobj**: Optimise the multi-CNN mapping to achieve the target frame rate in frames per second (fps) for each CNN, with equal importance across the CNNs.

\[
\min_{\{\sigma_i\}_{i=1}^{N}} \sum_{i=1}^{N} \left( \frac{fps_i - fps_{targ, i}}{fps_{targ, i}} \right)^2 
\]

s.t. \( \sum_{i=1}^{N} rsc(\sigma_i) \leq rsc_{\text{Avail}} \)

where \( fps_i(\sigma_i) \) is the fps of the \( \sigma_i \) design point of the \( i \)-th CNN given the shared bandwidth constraints, \( fps_{targ, i} \) is set to \( \min(fps_{\text{user}, i}, fps_{\text{max}, i}) \), i.e. the minimum between the user-defined target fps and the maximum attainable fps for the \( i \)-th network on the target platform. The fps of each design point \( \sigma_i \) is divided by the target fps to obtain a non-dimensional objective function and place equal weight to all the CNNs.

**Objective Function 2. MaxThrt**: Optimise the multi-CNN mapping to achieve the maximum throughput in GOP/s for each CNN that lies in the joint design space.

\[
\min_{\{\sigma_i\}_{i=1}^{N}} \sum_{i=1}^{N} \left( \frac{T(\sigma_i) - T_{\text{max}, i}}{T_{\text{max}, i}} \right)^2
\]

s.t. \( \sum_{i=1}^{N} rsc(\sigma_i) \leq rsc_{\text{Avail}} \)

where \( T(\sigma_i) \) denotes the throughput of the \( \sigma_i \) design point of the \( i \)-th CNN in GOP/s given the shared bandwidth constraints and \( T_{\text{max}, i} \) the maximum attainable throughput for the \( i \)-th CNN on the target FPGA. The throughput of each \( \sigma_i \) is divided by the maximum throughput to obtain a non-dimensional objective function and place equal weight to all the CNNs.
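Both objective functions are sums of squared relative deviations and are straightforward to evaluate once the per-CNN performance under the shared bandwidth is known. The sketch below takes those performance figures as plain lists; the function and argument names are illustrative, and the resource constraint is assumed to have been enforced beforehand.

```python
def fps_obj(fps, fps_user, fps_max):
    """FPSobj: squared relative distance of each CNN from its target fps."""
    total = 0.0
    for f, user, fmax in zip(fps, fps_user, fps_max):
        target = min(user, fmax)        # the target is capped at what is attainable
        total += ((f - target) / target) ** 2
    return total


def max_thrt(throughput, throughput_max):
    """MaxThrt: squared relative distance from the maximum attainable GOP/s."""
    return sum(((t - tmax) / tmax) ** 2
               for t, tmax in zip(throughput, throughput_max))


# Two CNNs with made-up figures.
score_fps = fps_obj(fps=[20, 28], fps_user=[25, 30], fps_max=[40, 28])
score_thrt = max_thrt(throughput=[50, 100], throughput_max=[100, 100])
```

Normalising by the target makes a 20% shortfall count the same for a 25 fps model as for a 100 fps one, which is how equal importance across CNNs is expressed.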

The resource-constrained cyclic scheduling problem has been proven to be NP-hard [16]. In our multiple CNN formulation of Sec. III-C which is used to obtain a schedule for each multi-CNN design point, the size of the problem is proportional to the number of subgraphs to be scheduled. For small-sized problems, we model the problem as an integer linear program (ILP) and employ an ILP solver to obtain the optimal solution. The excessive runtime of ILP solvers sets a limit on the scale of solvable problems and, therefore, in such cases, a heuristic scheduler is required to obtain a solution. To this end, we developed a heuristic scheduler that combines Resource Constrained List Scheduling (RCLS) [17] with slow-downs. With this approach, given a joint design point and a set of slow-downs, the lowest latency schedule is obtained.
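For intuition, a generic resource-constrained list scheduler with memory bandwidth as the shared resource can be sketched as below. This is a simplified, non-cyclic greedy variant written for illustration and is not the scheduler of [17] or the authors' implementation; slow-downs would be applied by scaling the latency and bandwidth of each subgraph before calling it. Priority is simply the order of the task list, which is an assumption.

```python
def list_schedule(tasks, preds, latency, bandwidth, b_mem):
    """Greedy list scheduling: at each event time, start every ready task that fits."""
    start, finish = {}, {}
    remaining = list(tasks)             # priority = position in the list
    t = 0
    while remaining:
        # Bandwidth used by tasks still running at time t.
        used = sum(bandwidth[s] for s in start if finish[s] > t)
        for s in list(remaining):
            ready = all(p in finish and finish[p] <= t for p in preds.get(s, ()))
            if ready and used + bandwidth[s] <= b_mem:
                start[s], finish[s] = t, t + latency[s]
                used += bandwidth[s]
                remaining.remove(s)
        if remaining:
            future = [f for f in finish.values() if f > t]
            if not future:
                raise ValueError("task cannot be scheduled within the budget")
            t = min(future)             # jump to the next completion event
    return start


latency = {"a1": 4, "b1": 5, "a2": 3}
bandwidth = {"a1": 1.0, "b1": 1.0, "a2": 1.5}   # GB/s, made-up values
schedule = list_schedule(["a1", "b1", "a2"], {"a2": ["a1"]},
                         latency, bandwidth, b_mem=2.0)
```

In this example the second subgraph of the first CNN is ready at time 4 but has to wait until time 5 for bandwidth to free up, which is the kind of idle gap that slow-downs are meant to trade against.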

---

**Algorithm 1: Memory-aware DSE for multiple CNNs**

Input: Set of joint design points \( \Sigma \) in the feasible space

Objective function \( F(\sigma, sl), \sigma \in \Sigma \)

Off-chip memory bandwidth budget \( B_{mem} \)

Output: Joint design point \( \sigma^* \) chosen for the architecture

Optimised slow-down factors \( sl^* \) for \( \sigma^* \)

1. foreach joint design point \( \sigma \in \Sigma \) do
2.    /\* - - - slow-down initialisation proposals - - - \*/
3.    \( sched_{init} \leftarrow \text{RCLS}(\sigma) \)   /\* without bandwidth constraints \*/
4.    \( viol(s) \leftarrow \text{Violations}(s, sched_{init}, B_{mem}), \ \forall s \in \sigma \)
5.    \( sl_0(s) \leftarrow \text{RemoveViolations}(s, viol(s)), \ \forall s \in \sigma \)
6.    /\* - - - slow-down search - - - \*/
7.    Apply a pattern search algorithm over the slow-downs to optimise for \( F \):
8.    \( [sl, F(\sigma, sl)] \leftarrow \text{PatternSearch}(\sigma, sl_0, B_{mem}, F) \)
9.    if \( F \) improved then
10.       \( \sigma^* \leftarrow \sigma; \ sl^* \leftarrow sl \)
11.   end
12. end

**Memory-aware DSE.** To select the highest performing schedule for each point, we developed an iterative, derivative-free pattern search (PS) optimiser [18] that, given a joint design point \( \sigma \), memory bandwidth budget \( B_{mem} \), initial slow-down vector \( sl_0 \) and target objective function \( F \), searches over slow-downs. At each 2-step iteration, the optimiser first explores neighbouring solutions of the slow-down vector \( sl \) in a finite number of directions. If a solution that improves \( F \) is found, the optimiser updates the slow-down values. Otherwise, a polling step is performed to search for candidate solutions farther away from the current \( sl \). The PS algorithm requires a large number of direct evaluations of \( F \), which are efficiently performed by means of the slow-down scheduler and the SDF performance model (Sec. III-B). In this manner, the highest performing schedule in terms of \( sl \) is obtained for each \( \sigma \).

Algorithm 1 presents the overall memory-aware DSE, searching over both on-chip resources and off-chip memory bandwidth allocations. The DSE searches over different on-chip resource allocations between CNN engines (line 1). For each allocation, the highest performing schedule is found by means of the PS optimiser (lines 7-8). Prior to the optimiser, a greedy strategy is employed to generate slow-down proposals (lines 3-5) that place \( \text{sl}_0 \) in a region of the design space with no violations, in order to facilitate the slow-down search. At the end of the loop, the (architecture, schedule) pair that optimises \( F \) is selected. Further details of the slow-down scheduler and the PS optimiser are omitted due to space constraints.
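Since the details of the PS optimiser are omitted, the following is only a generic derivative-free sketch of the idea: a compass-style pattern search over a slow-down vector constrained to \( (0, 1] \). Note that it refines the step size when no neighbour improves, which is a common textbook variant and differs from the polling step described above; the objective used in the example is a stand-in quadratic, not the scheduler-based \( F \).

```python
def pattern_search(f, x0, step=0.25, min_step=1e-3, lo=1e-3, hi=1.0):
    """Minimise f over the box [lo, hi]^n by probing +/- step along each axis."""
    x, fx = list(x0), f(x0)
    while step >= min_step:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[i] = min(hi, max(lo, cand[i] + delta))   # stay inside the box
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step /= 2           # no neighbour is better: refine the mesh
    return x, fx


# Stand-in objective with its minimum at slow-downs (0.5, 0.75).
def toy_objective(sl):
    return (sl[0] - 0.5) ** 2 + (sl[1] - 0.75) ** 2


best_sl, best_f = pattern_search(toy_objective, [1.0, 1.0])
```

In the actual flow each evaluation of the objective would involve scheduling the slowed-down subgraphs and predicting performance, so starting from a violation-free initial vector reduces the number of evaluations needed.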

To illustrate the impact of the proposed memory-aware scheme, Fig. 4 depicts how the memory-aware design shifts the candidate joint design points to regions with improved objective function values for benchmark 7 of Table III. The horizontal axis shows the average resource usage across LUTs, FFs, DSPs and BRAMs on Zynq XC7Z045. The explored joint design points appear in (blue, red, yellow) triplets. The points of a triplet have the same on-chip resource allocation, but different scheduling. Blue points correspond to the peak performance if each CNN engine had access to the full platform bandwidth. Red points show the case when each engine attempts to access the external memory asynchronously. In contrast to the contention-unaware red points, the memory-aware design enables yellow points to tailor the memory access policy to the target multiobjective criterion and match it to the performance requirements of each CNN, and as a result outperform red points.
[Fig. 4 legend: yellow = memory-aware scheduling; red = contention-unaware scheduling; blue = full platform bandwidth available to each CNN engine; the points of a triplet use the same CNN engines.]


Fig. 4: Effect of the proposed DSE (Table III, benchmark 7).

D. Multi-CNN Hardware Scheduler

The selected schedule is mapped to hardware with a rate-controlling mechanism and a multi-CNN hardware scheduler.

**Rate-controlling Mechanism.** To implement a (schedule, slow-downs) pair, each CNN engine has to be supplied a specific fraction of the available bandwidth at each time instant. To this end, we discretise time into slots of equal size. During a slot, only a single CNN engine is allocated the available bandwidth, with all engines served in a round-robin fashion. By allowing the CNN engines to occupy several consecutive slots, a tunable fraction of the available bandwidth is provided to each engine during each period of slots as given by Eq. (7).

\[
B(s_{i,j}) = \frac{\text{slots}(s_{i,j})}{\#\text{slotsTotal}} \cdot B_{\text{mem}}, \quad i \in [1, N],\ j \in [1, N_W] \tag{7}
\]

where \( B(s_{i,j}) \) is the average supplied bandwidth and \( \text{slots}(s_{i,j}) \) is the number of consecutive slots assigned during the execution of the \( j \)-th subgraph by the \( i \)-th CNN engine. With this formulation, to comply with a selected (schedule, slow-downs) pair, the supplied bandwidth \( B(s_{i,j}) \) has to be equal to the required bandwidth \( b'(s_{i,j}) \) (Eq. 4) for all subgraphs. Hence, the values of \( \text{slots}(s_{i,j}) \) are found by solving Eq. (7) with \( B(s_{i,j}) \) set equal to \( b'(s_{i,j}) \). Finally, the size of each slot in cycles is equal to the selected burst length for the memory transfers and is discussed in the following section.
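
As a worked illustration of solving Eq. (7) for the slot counts, the sketch below computes the number of slots for a few engines and lays out one period of slot owners. The function names, the rounding-up of fractional slot counts, the `-1` marker for unassigned slots and all numeric values are assumptions of this example rather than details of the toolflow.

```python
import math
from typing import List, Sequence


def slots_for_bandwidth(required: float, b_mem: float, slots_total: int) -> int:
    """Smallest slot count whose supplied bandwidth (Eq. 7) covers `required`."""
    if not 0.0 <= required <= b_mem:
        raise ValueError("required bandwidth must lie in [0, b_mem]")
    # Eq. (7) solved for slots; rounding up is an assumption of this sketch.
    # The small tolerance guards against floating-point noise before ceil().
    return math.ceil(required / b_mem * slots_total - 1e-9)


def supplied_bandwidth(slots: int, b_mem: float, slots_total: int) -> float:
    """Average bandwidth delivered by `slots` out of `slots_total` (Eq. 7)."""
    return slots / slots_total * b_mem


def slot_period(slots_per_engine: Sequence[int], slots_total: int) -> List[int]:
    """One period of slot owners: engines are served in round-robin order,
    each holding its consecutive slots; -1 marks an unassigned slot."""
    if sum(slots_per_engine) > slots_total:
        raise ValueError("schedule oversubscribes the memory bandwidth")
    period: List[int] = []
    for engine, count in enumerate(slots_per_engine):
        period.extend([engine] * count)
    period.extend([-1] * (slots_total - len(period)))
    return period


# Hypothetical numbers: 2.0 GB/s budget, 16 slots per period, three engines.
required = [0.5, 1.0, 0.25]  # GB/s needed by the active subgraph of each engine
slots = [slots_for_bandwidth(b, 2.0, 16) for b in required]
period = slot_period(slots, 16)
```

With these made-up values the three engines receive 4, 8 and 2 consecutive slots respectively, and the two remaining slots of the period stay unassigned.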

**Microarchitecture.** A key enabler of the proposed design is the MCNN-HS module, which is responsible for interfacing the CNN engines with the external memory. Fig. 5 shows the microarchitecture of MCNN-HS. The selected schedule is encoded into a compile-time configuration of MCNN-HS by means of the rate-controlling mechanism. The MCNN-HS communicates with the external memory via two memory controllers and hosts two staging buffers that mediate between the external memory and the FIFOs of the CNN engines. The sizes of the staging buffers are determined based on the largest on-chip storage requirement among the target subgraphs. Moreover, the FIFOs are employed to smooth out the time discretisation of the external memory accesses, so that the CNN engines see a continuous flow of data instead of bursts, with their depth configured based on the processing rate of each engine.

MCNN-HS comprises a configuration table and a control unit (CU). The configuration table stores encoded information for each subgraph about the amount of data to be transferred, the allocated number of consecutive slots and the off-chip memory addresses, with the contents of the table determined at compile time by the rate-controlling mechanism. The CU is responsible for orchestrating the multi-CNN schedule at run time. A subgraphs register is used to keep track of the currently active subgraph for each CNN and to look up the appropriate

Fig. 5: Microarchitecture of the multi-CNN hardware scheduler.

---

Table I: Benchmarks

<table>
<thead>
<tr>
<th>Model Name</th>
<th>Layers</th>
<th>Workload</th>
<th>Task</th>
</tr>
</thead>
<tbody>
<tr>
<td>LeNet-5 (Caffe version)</td>
<td>4</td>
<td>0.0038 GOps</td>
<td>Digit Recognition</td>
</tr>
<tr>
<td>CIFAR-10</td>
<td>9</td>
<td>0.0247 GOps</td>
<td>Object Recognition</td>
</tr>
<tr>
<td>PilotNet</td>
<td>10</td>
<td>0.0620 GOps</td>
<td>Wheel Steering</td>
</tr>
<tr>
<td>ZFNet</td>
<td>10</td>
<td>2.2219 GOps</td>
<td>Object Detection</td>
</tr>
<tr>
<td>SceneLabelCNN</td>
<td>8</td>
<td>7.6528 GOps</td>
<td>Scene Labelling</td>
</tr>
<tr>
<td>VGG16</td>
<td>31</td>
<td>30.7200 GOps</td>
<td>Scene Recognition</td>
</tr>
</tbody>
</table>

IV. EVALUATION

A. Experimental Setup

In our experiments, we target the ZC706 board mounting the Zynq XC7Z045 SoC, with a clock rate of 150 MHz. All hardware designs were synthesised and placed-and-routed with Xilinx’s Vivado Design Suite (v17.2) and run on the ZC706 board. The ARM CPU was used to measure the performance of each design. For the evaluation, Q8.8 16-bit fixed-point precision was used, which has been shown to give similar results to 32-bit floating-point [6]. In each multi-CNN benchmark (Tables II and III), the available bandwidth was controlled by using a different number of memory ports and amount of word packing.
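
For readers unfamiliar with the Q8.8 format, the following sketch shows one common interpretation: a 16-bit two's-complement word with 8 fractional bits, so that the stored integer equals the real value times 256. The rounding mode (Python's round-half-to-even) and the saturation on overflow are assumptions of this example; the behaviour of the actual hardware is not specified here.

```python
Q_FRAC_BITS = 8
Q_SCALE = 1 << Q_FRAC_BITS           # 256
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1


def to_q8_8(x: float) -> int:
    """Quantise a real value to a signed 16-bit Q8.8 integer (saturating)."""
    raw = round(x * Q_SCALE)         # round-half-to-even, an assumption
    return max(Q_MIN, min(Q_MAX, raw))


def from_q8_8(q: int) -> float:
    """Recover the real value represented by a Q8.8 integer."""
    return q / Q_SCALE


def q8_8_mul(a: int, b: int) -> int:
    """Multiply two Q8.8 values, rescaling the Q16.16 product back to Q8.8."""
    product = (a * b) >> Q_FRAC_BITS  # arithmetic shift drops 8 fractional bits
    return max(Q_MIN, min(Q_MAX, product))


weight = to_q8_8(1.5)
activation = to_q8_8(0.25)
result = from_q8_8(q8_8_mul(weight, activation))
```

The resolution of this format is 1/256, so any value inside the representable range is reproduced with an absolute error of at most 1/512 after rounding.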

Table I lists our benchmark CNNs. LeNet-5 and CIFAR-10 have comparatively small workloads and are employed to evaluate the RCLS against the optimal ILP scheduler. PilotNet, ZFNet, SceneLabelCNN and VGG16 pose mapping challenges such as the non-uniform filters of ZFNet, the large filters of SceneLabelCNN and the computational intensity of VGG16. Moreover, ZFNet and VGG16 are used for numerous object

---

\textsuperscript{3} By investigating the impact of burst length on bandwidth utilisation efficiency, a burst length of 1024 was selected for MCNN-HS, achieving higher than 90% measured efficiency on ZC706.
In the DSE of the f-CNN\textsuperscript{x} architecture, each engine is connected to a dedicated DMA engine, with all DMA engines running asynchronously.

Table III: Comparison of f-CNN\textsuperscript{x} and the baseline FPGA accelerator without the proposed scheduling (batch size of 1)

<table>
<thead>
<tr>
<th>ID</th>
<th>Benchmark</th>
<th>Model Set</th>
<th>Available Bandwidth</th>
<th>Baseline (GOp/s)</th>
<th>f-CNN<sup>x</sup> (GOp/s)</th>
<th>Speed-up (% gain)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>3 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>0.5 GB/s</td>
<td>(15.43, 26.81, 16.40)</td>
<td>(13.97, 60.14, 48.27)</td>
<td>77% 42%</td>
</tr>
<tr>
<td>2</td>
<td>3 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.0 GB/s</td>
<td>(17.03, 91.23, 26.15)</td>
<td>(19.92, 85.71, 68.90)</td>
<td>42% 51%</td>
</tr>
<tr>
<td>3</td>
<td>3 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.5 GB/s</td>
<td>(22.58, 87.48, 39.01)</td>
<td>(21.45, 92.30, 74.08)</td>
<td>24% 38%</td>
</tr>
<tr>
<td>4</td>
<td>3 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>2.0 GB/s</td>
<td>(22.70, 96.22, 48.76)</td>
<td>(23.05, 99.21, 79.69)</td>
<td>19% 37%</td>
</tr>
<tr>
<td>5</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>0.5 GB/s</td>
<td>(8.12, 0.72, 33.58, 11.22)</td>
<td>(10.39, 1.26, 47.71, 47.87)</td>
<td>91% 54%</td>
</tr>
<tr>
<td>6</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.0 GB/s</td>
<td>(13.51, 1.27, 58.14, 23.33)</td>
<td>(21.81, 1.87, 72.91, 48.77)</td>
<td>57% 45%</td>
</tr>
<tr>
<td>7</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.5 GB/s</td>
<td>(16.00, 1.47, 68.11, 30.37)</td>
<td>(20.00, 1.95, 68.86, 69.08)</td>
<td>46% 40%</td>
</tr>
<tr>
<td>8</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>2.0 GB/s</td>
<td>(15.46, 1.61, 85.14, 37.96)</td>
<td>(16.28, 1.97, 93.43, 75.00)</td>
<td>29% 32%</td>
</tr>
</tbody>
</table>

B. Evaluation of Proposed Scheduler

In this section, the quality of the proposed RCLS-based scheduler is evaluated. This is investigated by using the MaxThrpt criterion (Eq. 6) to generate multi-CNN hardware designs using both the RCLS and the ILP schedulers and measuring the real achieved value on the target FPGA board. The comparisons are performed on small-scale problems in order for the ILP solver to yield a solution in a tractable amount of time, where the scale is defined as the number of subgraphs to be scheduled. We employ the low-end LeNet-5 and CIFAR-10 and compare across six settings by varying the number of CNNs and the available bandwidth. Table IV presents the measured results on ZC706. The selected multi-CNN designs were implemented and run on the target platform and the measured performances were used to yield the achieved MaxThrpt.

Table IV: Comparison of the RCLS and ILP schedulers under the MaxThrpt criterion (LeNet-5 and CIFAR-10)

<table>
<thead>
<tr>
<th>ID</th>
<th>Benchmark</th>
<th>Model Set</th>
<th>Available Bandwidth</th>
<th>ILP scheduler (MaxThrpt / run time)</th>
<th>RCLS scheduler (MaxThrpt / run time)</th>
<th>Speed-up (% gain)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>0.5 GB/s</td>
<td>(0.395 / 5.8 s)</td>
<td>(0.395 / 3.6 s)</td>
<td>-</td>
</tr>
<tr>
<td>2</td>
<td>2 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.0 GB/s</td>
<td>(0.254 / 5.9 min)</td>
<td>(0.254 / 3.6 s)</td>
<td>-</td>
</tr>
<tr>
<td>3</td>
<td>2 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>1.5 GB/s</td>
<td>(0.254 / 3.6 s)</td>
<td>(0.254 / 3.6 s)</td>
<td>-</td>
</tr>
<tr>
<td>4</td>
<td>3 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>4.4 GB/s</td>
<td>(0.946 / 5.8 min)</td>
<td>(0.946 / 2.5 min)</td>
<td>-</td>
</tr>
<tr>
<td>5</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>4.4 GB/s</td>
<td>(1.875 / 1h)</td>
<td>(1.875 / 1h)</td>
<td>-</td>
</tr>
<tr>
<td>6</td>
<td>4 CNNs</td>
<td>LeNet-5, CIFAR-10</td>
<td>4.4 GB/s</td>
<td>(1.875 / 1h)</td>
<td>(1.875 / 1h)</td>
<td>-</td>
</tr>
</tbody>
</table>

D. Comparison with Embedded GPU

With a large number of CNNs being deployed for inference in multi-tasking embedded systems, our evaluation focuses on the embedded space. In power-constrained applications, the primary metrics of interest comprise (1) the absolute power consumption and (2) the performance efficiency in terms of performance-per-Watt. In this respect, we investigate the performance efficiency of f-CNN\textsuperscript{x} on Zynq XC7Z045 in relation to the widely used NVIDIA Tegra X1 (TX1). To comply with the stringent latency needs of modern systems, both the FPGA and GPU designs use a batch size of 1.

For the performance evaluation on TX1, we use NVIDIA TensorRT as supplied by the JetPack 3.1 package. TensorRT is run with the cuDNN library and 16-bit half-precision floating-point arithmetic (FP16), which enables the highly optimised execution of layers. In each benchmark, the TensorRT implementations of the target CNNs are scheduled over the GPU in a rotational and periodic manner. Across all the platforms, each multi-CNN benchmark is run 100 times to obtain the average performance. Furthermore, power measurements for the GPU and the FPGA are obtained via a power monitor on the corresponding board. In all cases, we subtract the average idle power of the board, so that the reported power is the run-time power due to benchmark execution.

The idle power of the ZC706 platform is measured at the board level with no design programmed in the FPGA fabric, so that the clock tree power and the power leakage of the chip are also included in the run-time power due to benchmark execution.
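
The power methodology above amounts to a simple calculation, sketched below. The function name and every number in the example are hypothetical and are not measurements from this work; they only illustrate how the idle power is removed before the performance-per-Watt figure is formed.

```python
def perf_per_watt(gops: float, measured_w: float, idle_w: float) -> float:
    """Performance-per-Watt using run-time power (measured minus idle)."""
    runtime_w = measured_w - idle_w
    if runtime_w <= 0.0:
        raise ValueError("measured power must exceed idle power")
    return gops / runtime_w


# Made-up board-level readings for two platforms (not results from the paper).
fpga_eff = perf_per_watt(gops=60.0, measured_w=9.0, idle_w=5.0)
gpu_eff = perf_per_watt(gops=40.0, measured_w=12.0, idle_w=4.0)
efficiency_gain = fpga_eff / gpu_eff
```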
customisable multi-CNN architecture together with an external memory access policy, the proposed toolflow tailors the allocation of both compute resources and external memory bandwidth to the performance requirements of the target set of CNNs. Evaluation shows that f-CNN\textsuperscript{x} achieves performance gains of up to 50% over mappings that allow memory contention and delivers up to 6.8× higher performance-per-Watt over highly optimised embedded GPU designs. To the best of our knowledge, this work introduces for the first time in the literature the mapping of multiple CNNs. Future work will explore the mapping of multiple CNN workloads in cloud environments.

### REFERENCES