Hierarchical Bayesian Modelling for Knowledge Transfer Across Engineering Fleets via Multitask Learning

A population-level analysis is proposed to address data sparsity when building predictive models for engineering infrastructure. Utilising an interpretable hierarchical Bayesian approach and operational fleet data, domain expertise is naturally encoded (and appropriately shared) between different sub-groups, representing (i) use-type, (ii) component, or (iii) operating condition. Specifically, domain expertise is exploited to constrain the model via assumptions (and prior distributions) allowing the methodology to automatically share information between similar assets, improving the survival analysis of a truck fleet and power prediction in a wind farm. In each asset management example, a set of correlated functions is learnt over the fleet, in a combined inference, to learn a population model. Parameter estimation is improved when sub-fleets share correlated information at different levels of the hierarchy. In turn, groups with incomplete data automatically borrow statistical strength from those that are data-rich. The statistical correlations enable knowledge transfer via Bayesian transfer learning, and the correlations can be inspected to inform which assets share information for which effect (i.e. parameter). Both case studies demonstrate the wide applicability to practical infrastructure monitoring, since the approach is naturally adapted between interpretable fleet models of different in situ examples.


Introduction
Data sparsity can cause significant issues in practical applications of reliability, performance, and safety assessment; in particular, structural monitoring [1], prognostics [2], and performance and health management [3]. In these domains, comprehensive (or high-variance [4]) data are rarely available a priori; instead, measurements arrive incrementally, throughout the life-cycle of the monitored system [5]. For example, the data recorded from the system in unusual environments, or following damage, might take years to collect. Labelling to annotate the measurements can also be limited or expensive, requiring input from a domain expert. Such incomplete data motivate sharing information between similar assets; specifically, asking whether systems with comprehensive data (or established models) can support those with incomplete information.
The concept of knowledge transfer, from one machine to another, has led to the development of population-based [6][7][8] or fleet monitoring [9]. Initial investigations (mostly) consider the quantification of similarity between systems [7] and tools for the transfer of data and/or models from source to target domains [10][11][12]. An alternative approach is considered here, whereby a combined inference is made given the measurements from a collected group of systems [13]. Specifically, a set of correlated, hierarchical models is learnt, given the information recorded from the collected population. Two case studies are presented: survival analysis of an operational truck fleet and wind-power prediction for an operational wind farm. Population-level models are learnt using hierarchical Bayesian modelling [14,15], providing robust predictions and variance reduction compared to independent models and two benchmarks. The multi-task learning approach [14,16] automatically shares information between correlated domains (i.e. sub-groups) such that assets with sparse information borrow statistical strength from those that are data-rich (via correlated variables).

Why learn fleet models?
Throughout this work, the term fleet refers to a population of assets that constitute engineering infrastructure. For example, civil structures (bridges and roads) or vehicles (trains in a rail network). The problem setting from each case study is introduced here to motivate multitask learning from in situ fleet data.

Truck fleets
The first example concerns the survival analysis of components (alternators and turbochargers) in a fleet of heavy-duty trucks maintained by Scania CV. The components are maintained under a run-to-failure strategy, as failure models are unavailable and it is infeasible for drivers to sense incipient failure. Nonetheless, the associated downtime can incur high costs, relating to late goods delivery, re-loading, and towing vehicles to the workshop.
For such components, survival analysis [2] is critical to estimate the time to failure, and is therefore fundamental when designing a maintenance plan. The analysis considers failure occurrences in the population over some specified time period. The period must be sufficiently long that reliability functions can be evaluated based on observed failures or drop-outs [17]. Specifically, this work focusses on the hazard function λ(t), which defines the instantaneous rate of failure: it is the probability P(•) of a component failing at time t, given that it has survived until time t [17],

λ(t) = lim_{δt→0} P(t ≤ T < t + δt | T ≥ t) / δt

where T denotes the time of failure. Empirically, this is calculated as the fraction of trucks that failed to the number of trucks that survived, in a given time interval.
Importantly, each sample from the reliability function requires at least one failure in the historical fleet data. For this reason, if failures are rare in certain sub-fleets, the data that represent the corresponding function will be sparse; Figure 3 later visualises this. If sub-fleets with more failures can inform predictions in groups where failures are rare, this greatly extends the value of the measured data (and the failure events themselves).

Wind farms
The second case study considers power prediction for a group of operational wind turbines. Here, the regression tasks are power curves, which map from wind speed to power output for a specific turbine [18]. The associated function can be used as an indicator of performance and is useful in monitoring procedures [19]. Data-based methods approximate this relationship from operational measurements, typically recorded using Supervisory Control and Data Acquisition (SCADA) systems [20].
Various techniques have been proposed to model data that correspond to normal operation [21][22][23].
In practice, however, only a subset of measurements represent this relationship. In particular, power curtailments will appear as additional functional components; these usually correspond to the output power being controlled (or otherwise limited) by the operator. Reasons for this action include: adhering to the requirements of the electrical grid [24,25], the mitigation of loading/wake effects [26], and restrictions enforced by planning regulations; such data are presented in Figure 16.
Critically, different turbines experience different conditions (i.e. power curves) at varying intervals. If the power of a particular turbine is regularly limited by the operator (as a result of its location in the farm), measurements collected from this operation become far more valuable when they can be shared between turbines. In this case, fleet modelling can be adopted to share (or pool) information.

Novelty
In view of these applications, the proposed fleet modelling approach favours explainability (with some caveats) since each model is informative.
• Rather than a black box, the fleet model is built while encoding multilevel a priori knowledge of fleet behaviour and model constraints, given domain expertise.
• The proposed model automatically determines the level of knowledge transfer between asset groups, learning the inter-task correlations from data and combining this with a priori engineering knowledge.
• In turn, the approach provides formal uncertainty quantification of the fleet effects (parameters) at various asset group granularities (system-specific, operating condition, or population-wide).
• Each subgroup predictor shares information, and the associated fleet model provides new insights: the whole is greater than the sum of its parts (single-task learning).
Such fleet models are desirable since they enable downstream analyses, to determine which groups of assets share information for which (interpretable) parameter; additionally, the model naturally integrates with experimental design or decision processes, formalising the expected optimal action, or the value of data-collection activities. These concepts are demonstrated in the second case study, Section 6.4.
The approach is particularly suited to the sparse, incremental data that are found in many (practical) monitoring applications; for example, in the first (survival analysis) case study, one domain owns a single training observation.

Layout
The paper layout is as follows. Section 2 summarises existing work relating to population monitoring of engineering systems. Section 3 states the contributions of this work. Section 4 introduces a general methodology for knowledge transfer via hierarchical Bayesian modelling. Sections 5 and 6 present the truck-fleet and wind-farm case studies. Section 7 offers concluding remarks.

Related Work
A summary of fleet-monitoring literature is provided.
The term knowledge transfer is used generally to refer to methods that learn from multiple related datasets. Specific definitions of transfer learning are contentious: this work follows Murphy [16], which views multitask learning (MTL) as the combined inference of a set of related tasks, while domain adaptation (DA) is a method of transforming data, such that the same task can be learnt for multiple domains. Both approaches are considered transfer learning, especially when domains share interpretable, parametrised models.

Fine-tuning and domain adaptation
When monitoring engineering populations, the majority of literature focusses on transfer learning.
Transfer learning seeks to improve predictions in a target domain given the information in a (more complete) source domain. Many examples consider crack detection via image classification using Convolutional Neural Networks (CNNs). For example, Dorafshan et al. [27], Gao and Mosalam [28], and Jang et al. [29] detect cracks over a number of domains by fine-tuning the parameters of a CNN trained on a source domain to aid generalisation in the target.
Domain adaptation (DA) is viewed as another variant of transfer learning in engineering applications [30][31][32]. These techniques define some mapping from domain data into a shared space (possibly one of the original domains), where a single model is used to make predictions. For example, Michau and Fink [10] apply a neural-network mapping for DA in the condition monitoring of a fleet of power plants.
DA has also been investigated by (kernelised) linear projection, discussed in a structural health monitoring context by Gardner et al. [33,34], considering methods for knowledge transfer between simulated source and target structures, as well as a simulated source and an experimental target structure [35]. Damage detectors have also been transferred between systems via DA in a group of tailplane structures using ground-test vibration data [11]. To accommodate the class imbalance and data sparsity often associated with monitoring data, Poole et al. [36] introduce statistic-alignment methods for adaptation procedures.

Multi-task learning
An alternative view of population-level models considers multi-task learning (MTL). While the multitask approach also assumes the predictors (i.e. tasks) are correlated over the fleet, the parameters across domains are learnt at the same time, with equal importance. A combined inference allows domain-specific models to share information between related tasks, improving the accuracy in domains where data are limited [37].
Examples of multi-task learning are less prevalent when modelling engineering infrastructure. Wan and Ni [38] successfully use a Gaussian process (GP) to learn correlations between tasks in a multi-output regression. The GP is built using a carefully specified kernel [39] to capture the task and inter-task relationships. The experiments capture correlations between temperature/acceleration sensing systems on a single structure (the Canton Tower), rather than multiple assets in a fleet. Similarly, Li et al. [40] apply correlated GPs to address the missing-data problem over multiple sensors of a hydroelectric dam. The results demonstrate successful knowledge transfer between measurement channels. Considering aerospace engines, Seshadri et al. [41] apply GPs for knowledge transfer between multiple axial measurement planes when interpolating temperature fields within an aircraft engine. Sharing information between planes significantly improves the spatial representation of the response.
Hierarchical Bayesian modelling offers another multitask framework. A model is built with a 'hierarchy' of parameters, whereby domain-specific tasks are correlated via shared latent variables (explained in Section 4). The approach was introduced to structural monitoring by Huang et al. [42] and Huang and Beck [43], who utilise hierarchical models to learn multiple, correlated regression models for modal analysis. A shared sparseness profile is inferred over all tasks and related measurement channels, improving damage detection and data recovery by considering the correlation between damage scenarios or adjacent sensors on the same structure. Some recent, related applications include Di Francesco et al. [44], who use hierarchical models to build corrosion models given evidence from multiple locations, and Papadimas and Dodwell [45], where the results from a series of materials experiments (i.e. coupon samples) are combined to inform the estimation of material properties. Also, Dhada et al. [13] implement hierarchical Gaussian mixture models to cluster simulated data that represent novelty detection for asset management; the model parameters are interpretable in terms of the data distribution, rather than the application domain.

Wider monitoring methods
It is worth considering more general developments in the literature, and how they relate to fleet monitoring. Multi-task neural networks, in particular, show promise when the size (or features) of monitoring data permit their application; e.g. Zhang et al. [46] design a deep architecture for guided-wave datasets. Similarly, Tsialiamanis et al. [47] successfully investigate neural networks for knowledge transfer by mapping measurements from multiple structures onto a common manifold, to learn a shared representation.
A primary motivation of this work, however, is to consider structures/domains with very sparse (or absent) data, e.g. those recently in operation, or new environmental conditions. In turn, model comparisons here are limited to parametric (or shallow [48]) methods of knowledge transfer, centred around interpretable models; each benchmark is outlined in Section 4.4.

Bayesian vs 'deep' knowledge transfer
The distinction between hierarchical (Bayesian) and deep (neural network) approaches to transfer learning is important. The differences emphasise why, in many applications, the proposed (hierarchical) method is required for infrastructure monitoring.
• Both address relative data sparsity (between domains); however, the level of sparsity is method-dependent: generally, deep methods are suited to complex features and big data; hierarchical methods are suited to standard measurements and interpretable models.
• Both improve predictions over multiple asset groups; however, the proposed hierarchical approach provides uncertainty quantification of the nested subgroups, enabling downstream (statistical) analyses, e.g. experimental design or decision processes (demonstrated in Section 6.4).
• Encoding domain (engineering) expertise is natural for multilevel Bayesian models -for example, the knowledge that all turbines in a wind farm have the same maximum power, but the rate at which they limit to a maximum will depend on turbine location.
• Conversely, for neural networks, encoding domain expertise is difficult, since their parameters lack direct physical interpretation; in turn, the inferences (and model constraints) at different levels of fleet granularity are less intuitive.

Contribution
The main contributions of this work are twofold: (i) multi-task learning with hierarchical Bayesian modelling allows information to be shared between distinct (but related) systems using operational fleet data (wind turbines and trucks); (ii) the resultant approach permits formal uncertainty quantification at various levels of the predictive model and, in turn, at various granularities of fleet behaviour (e.g. system-specific, condition-specific, or population-wide). Multiple levels of uncertainty quantification enable natural integration with decision processes, or experimental design procedures, considering the whole fleet. In turn, the model can be used to inform fleet interactions within a wider asset management programme. To highlight this novelty, the hierarchical model is integrated with a demonstrative decision process in the second (wind farm) case study.
Similarly, while the proposed hierarchical model makes inferences from observations at the sub-fleet level only (i.e. task-specific outputs), predictions can be made at various levels, including larger groups and the aggregated population. Inference of the joint population model (from task-specific observations) presents the knowledge-transfer mechanism. The resultant structure produces both shared and task-specific models; this is not true for any of the benchmarks, which learn only one of the two options (i.e. single-task learning, complete pooling, domain adaptation; Section 4.4).

Hierarchical Bayesian Modelling for Multi-Task Learning with Mixed Effects
Consider fleet data, recorded from a population of engineering systems, which are separated into K groups or sub-fleets. The population data can then be denoted,

{x_k, y_k} for k = 1, . . ., K

where y_k is the target response vector for inputs x_k, and {x_ik, y_ik} are the i-th pair of observations in group k.
There are N_k observations in each group, and thus N = N_1 + . . . + N_K observations in total. The aim is to learn a set of K predictors, one for each group, related to classification or regression tasks. Without loss of generality, this work focusses on the regression setting, where the tasks satisfy,

y_ik = f_k(x_ik) + ε_ik

i.e. the output is determined by evaluating one of K latent functions with additive noise ε_ik. Note, for classification, logistic regression would involve modifying the above likelihood for categorisation (a Bernoulli distribution) and passing f_k(x_ik) through the logistic (inverse-logit) function to ensure predictions are between zero and one (binary classification) [16].
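To make the notation concrete, a minimal generative sketch of such population data, K sub-fleets with task-specific linear functions f_k and additive Gaussian noise, might look as follows; the function name, group count, and parameter values are illustrative assumptions, not drawn from any real fleet.

```python
import random

def simulate_fleet(K=3, N=20, seed=0):
    """Toy population data {x_k, y_k}: for each sub-fleet k,
    y_ik = f_k(x_ik) + eps_ik, with f_k linear and task-specific."""
    rng = random.Random(seed)
    data = {}
    for k in range(K):
        intercept = rng.gauss(0.0, 0.5)  # task-specific weights, drawn
        slope = rng.gauss(1.5, 0.5)      # around shared population values
        xs = [rng.random() for _ in range(N)]
        ys = [intercept + slope * x + rng.gauss(0.0, 0.1) for x in xs]
        data[k] = (xs, ys)
    return data

fleet = simulate_fleet()  # dict of K (x_k, y_k) pairs
```

Because the task weights are drawn around common population values, the K functions are correlated, which is exactly the structure the hierarchy exploits.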
The mapping f_k is assumed to be correlated between sub-fleets. In consequence, the models should be improved by learning the parameters in a joint inference over the whole population. In machine learning, this is referred to as multi-task learning; in statistics, such data are usually modelled with hierarchical models [49,50].

Hierarchical Bayesian modelling
In practice, while certain sub-fleets might have rich, historical data, others (particularly those recently in operation) will have limited training data. In this setting, learning separate, independent models for each group will lead to unreliable predictions. On the other hand, a single regression of all the data (complete pooling) will result in poor generalisation. Instead, hierarchical models can be used to learn separate models for each group while encouraging task parameters to be correlated [16]; the established theory is summarised here.
Consider K linear regression models,

y_k = Φ_k α_k + ε_k    (3)

where Φ_k = [1, x_k] is the N_k × 2 design matrix; α_k is the 2 × 1 vector of weights; and the noise vector ε_k is N_k × 1 and normally distributed, ε_k ∼ N(0, σ²_k I).¹ Here, 1 is a vector of ones, I is the identity matrix, and N(m, s) is the normal distribution with mean m and (co)variance s. The likelihood of the target response vector is then,

p(y_k | Φ_k, α_k, σ²_k) = N(Φ_k α_k, σ²_k I)    (4)

In a Bayesian manner, one can set a shared hierarchy of prior distributions over the weights (slope and intercept) for the groups k ∈ {1, . . ., K},

α_k ∼ N(μ_α, diag{σ²_α})    (5)
μ_α ∼ N(m_α, diag{s_α})    (6)
σ_α ∼ IG(a, b)    (7)

In words, (5) assumes that the weights {α_k} are normally distributed N(•) with mean μ_α and covariance² diag{σ²_α}. Similarly, (6) states that the prior expectation of the weights α_k is normally distributed with mean m_α and covariance diag{s_α}; (7) states that the prior deviation of the slope and intercept is inverse-gamma distributed IG(•), with shape and scale parameters a and b respectively.
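A generative reading of (5) to (7) can be sketched by ancestral sampling: draw the population-level variables first, then the K task weight vectors. The hyperparameter values below are placeholders, and the inverse-gamma draw uses the identity IG(a, b) = 1/Gamma(a, scale = 1/b).

```python
import random

def sample_hyperpriors(rng, m_alpha, s_alpha, a, b):
    """One realisation of the hierarchy: mu_alpha ~ N(m_alpha, s_alpha),
    per-element variance sigma2_alpha ~ IG(a, b) (sampled as a reciprocal
    gamma variate). Hyperparameter values are illustrative only."""
    mu = [rng.gauss(m, s ** 0.5) for m, s in zip(m_alpha, s_alpha)]
    sigma2 = [1.0 / rng.gammavariate(a, 1.0 / b) for _ in m_alpha]
    return mu, sigma2

def sample_task_weights(rng, mu, sigma2, K):
    """K weight vectors alpha_k ~ N(mu_alpha, diag{sigma2_alpha})."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mu, sigma2)]
            for _ in range(K)]

rng = random.Random(1)
mu, sigma2 = sample_hyperpriors(rng, m_alpha=[0.0, 1.5],
                                s_alpha=[1.0, 1.0], a=3.0, b=1.0)
alphas = sample_task_weights(rng, mu, sigma2, K=8)
```

Inference runs this story in reverse: given the observed y_k, the posterior over {μ_α, σ²_α, α_k} is approximated (here, via MCMC).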
Selecting appropriate prior distributions, and their associated hyperparameters {m_α, s_α, a, b}, is essential to the success of hierarchical models. In this work, prior elicitation is justified by encoding engineering knowledge in each case study as weakly informative priors [15]. The Directed Graphical Model (DGM) in Figure 1 visualises the general hierarchical regression. The nodes show observed/latent variables as shaded/non-shaded respectively; arrows show conditional dependencies, and plates show multiple instances of sub-scripted nodes.

¹ In this first introductory example, the additive noise variance σ²_k is observed; in the next example, it is unobserved.
² The operator diag{a} forms a square diagonal matrix with the elements from a on the main diagonal and zeros elsewhere.
The K weight vectors α_k are correlated via the common latent variables {μ_α, σ²_α}; i.e. the parent nodes in Figure 1. Note that equations (5) to (7) encode prior belief of independence between latent variables. In this work, this does not restrict the covariance structure of the posterior distribution for {α_k}, since it is approximated using Markov Chain Monte Carlo (MCMC, summarised in Section 4.3).
Via correlations in the posterior distribution, sparse domains borrow statistical strength from those that are data-rich. Crucially, to share information between tasks, the parent nodes {μ_α, σ²_α} must be inferred from the population data. In this way, the sub-fleet parameters α_k are (indirectly) influenced by the wider population. Consider that, if {μ_α, σ²_α} were fixed constants, rather than variables inferred from data, each model would be conditionally independent, preventing the flow of information between domains [16].
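This borrowing of strength has a closed form in the simplest Gaussian case: for group means with known noise variance σ² and between-group variance τ², the partially-pooled estimate is a precision-weighted average of the group mean and the population mean. The sketch below (a standard result, with illustrative numbers) shows that a sparse group is pulled further toward the population mean.

```python
def pooled_mean(group_means, group_sizes, sigma2, tau2, mu0):
    """Partial pooling of Gaussian group means: each estimate is a
    precision-weighted blend of the observed mean and population mean."""
    shrunk = []
    for ybar, n in zip(group_means, group_sizes):
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)  # data precision weight
        shrunk.append(w * ybar + (1 - w) * mu0)
    return shrunk

# a data-rich group (n=100) vs a sparse group (n=1), same observed mean
est = pooled_mean([2.0, 2.0], [100, 1], sigma2=1.0, tau2=1.0, mu0=0.0)
# est[0] stays near 2.0; est[1] shrinks halfway toward mu0 = 0
```

In the full hierarchical model, μ_α and σ²_α play the roles of μ₀ and τ², and are themselves inferred from the population rather than fixed.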

Mixed-effects modelling
The hierarchical structure allows effects (i.e. interpretable latent variables), as well as 'prior' information, to be learnt at different levels. Specifically, the parameters of the model itself (3) can be learnt at the system, sub-fleet, or population level. The inference of parameters at various levels of the hierarchy, while encoding engineering/domain knowledge at each level, constitutes significant novelty here. Returning to the regression example (3), consider that the variance σ²_k of the noise ε_k is in fact unknown. While one could learn K domain-specific noise variance terms σ²_k, it is typically assumed that the noise is equivalent across tasks. Sharing the parameter and inferring it from the population can significantly reduce the uncertainty in its prediction. Of course, this assumption should be justified given an understanding of the problem at hand; for example, when the same sensing system collects all the population data. In terms of notation, (3) remains the same; however, the domain-specific noise vector ε_k is now distributed ε_k ∼ N(0, σ²I). The removal of subscript k from the noise variance implies that the size of σ² remains the same as the number of sub-fleets K increases (unlike α_k). Intuitively, σ² is now a tied parameter [16].
Similarly, it makes sense to also infer effects at the population level, to further reduce model uncertainty³. Throughout this work, it is assumed that shared effects also enter the model linearly, for the target response vector y_k and inputs x_k,

y_k = Φ_k α_k + Ψ_k β + ε_k    (8)

where Ψ_k is some design matrix of inputs, and β is the corresponding vector of weights. Again, there is no subscript k for β (like σ²), as it is tied between sub-fleets. Following Kreft and De Leeuw [49], the β coefficients are considered fixed effects, as they are learnt at the population level and shared, while α_k are random effects, as they vary between individuals. Intuitively, a model with both fixed and random effects can be considered a mixed (effects) model [15,51]. Figure 2 shows the modified DGM of the hierarchical regression. The key differences are the nodes outside of the K plate; these are the tied parameters, learnt at the population level. As Gelman et al. [15] point out, the terms random and fixed originate from a frequentist perspective and are somewhat confusing in a Bayesian context, where all parameters are random, or (equivalently) fixed with unknown values. The terminology is used, however, as it is intuitive considering engineering applications and consistent with established literature in modelling panel or longitudinal data [50]. One should also consider that interpreting mixed-effects models remains challenging, even when models are parametrised. If the effects are not (linearly) independent, the fixed and random coefficients can influence each other, making it difficult to reliably recover their relationships. In turn, the modelling assumptions must be carefully considered when emphasising interpretability.
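A mixed-effects prediction of the form (8) is then just the sum of a task-specific linear term and a shared basis expansion. The sketch below uses two triangular bumps as placeholder basis functions (the case studies use cubic B-splines); all weights are illustrative.

```python
def mixed_effects_predict(x, alpha_k, beta, basis):
    """Evaluate y = Phi_k @ alpha_k + Psi_k @ beta for one sub-fleet:
    Phi_k = [1, x] carries the random (task-specific) linear effect;
    Psi_k collects shared basis functions for the fixed effect."""
    out = []
    for xi in x:
        phi = [1.0, xi]
        psi = [b(xi) for b in basis]
        out.append(sum(p * a for p, a in zip(phi, alpha_k)) +
                   sum(p * w for p, w in zip(psi, beta)))
    return out

# toy shared basis: triangular bumps standing in for cubic B-splines
basis = [lambda t: max(0.0, 1 - abs(t - 0.2) / 0.2),
         lambda t: max(0.0, 1 - abs(t - 0.5) / 0.2)]
y = mixed_effects_predict([0.0, 0.2, 1.0], alpha_k=[0.1, 1.5],
                          beta=[0.3, -0.2], basis=basis)
```

Only α_k varies between sub-fleets here; β (and the noise variance) would be tied across the population, exactly as in Figure 2.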

Inference
In view of graphical models, the observed variables are referred to as evidence nodes. For example, the hierarchical regression in Figure 1 would have the following set of evidence nodes,

E = {[y_k], [x_k], [σ²_k]}

where [y_k] is shorthand to denote the complete set {y_1, y_2, . . ., y_K}. On the other hand, the latent variables are hidden nodes,

H = {[α_k], μ_α, σ²_α}

Bayesian inference relies on finding the posterior distribution of H given E, i.e. the distribution of the unknown parameters given the data,

p(H | E) = p(E | H) p(H) / p(E)

DGM representations are useful since inference can be aided by graph-theoretic results. The systematic application of graph-theoretic algorithms has led to a number of probabilistic programming languages; here, models are implemented in Stan [52].
The parameters are inferred using MCMC, via the No-U-Turn Sampler implementation of Hamiltonian Monte Carlo [53]. Throughout, the burn-in period is 1000 iterations, and 2000 iterations are used for inference. Code based on the first case study is publicly available on GitHub.⁴
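Conceptually, any MCMC scheme targeting p(H | E) could be substituted; the sketch below uses a simple random-walk Metropolis sampler (a stand-in for the NUTS/HMC sampler used via Stan, not the paper's implementation) on a toy one-dimensional posterior, with a burn-in period as above.

```python
import math
import random

def metropolis(logpost, x0, steps, scale, rng):
    """Random-walk Metropolis: propose x' = x + N(0, scale), accept with
    probability min(1, exp(logpost(x') - logpost(x)))."""
    x, lp = x0, logpost(x0)
    chain = []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)
        lp_prop = logpost(prop)
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        chain.append(x)
    return chain

# toy posterior: mean of N(y | mu, 1) with a flat prior
data = [1.8, 2.2, 2.0, 1.9, 2.1]
logpost = lambda mu: -0.5 * sum((y - mu) ** 2 for y in data)
rng = random.Random(0)
chain = metropolis(logpost, x0=0.0, steps=3000, scale=0.5, rng=rng)
burned = chain[1000:]  # discard a 1000-iteration burn-in, as above
est = sum(burned) / len(burned)  # close to the posterior mean (2.0)
```

In practice, gradient-based samplers such as NUTS explore the correlated, higher-dimensional hierarchical posterior far more efficiently than this random walk.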

Engineering applications
In each case study, hierarchical models are formulated for knowledge transfer between asset models. The first concerns survival analysis of truck fleets (hazard curves) and the second concerns power prediction for turbines (power curves). Engineering expertise is encoded in a number of ways: to (i) inform prior elicitation, (ii) determine which effects are random or fixed, and (iii) formulate interpretable parameters. In turn, population modelling offers insights as to which sub-fleets share information for which (interpretable) effect.
Importantly, by considering the collected population, the training data can, in effect, be extended. In turn, parameter estimation is improved, while increasing the reliability of predictions. There are, of course, important considerations when building such models (prior elicitation, mixed-effects formulation, negative transfer); these concepts are discussed throughout.
Throughout, the predictive performance of the multitask methodology (MTL) is compared to three fleet-monitoring benchmarks:
• (STL) Single-task learning: the predictive model learnt from each domain independently.
• (CP) Complete pooling: the predictive model learnt from the collected fleet data, assuming all data are generated by a single task.
• (CRL) Correlation alignment for domain adaptation: sequentially treating each task k as the target domain, and embedding the remaining (source) domains onto the joint distribution p(y_k, x_k) using CORAL [54]. All measurements are then treated as one task, and a single model is learnt to predict the target test data.
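For reference, the core of CORAL can be sketched in a few lines: whiten the source features with the inverse square root of their covariance, then re-colour with the square root of the target covariance. Mean alignment is added here for completeness; this is an illustrative sketch, not the exact pipeline used for the CRL benchmark.

```python
import numpy as np

def coral(Xs, Xt, eps=1e-6):
    """Align second-order statistics of source Xs with target Xt:
    Xs_aligned = (Xs - mean) @ Cs^{-1/2} @ Ct^{1/2} + target mean."""
    def msqrt(C, inv=False):
        # symmetric (inverse) matrix square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        d = np.clip(w, eps, None) ** (-0.5 if inv else 0.5)
        return (V * d) @ V.T
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    return (Xs - Xs.mean(0)) @ msqrt(Cs, inv=True) @ msqrt(Ct) + Xt.mean(0)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [0.5, 1.0]])
Xt = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.3], [0.0, 0.5]]) + 3.0
Xa = coral(Xs, Xt)  # covariance (and mean) now match the target's
```

Note that the transform discards the source's own parameter interpretation, which is exactly the caveat raised below for parametrised fleet models.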
For sensible comparisons, the predictive model is consistent across all benchmarks; what differs is the effective presentation of data during inference. Note that parameter interpretation becomes problematic in domain adaptation (CRL), since the (source) joint distributions {p(y_k, x_k)} have been transformed onto the target [36]. Once transformed, making predictions for new source observations is nontrivial. These caveats highlight a benefit of the proposed methodology; however, comparisons to CRL are included to emphasise that adaptation alone is insufficient to treat all fleet-monitoring problems, especially with parametrised models and sparse data.
By nature of the practical applications (and data sensitivity) in each case study, validation against a ground truth for the parameters is not feasible; for this reason, models are compared to the available (response) ground truth, quantified by the predictive log-likelihood (e.g. (22)).
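For a Gaussian predictive distribution, the predictive log-likelihood takes the following generic form (the paper's exact equation (22) is not reproduced here); better-calibrated predictions score higher.

```python
import math

def gaussian_pred_loglik(y_true, mu, var):
    """Sum of log N(y | mu, var) over held-out test points, given the
    predictive mean and variance at each point."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (y - m) ** 2 / (2 * v)
               for y, m, v in zip(y_true, mu, var))

# an accurate predictor scores higher than a biased one at equal variance
good = gaussian_pred_loglik([1.0, 2.0], [1.0, 2.0], [0.1, 0.1])
bad = gaussian_pred_loglik([1.0, 2.0], [0.0, 3.0], [0.1, 0.1])
```

Because the score penalises both bias and miscalibrated variance, it rewards the uncertainty quantification that the hierarchical model provides.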

Truck-Fleet Survival Analysis
The hazard data for truck-fleet alternators are shown in Figure 3. Herein, this work considers the log-hazard, since it is easier to visualise. There are 437 observations in total, split into a 75% training set and 25% test set. The data are z-score normalised in view of data sensitivity, and certain (specific) details are omitted. The observations represent the complete monitoring dataset, since no observations were lost via normalisation, truncation, or censorship. It is clarified that normalisation affects the direct interpretation of the parameters. In practice, however, one can recover interpretable values by transforming back into the original space. Here, for the purpose of discussion, the relative parameter values and their relationships remain interpretable. To generate the hazard data, the total time in service for all assets was divided into intervals of one day; for each day, the ratio of the number of components that failed to the number that survived (so far) is calculated. The choice of interval length is dependent on the application; here, one day is sufficient compared to the maintenance horizon.
The sub-fleets were manually labelled in collaboration with the engineers at Scania. Colours correspond to different sub-populations, where the total number of groups (and, therefore, hazard functions) is K = 8. (Appendix G post-publication note.) Note that certain domains are more sparse than others, the most extreme case being k = 8, owning a single observation. The population model will look to utilise data-rich domains with more information (k ∈ {1, 2, 3}) to support the sparse domains (k ∈ {5, 6, 7, 8}). The number of task-wise observations is as follows,

Task regression formulation
When analysing survival data, it is convenient to assume the survival time T is parametrically distributed, since the parameters are interpretable and formulate a specific hazard function. A straightforward example is presented when T is exponentially distributed, leading to a constant hazard [55].
Rather than constant, Figure 3 shows the log-hazard is near-linear for a large proportion of the input domain, with a notable nonlinear effect at low t values (early hours in service). Therefore, it is assumed the best (parametric) approximation of the marginal p(T = t) is the Gompertz distribution (G) for each sub-fleet [55],

T ∼ G(γ, φ)    (12)

This is convenient, since (12) is formulated such that the log-hazard is linear in time t,

log λ(t) = log γ + φt    (13)

Since only hazard data were available, tasks are fit directly to (13), rather than the distribution over the time at failure (12). The correct likelihood, however, should consider the distribution (12) as the tasks directly; this avoids assumptions of a Gaussian likelihood for the log-hazard. Instead, the (log) hazard uncertainty would be naturally represented by the variance of γ and φ. Unfortunately, this was not possible here in view of data availability. For a better interpretation of the parameters in practice, and agreement with Kolmogorov's axioms, the likelihood of the population model should represent the time-at-failure T directly.
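The linearity claimed in (13) is easy to verify numerically: under the hazard parametrisation λ(t) = γ exp(φt), consistent with the γ and φ symbols above, equal steps in t produce equal increments in the log-hazard. Parameter values are illustrative.

```python
import math

def gompertz_hazard(t, gamma, phi):
    """Gompertz hazard lambda(t) = gamma * exp(phi * t), so that
    log lambda(t) = log(gamma) + phi * t is linear in t."""
    return gamma * math.exp(phi * t)

ts = [0.0, 0.5, 1.0, 1.5]
log_h = [math.log(gompertz_hazard(t, gamma=0.05, phi=1.5)) for t in ts]
# equal spacing in t gives equal increments (phi * step) in log-hazard
diffs = [b - a for a, b in zip(log_h, log_h[1:])]
```

The intercept log γ and slope φ are exactly the two task-specific weights inferred for each sub-fleet in the following sections.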
Considering the data in Figure 3, a weighted sum of H B-spline basis functions b_h(t) is included to model the (non-parametric) discrepancy between the linear Gompertz hazard and the empirical data,

log λ_k(t) = α_1k + α_2k t + Σᴴₕ₌₁ β_h b_h(t)    (14)

Cubic B-splines (Appendix A) are selected, as they are smooth with compact support, resulting in a sparse design matrix for the β_h terms.⁵ This property is suitable, since the nonlinear response acts in specific (compact) regions of the input. In effect, (14) defines a semi-parametric (or partially-linear) regression [14] with kernel smoothing to approximate the hazard functions for each sub-fleet.
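A generic Cox-de Boor recursion evaluates such a basis; with degree p = 3, it yields the cubic B-splines, illustrating both compact support (at most four bases are non-zero at any t, giving a sparse design matrix) and the partition of unity on the interior knot span. The knot vector below is illustrative, not the one used in the case study.

```python
def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion: the i-th B-spline of degree p on `knots`."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + p] != knots[i]:
        left = ((t - knots[i]) / (knots[i + p] - knots[i])
                * bspline_basis(i, p - 1, t, knots))
    right = 0.0
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, t, knots))
    return left + right

# uniform knots spanning [0, 1] with padding for the cubic (p = 3) basis
knots = [i / 10 for i in range(-3, 14)]
t = 0.45
vals = [bspline_basis(i, 3, t, knots) for i in range(len(knots) - 4)]
```

Only the four bases whose support covers t contribute, which is why the nonlinear effect can be confined to the early-hours region of the hazard.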

Mixed-effects formulation
From Figure 3, one observes the underlying linear trend {α_1 + α_2 t} varies between sub-fleets, while the nonlinear effect Σ_{h=1}^{H} β_h b_h(t) appears consistent over the population. In other words, while the data are poorly described by a (linear) Gompertz hazard function, the (nonparametric) discrepancy remains consistent.
Therefore, the associated spline weights β = {β_h}_{h=1}^{H} are assumed to be fixed effects, learnt at the population level. On the other hand, task-specific linear weights are inferred, correlated via common latent variables (random effects). The mixed-effects model can now be expressed in the general notation from (8). Specifically, for each sub-fleet k: y_k is the output of the log-hazard (14) with additive noise ε_k; x_k are the inputs corresponding to time t; α_k is the varying linear weight vector with design matrix Φ_k = [1, x_k]; and β is the tied/fixed weight vector, with a design matrix of splines. The resultant graphical model corresponds to Figure 2, and the likelihood of the response follows, where θ_k = {α_k, β, μ_α, σ_α, σ} is the set of parameters indexed to task k.
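The per-task Gaussian likelihood described above can be sketched as a small function; `basis_fn` stands in for the spline design matrix of (14), and the names are illustrative rather than taken from the paper:

```python
import numpy as np

def task_log_likelihood(y_k, t_k, alpha_k, beta, sigma, basis_fn):
    """Gaussian log-likelihood of one task's log-hazard data:
    mean = alpha_k[0] + alpha_k[1]*t + B(t) @ beta, with a random linear
    effect alpha_k per sub-fleet and fixed spline weights beta shared
    across the population."""
    mean = alpha_k[0] + alpha_k[1] * t_k + basis_fn(t_k) @ beta
    resid = y_k - mean
    n = y_k.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - 0.5 * resid @ resid / sigma**2
```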

Weakly informative priors
Considering α_k first, it is possible to encode prior knowledge of the expected functions, since the linear component corresponds to a Gompertz survival model (13). It is acknowledged that, in this case, the specific hyperparameter values are less meaningful because the data are normalised; however, their interpretation remains relevant.
Specifically, α_k is distributed according to equations (5) to (7), with the following hyperparameters. The first element of m_α corresponds to the intercept and postulates the baseline log-hazard (or, exponentiated, the initial rate of failure); this is 0 since the data are centred. The second element of m_α is the expected slope of the log-hazard (set to 1.5, as one expects the hazard to increase exponentially under the Gompertz model, with a gradient > 1 when normalised). The s_α values indicate a weakly informative prior under the ranges imposed by z-score normalisation. Similarly, the a, b values encourage correlation between sub-fleet models, such that the prior mode of the standard deviation of the generating distribution of α_k is b/(a + 1) = 1/2 (this intentionally overestimates the deviation σ_α between sub-fleets, so the population model only weakly constrains α_k).
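The stated prior mode b/(a + 1) = 1/2 is satisfied by, for example, a = b = 1 (these specific values are an assumption; the text fixes only the mode). A quick numerical check that the inverse-gamma density peaks there:

```python
import numpy as np

def inv_gamma_log_pdf(x, a, b):
    """Unnormalised log-density of IG(a, b): -(a + 1) log x - b / x."""
    return -(a + 1.0) * np.log(x) - b / x

# mode of IG(a, b) is b / (a + 1); with a = b = 1 that is 1/2, which
# intentionally overestimates the between-sub-fleet deviation sigma_alpha
grid = np.linspace(0.01, 3.0, 10_000)
mode = grid[np.argmax(inv_gamma_log_pdf(grid, a=1.0, b=1.0))]
```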
The shared prior over the variance of the additive noise ε_k (19) has its mode at 0.2, indicating that the standard deviation of the noise is expected to be significantly smaller (around five times) than that of the output, i.e. a high signal-to-noise ratio.
Following a standard approach [15], the basis function model can be centred around the linear component (log λ_G(t)) via specification of the β prior. Specifically, one can postulate a shrinkage prior with high density at zero, to (effectively) exclude basis functions by encouraging their expected posterior weights towards zero, while retaining heavy tails to avoid over-shrinkage. A standard hierarchical prior [56] exhibits these desired properties, where v is some small nonzero value; in this case v = 10^-3.
To summarise, without any data, the prior postulates that the underlying log-hazard is expected to be linear, corresponding to a Gompertz survival model (13). The discrepancy between this simple (parametrised) behaviour and the data is modelled by nonparametric splines, resulting in a semi-parametric regression (14) for each task. Figure 4 visualises the implications of the model and prior, showing the posterior predictive distribution inferred from the most data-rich domain only (k = 1, single-task learning). This experiment is used to validate an appropriate number of splines for the population model, found to be H = 5 through 20-fold cross-validation, presented in Appendix B.
It is intuitive to note that the same independence can, in effect, be achieved for parameters with hierarchical priors (i.e. α_k) by letting the variance of their generating distribution become very large [34] (i.e. σ_α → ∞).
Having conditioned on the training data [y_k], predictions can be made for the unobserved response y*_k at x*_k using the posterior predictive distribution. (Conditioning on x*_k is included here to emphasise prediction.)

Results
To motivate sharing information within the fleet, the regression tasks for each sub-fleet are initially learnt independently. This corresponds to learning separate (task-specific) parameters, which are independent, preventing the flow of information via correlated variables or tied parameters. The separated models can be visualised by removing the K plate from the DGM in Figure 2, while including k-subscripts for σ² and β; Figure 5 presents these updates. Figure 6 shows the resulting domain-wise regression (i.e. single-task learning). The posterior predictive distributions p(y*_k | x*_k, x_k, y_k) make sense under the model/prior formulation; however, independent models ignore valuable information that might be shared through task relationships. In turn, the posterior predictive distribution presents large uncertainty, especially in sparse domains.
Hierarchical modelling is now utilised to learn the parameters in a combined inference from the population data. The mean and standard deviation of samples drawn from the multi-task learning posterior predictive distribution are shown in Figure 7. Visually, the predictive distributions p(y*_k | x*_k, {x_k, y_k}_{k=1}^{K}) better represent belief in the underlying task functions by leveraging information between domains. In particular, information from data-rich domains (k ∈ {1, 2, 3, 4}) informs the (fixed) nonlinear effect.
The predictive (log) likelihood for out-of-sample test data (25%) is evaluated over a large number of trials (100) via bootstrap sampling [16]. The combined population log-likelihood L increases significantly, from 355 to 410, highlighting improvements following inference at the fleet level. Table 1 presents the relative changes for each task, where STL is single-task learning, MTL is multi-task learning, CP is complete pooling, and CRL is CORAL for joint adaptation. Compared to STL there is a relative improvement in all domains (other than k = 7), especially those with sparse training data. In particular, leveraging information enables more reliable extrapolation to late hours in service, where test data are likely to be sparse. It is believed the likelihood decrease occurs in domain k = 7 because the sub-fleet labelling may be unreliable: observing Figure 3, the hazard data could in fact represent more than one group. Improvements to the labelling procedure are discussed as future work, Section 7.
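The evaluation protocol can be sketched as follows; `log_density` is a stand-in for the model's predictive log-density, and the trial count matches the 100 bootstrap resamples described above:

```python
import numpy as np

def bootstrap_log_likelihood(log_density, y_test, n_trials=100, seed=0):
    """Mean out-of-sample log-likelihood over bootstrap resamples of the
    held-out test set."""
    rng = np.random.default_rng(seed)
    n = y_test.size
    totals = []
    for _ in range(n_trials):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        totals.append(np.sum(log_density(y_test[idx])))
    return float(np.mean(totals))

# illustrative check with a standard-normal predictive density at y = 0
y = np.array([0.0, 0.0, 0.0, 0.0])
const = lambda y: -0.5 * np.log(2 * np.pi) * np.ones_like(y)
```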
The complete pooling (CP) and correlation alignment (CRL) benchmarks behave as expected. CP presents the lowest overall log-likelihood L, which makes sense considering the disparity between tasks. CRL successfully improves on CP by transforming the source data (all remaining domains) into the target k, especially when k = 7. However, the total likelihood remains lower than STL, which indicates a high risk of negative transfer; in fact, CRL improves predictions only in k = {6, 7}.
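For reference, the CRL benchmark (CORAL, correlation alignment) matches the second-order statistics of the source features to the target. A compact sketch of the standard transform (a generic implementation under the usual formulation, not necessarily the exact one used in the paper):

```python
import numpy as np

def matrix_power(C, p):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.maximum(w, 1e-12) ** p) @ V.T

def coral(source, target, eps=1e-6):
    """CORAL: whiten source features with the source covariance, then
    re-colour them with the target covariance, so the transformed source
    matches the target's second-order statistics."""
    d = source.shape[1]
    Cs = np.cov(source, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(target, rowvar=False) + eps * np.eye(d)
    return source @ matrix_power(Cs, -0.5) @ matrix_power(Ct, 0.5)
```

After the transform, the covariance of the adapted source equals (up to regularisation) the target covariance, which is precisely the alignment CRL relies on.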
Reductions in the posterior variance of the parameters via multi-task learning are also considered, compared to single-task learning. Figures 8 and 9 show the posterior distributions of the slopes and intercepts respectively: these parameters correspond to the random (linear) effect of the Gompertz model α_1 + α_2 t (13). Variance reductions are most significant in sparse domains (bottom row) and less significant in the data-rich domains (top row). This follows intuition, since the population model allows sparse domains to borrow information via the shared parent nodes {μ_α, σ²_α}, while the data-rich domains are largely unaffected. Quantitatively, the average reduction in standard deviation for the (interpretable) linear weights is 90% and 73% for the slopes and intercepts respectively.
Figure 10 shows the posterior distribution of the fixed weights β. Under the prior specification, these weights adaptively deviate from zero to model the discrepancy from the linear effect in sparse/compact regions of the input (via nonparametric splines). Building on intuition, by tying these parameters, the expected values shift towards the expectation of the data-rich, independent models (k ∈ {1, 2, 3, 4}). In other words, in the population-level inference, the fixed effect is learnt from the domains which have data to describe it.
Likewise, Figure 11 shows improvements in the estimate of σ^(k) when tying the noise effect. The posterior variance is reduced, while the expected values indicate a lower noise variance. This should be expected: by pooling the data to learn σ^(k), the training set is effectively extended; in turn, the posterior moves further away from the weakly informative prior (19).

Modelling additional failures and the risk of negative transfer
The assumptions which select the tied parameters are critical; this caveat is widely acknowledged. If any assumptions prove inappropriate or non-general, the multi-task learner risks negative transfer, whereby predictions are worse than conventional (i.e. single-task) learning: in this case, independent models. In a probabilistic setting, negative transfer manifests as inappropriate inter-task correlations; to control these dependencies, one could utilise shrinkage [50] or automatic relevance determination [56] (between tasks) to protect against such issues. These ideas are suggested for future work.
To highlight concerns of negative transfer, empirical hazard data are considered for another component in the same fleet of vehicles: turbochargers. The survival data are presented in Figure 12, calculated following the same procedure as the alternators. Critically, manually labelling these data is problematic, since it becomes infeasible to categorise observations as the generating functions become more compact, or towards the end of the operational life. The associated unlabelled data are highlighted with small • markers in Figure 12 (Appendix G post-publication note).
There are various options when considering these data. One could treat the observations as a single (pooled) sub-fleet or task, with a large expected variance; alternatively, the labels themselves could be treated as an additional latent variable, such that categorisation into task groups is unsupervised. Here, the unlabelled data are removed during preprocessing, since modelling them is out of the scope of this work; alternative solutions are proposed in the concluding remarks, Section 7. The resulting turbocharger dataset has 287 (normalised) observations over six tasks, such that k ∈ {1, 2, . . ., 6}. As before, the data are split into 75% training and 25% test sets.
From Figure 12, one observes that the turbocharger hazard data are similar to Figure 3 (alternators).
Since the components operate within the same fleet of vehicles, it is assumed that information can be shared between the associated predictors by extending the task-set in the hierarchical model. A naïve approach assumes the same formulation of mixed effects, simply extends the total number of tasks such that K = 14 (i.e. 8 + 6), and infers the parameters from both alternator and turbocharger hazard data. Appendix E presents the posterior predictive distribution of such a model. While the model interpolates well, the extrapolation behaviour is problematic for later hours in service (at the population level, this is not strictly extrapolation, since the response at late hours in service is learnt from the alternator domain). This is because the model assumes the discrepancy (from the Gompertz model) is equivalent for both components, as the nonparametric weights β remain tied over all tasks. The unlabelled data are evidence that this assumption is inappropriate, as the model would generalise poorly to these data, plotted in Appendix E. The resultant model would have a high risk of negative transfer.
Instead, the mixed-effects model is reformulated, whereby a separate, nonparametric discrepancy {β_l}_{l=1}^{L} is learnt for the alternator (l = 1) and turbocharger (l = 2) tasks, introducing two higher-level subgroups, such that L = 2. As before, the parameters of the linear component remain correlated via the shared parent nodes, allowing knowledge transfer between all 14 tasks (both alternators and turbochargers). In turn, the model and prior now postulate a varying underlying linear trend for all tasks (the Gompertz model); however, the discrepancy from this behaviour is component-specific (a separate β_l for each component). The modifications can be visualised by updating the DGM from Figure 2 to include higher-level subgroups l ∈ {1, 2}, presented in Figure 13, where l = 1 denotes alternators and l = 2 turbochargers.
A key difference is the new L-plate and the associated subscripts: K_l is the number of sub-fleets for each component, such that K_1 = 8 (alternators) and K_2 = 6 (turbochargers); while β_l indicates a separate (independent) weight vector for each component. (Figure 13: DGM of hierarchical linear regression with mixed effects, introducing a higher-level group such that the total number of tasks is L × K_l.) In turn, the likelihood of the response is modified, where θ_{k,l} = {α_{k,l}, β_l, μ_α, σ_α, σ} is the parameter set indexed to group k and component l. Figure 14 plots the mean and standard deviation of samples drawn from the posterior distribution of the extended population model (compared to independent turbocharger models). By specifying component-specific weights β_l, the representation of uncertainty improves when extrapolating in the turbocharger domain. Reductions are also observed in the posterior predictive distribution p(y*_{kl} | x*_{kl}) (other conditioning suppressed) for the alternator tasks (l = 1), since the population data have been extended for the linear component. Likewise, variance reductions are observed in the posterior distributions of the intercept and slope, visualised in Appendix F. Quantitatively, the average reduction in standard deviation for the (interpretable) linear weights is 51% and 67% for the (turbocharger) slopes and intercepts respectively.
Fleet-level inference improves the (bootstrapped) predictive log-likelihood from 570 to 646, compared to single-task learning (STL), highlighting improvements in predictive capability for the combined fleet over both components. The task-wise predictive likelihood is presented in Table 2 for the alternator (l = 1) and turbocharger (l = 2) domains, compared to the same benchmarks. Note, however, that the likelihood fails to increase from STL for certain alternator tasks (k = 1 or 5), reiterating the risk of negative transfer in the extended model. Ideally, the dataset should be much larger, to determine whether negative transfer has occurred and whether the current assumptions are appropriate. As before, while correlation alignment (CRL) improves on complete pooling (CP), the adaptation approach is not suitable for the task set, and predictions remain worse than STL; the sparsity of measurements prohibits reliable transformations of the source data into the target domain. Figure 15 visualises the sharing of information between the sub-fleet (k) or component (l) groups. The heat-map corresponds to the Pearson correlation coefficient of the posterior distribution between variables that share parent nodes in the graphical model (i.e. α_{kl}); these correlations enable multi-task learning. Intuitively, Figure 15a shows increased correlation between the intercepts of the same component, with two clear blocks of 8 × 8 (alternators) and 6 × 6 (turbochargers). The intercept correlation structure is interpretable, since components of the same type are likely to have a correlated baseline hazard.
The slope correlation structure in Figure 15b is more descriptive. In the top left block, the alternators are less correlated as domains become more sparse (from 1 → 8); this makes sense, since the level of correlation is reduced where there are fewer data to support task correlation. The effect is most obvious for k = 8 (alternators), which has only a single training point. In both Figures 15a and 15b, the structured covariance of α_{kl} highlights how inter-task correlation contributes to variance reduction in the fleet model.

Practical implications
In the field, use-type labels could be used to make sub-fleet (rather than global) predictions, which has major implications when informing efficiency- or safety-critical interactions with the fleet. For example, task-specific estimates of remaining useful life would be associated with less uncertainty, and the hierarchical model allows both population estimates (from the generating distributions) and task-specific estimates. These multilevel predictions present a key contribution of this work; in turn, a multilevel decision process could be designed for more reliable interactions with the fleet, such as vehicle servicing or component replacement. A hypothetical decision process is demonstrated in the next case study.

Wind Farm Power Prediction
To demonstrate the wide applicability of hierarchical models, power prediction is presented for a wind farm case study. Figure 16 shows power curve data, including curtailments, provided by Visualwind and recorded from three operational turbines. The turbines are the same make and model but in different locations. As before, the data are normalised in view of data sensitivity, and certain (specific) details are omitted; the same comments regarding interpretability, data truncation, and censorship apply. The work in [57] demonstrates a suitable method to represent similar normal and curtailed functions in a combined model; however, each function f_k is assumed independent, so there is no knowledge transfer between task parameters. Here, knowledge transfer is enabled by correlating the regression models in a hierarchical formulation. There are 10,581 observations in total. The data were labelled in weekly subsets, according to turbine k ∈ {1, 2, 3} and operational condition (normal or curtailed) l ∈ {1, 2}. Each point corresponds to a 10-minute average of power y_{ikl} and wind speed x_{ikl}.
The first turbine has 2 weeks of data, the second has 4 weeks, and the third has 11.5 weeks. Missing values and very sparse outliers were removed from the dataset (using the local outlier factor algorithm [58]). Since the first turbine presents a normal power curve only (l = 1), there is a total of five tasks. As before, specific tasks have less data than others. The splits are intentionally inconsistent, to allow a combined inference to leverage information from the data-rich tasks (with historical data) to support sparse tasks (systems recently in operation). In particular, referring to Figure 16: the normal data from the first turbine (k = 1, l = 1: dark blue) should support the sparse normal tasks (k ∈ {2, 3}: dark orange and green); while the data-rich curtailment from the third turbine (light green) should support the curtailed relationship of the second turbine (light orange).

Task regression formulation
A standard power curve model assumes segmented linear regression [23], and a similar formulation is adopted here. Although simple, (25) presents interpretable parameters, visualised in Appendix C: p is the cut-in speed and r is the rated speed (for normal operation); the change-point q corresponds to the initiation of the limit to maximum power P_m (where p < q < r). The gradients m_1 and m_2 approximate the near-linear response between p-q and q-r respectively. The second change point and gradient {q, m_2} enable soft curtailments, rather than a hard limit at maximum power P_m.
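The description above can be sketched as a function. This is a plausible reconstruction of the segmented form in (25), under the assumption that continuity at the rated speed r determines m2 (consistent with the later note that m2 needs no separate prior); the exact parameterisation in the paper may differ:

```python
import numpy as np

def power_curve(x, p, q, r, P_m, m1):
    """Segmented (piece-wise linear) power curve: zero below cut-in p,
    slope m1 on [p, q), slope m2 on [q, r), capped at P_m above rated r.
    m2 is implied by the remaining parameters via continuity at r."""
    m2 = (P_m - m1 * (q - p)) / (r - q)
    return np.where(x < p, 0.0,
           np.where(x < q, m1 * (x - p),
           np.where(x < r, m1 * (q - p) + m2 * (x - q), P_m)))
```

By construction the segments join continuously at p, q, and r, which is what makes the parameters {p, q, r, P_m, m1} interpretable change points and slopes.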

Mixed-effects and prior formulation
From knowledge of turbine operation, the expected power before cut-in should be zero for all turbines (i.e. a fixed effect). The cut-in speed p is also tied as a fixed effect and learnt at the population level, since all turbines have the same design. Similarly, the maximum power P_m is tied between operational labels l ∈ {1, 2}, such that one parameter is learnt for the normal tasks (l = 1) and one for the curtailed tasks (l = 2). Conversely, the change-points {q, r} and gradients {m_1, m_2} are assumed to be correlated between all tasks, i.e. correlated via shared parent nodes. In turn, one would expect the curtailed relationships (l = 2) to be more correlated with each other (and share more information) than with the normal relationships (l = 1), and vice versa.
In the associated formulation, the fixed effects are shown in green and the random effects in purple. Each segment of the regression could be presented in a similar form to (23), such that each component is a standard varying-intercepts/slopes model [15]. Matrix notation is avoided, however, to present the model (and priors) in terms of the parameters {P_m, m_1, m_2, p, q, r}.
The likelihood of the response can be specified using (27), where θ_{kl} = {P_m^(l), m_1^(kl), p, q^(kl), r^(kl)} is the parameter set indexed to turbine k and curtailment l.
Given their interpretability, weakly informative priors are postulated for each parameter. For the change points: μ_p ∼ N(0.2, 0.5), μ_q ∼ N(0.4, 0.5), μ_r ∼ N(0.6, 0.5). These priors reflect that the change points are expected to occur at regular intervals across the input domain, with relatively high variance (on the normalised scale). The priors for the gradient and maximum power include P_m^(1) ∼ N(1, 0.1) and P_m^(2) ∼ N(0.8, 0.1). These distributions postulate the expected gradient m_1 in a normalised space; unit maximum power P_m for normal operation; and a typical 80% curtailment [57] for the limited output P_m^(2). No prior is required for m_2, since it is specified by {P_m, m_1, p, q, r} in (27). As with the truck-fleet example, IG(1, 1) distributions weakly encourage inter-task correlations, such that the prior intentionally overestimates the deviation between task parameters. Similarly, the posterior can be specified using (21), where p([y_k] | Θ) is indexed by (28) and the joint prior p(Θ) is defined using (29) to (31). As before, this is intractable and inferred with MCMC.

Results
Figure 17 shows the posterior predictive distribution from fleet-level inference, compared to independent STL models, plotted with light shading. Intuitively, variance reduction is most obvious for sparse or poorly described domains (orange and dark green). There is an overall increase in the predictive likelihood with fleet modelling, compared to single-task learning, from 8229 to 8258. Table 3 quantifies changes in task-wise predictions compared to the benchmarks (here l corresponds to the operating condition, normal l = 1 or curtailed l = 2, while k is the turbine identifier): there is a likelihood increase in all domains other than (k = 2, l = 1) and (k = 3, l = 2). It is believed these reductions occur because the model is constrained such that, to maximise the overall likelihood, performance in data-rich domains is reduced in a trade-off. In other words, the prior belief is best suited to data-rich tasks; as the prior becomes more informed by data, it becomes less suitable in data-rich domains and instead represents the population. (Consider that the overall likelihood L increases, despite task-wise fluctuations.) To combat this, uninformative priors should be considered [15]; these are discussed in Section 7.
Correlation alignment (CRL) performs less competitively in the wind turbine example, since the measurement distributions shift significantly between each task, training, and testing (the test data correspond to subsequent weeks). In particular, when the source data represent a more complete power curve, the alignment with sparse domains becomes partial, and CRL can produce unreasonable embeddings.
Figure 18 shows the posterior distributions of the parameters inferred at the independent and fleet level. The cut-in speed p moves towards an average of the independent models, with reduced variance; this should be expected, since p becomes tied as a population estimate. The change points q cluster intuitively, such that the normal and curtailed tasks form two groups (dark and light shades). The estimated r parameters are significantly improved through partial pooling; in particular, the green and orange domains shift much further from the weakly informative prior. There is a notable reduction in the variance across all tasks for the slope estimate m_2. The average reduction in standard deviation across these parameters is 25%. The tied maximum power moves toward an average of the relevant tasks (where l = 2). In both operating conditions, parameter tying enables the move from vague posteriors to distributions with clear expected values. The average reduction in standard deviation for the normal maximum is 82%, alongside 37% for the curtailed maximum.
Finally, Figure 20 plots the Pearson correlation coefficient of the pair-wise conditionals of q between tasks.(q is presented since it is the most structured/insightful.)It is clear that, by moving to a hierarchical model, the correlation between related tasks is appropriately captured, with two distinct blocks associated with the normal and curtailed groups.

Practical implications: Decision analysis
In practice, probabilistic predictions from the power model can be used to support decisions at any level of the hierarchy, including the population level. For example, population-level decisions are useful if the operator does not wish to commit to interacting with a specific turbine.
Consider a decision problem whereby an operator must commit to delivering a minimum power in some upcoming time window. This involves decision making under uncertainty, and the formal (statistical) procedure to identify the expected optimal action requires a probabilistic quantification of wind speed and power output. The latter can be achieved by sampling from the posterior predictive distribution at the population level, i.e. p(y* | x*, θ_l), where θ_l = {P_m^(l), m_1, p, q^(l), r^(l)} is sampled directly from the generating distributions. Figure 21 is an example of such a prediction for a given wind speed.
Predictions at this level of the hierarchy are useful since they assume the operator cannot commit to a specific turbine (at this stage).Such predictions would not be available from domain-specific (independent) models; conversely, complete pooling (or domain adapted) predictions would not formally consider the additional variability associated with non-specific turbine identity.
In this example, the operator has three options, each associated with a payout (positive utility) upon successful delivery of power, and a penalty fine (negative utility) if the turbine generates insufficient power; these values are presented in Table 4. A prior probabilistic model of (normalised) wind speed x_pr is shown in Figure 22, as described by (32). (In practice, this information would likely come from a forecasting model.) Figure 23 shows the decision-event tree representation of the problem. Here, the square (decision) node P_L is associated with the available power commitments in Table 4, such that P_L = {L_0, L_1, L_2}.
The circular (probabilistic) node (y* | x_pr) is the probabilistic prediction of power, given the prior model of wind speed. Finally, the triangular (utility) node shows the expected consequence of each action; for example, the expected penalty is E[penalty | P_L] = P(y* < P_L) × penalty-fine(P_L) (36). This information can then be used to rank decision alternatives [59]. For example, in the prior decision tree (Figure 23), the path associated with the highest power level L_2 is optimal (i.e. P*_L = L_2); this was found to have the highest expected utility of 0.33, compared with 0.0 for L_0 and 0.246 for L_1.
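The ranking of commitment levels can be sketched with Monte Carlo draws from the predictive distribution. The commitment levels, payouts, and fines below are hypothetical placeholders, not the values of Table 4, and the predictive draws are a stand-in normal distribution:

```python
import numpy as np

def expected_utility(y_samples, level, payout, fine):
    """Expected utility of committing to power `level`: payout when the
    predicted power meets the commitment, a penalty fine otherwise (36)."""
    p_meet = np.mean(y_samples >= level)
    return p_meet * payout - (1.0 - p_meet) * fine

# hypothetical commitments and utilities -- NOT the values from Table 4
rng = np.random.default_rng(42)
y_star = rng.normal(0.6, 0.1, size=100_000)  # stand-in predictive draws
levels = {"L0": (0.0, 0.0, 0.0), "L1": (0.4, 0.3, 0.2), "L2": (0.55, 0.8, 0.5)}
utilities = {name: expected_utility(y_star, *args) for name, args in levels.items()}
best = max(utilities, key=utilities.get)
```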
A further application quantifies the expected value of data collection activities. Figure 24 extends the problem in Figure 23 to include another decision M: whether to measure wind speed (m) or not (m̄). In the case where measurements are taken, predictions can be made using the new data x_m. A so-called preposterior decision analysis [60, 61] can be completed by sampling from the prior model to generate hypothetical measurements.
Perfect data are assumed, whereby each measurement removes all uncertainty from wind speed (32); here, the resulting value of perfect information (VoPI) is 0.236. The results are presented in Figure 25, which shows a histogram of the expected utilities associated with each of the hypothesised, perfect measurements (samples from the prior model). The mean of these values, E[u_prepost], is labelled next to the dotted line. The expected utility without the data, E[u_pr], is labelled as a dashed line, and the difference (37) is the expected value of the data.
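Under the perfect-data assumption, the value of perfect information is the gap between acting after the uncertainty is resolved and committing up front. A sketch with hypothetical utilities (again, not the paper's Table 4 values):

```python
import numpy as np

def vopi(y_samples, actions):
    """Expected value of perfect information (37): expected utility when each
    hypothetical perfect measurement reveals the outcome before acting,
    minus the prior-optimal expected utility."""
    # utility of each action under every sampled outcome
    U = np.column_stack([np.where(y_samples >= lvl, pay, -fine)
                         for lvl, pay, fine in actions])
    u_prior = U.mean(axis=0).max()    # commit to one action up front
    u_perfect = U.max(axis=1).mean()  # act optimally for every realisation
    return u_perfect - u_prior

# hypothetical actions (commitment level, payout, fine) -- illustrative only
rng = np.random.default_rng(7)
y = rng.normal(0.6, 0.1, size=100_000)
value = vopi(y, [(0.0, 0.0, 0.0), (0.4, 0.3, 0.2), (0.55, 0.8, 0.5)])
```

By construction the VoPI is non-negative, since the mean of per-outcome maxima can never fall below the maximum of the means.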
To summarise, hierarchical Bayesian modelling has provided a full quantification of uncertainty, reflective of different asset subgroups and classes. In turn, the model enables a formal downstream analysis of variable interactions and integration with a utility-based decision process (demonstrated here). The implications are significant, since various concepts can be quantified; for example, the expected optimal action or the value of data collection activities.

Concluding Remarks
Hierarchical Bayesian modelling with mixed effects is demonstrated as an effective method of sharing information between models of fleets of assets in engineering. Parameter estimation and predictive capabilities are improved (for the combined fleet) in two case studies, utilising the same flexible multitask learning framework. Important considerations are discussed when formulating each population model: prior elicitation, mixed-effects formulation, and negative transfer; these concepts are critical to the success of population-level inference.
The proposed hierarchical methodology is desirable since it enables downstream analyses of the fleet model. The method is used to determine which asset models are correlated for which interpretable parameter, at various groupings (e.g. operating condition, system-specific, population-wide). The multivariate (and multilevel) uncertainty quantification enabled by the model is then propagated through a demonstrative decision analysis for the second case study, to consistently and coherently identify expected optimal actions. The expected value of data collection is also quantified, in the context of the decision problem and the underlying model.
The first application concerns the survival analysis of turbocharger and alternator components in an operational fleet of trucks (maintained by Scania). A semi-parametric hazard curve model is improved through partial pooling and parameter tying (15% and 13% increases in predictive log-likelihood), where selected parameters are inferred at the population level, rather than for vehicle subgroups. The method builds on engineering intuition, since correlations in the hierarchy can be inspected to determine which groups of vehicles or components are correlated for which effects in the survival model (i.e. interpretable parameters).
The second study presents power prediction for a group of wind turbines. The SCADA monitoring data were provided by Visualwind, measured from the same model of turbine in different locations. Correlated power curve models are learnt as a segmented (piece-wise) linear regression, described by interpretable parameters. By moving to a population-level inference, parameter estimation is improved, as well as model generalisation (for the combined population estimates). In particular, the estimation of maximum power is significantly improved for turbines with fewer data and recently in operation (up to 82% reduction in the standard deviation of maximum output prediction).
The success of these models depends on the reliability of the domain knowledge encoded in the prior distributions. In this case, priors were postulated as weakly informative, since interpretable parameters and domain expertise allowed sensible prior elicitation. In turn, an appropriate level of knowledge transfer could be determined automatically, given the model and the data, reducing the risk of negative transfer. When such elicitation is infeasible, future work should consider the use of uninformative priors [62], especially for the (variance) parameters that control the level of correlation between tasks.
Future work should consider an objective method to categorise sub-fleet data in a practical setting; this might include clustering assets from specification or operations data. The labelling of data into distinct tasks can be non-trivial in an engineering setting and requires investigation. Finally, extending the multilevel model to capture parameter relationships over the fleet should prove insightful; for example, if the coefficients of the power model were regressed on spatial/temporal inputs for the wind farm, one could simulate (sample) more varied hypothetical members of the population at certain locations or timescales.

Figure 2: DGM of hierarchical linear regression with mixed effects.

Figure 3: Log hazard function data for alternators in the truck fleet. Training and testing markers are • and • respectively. Colours correspond to sub-fleet labels, associated with the task index k ∈ {1, 2, ..., 8}. (Appendix G post-publication note.)

Figure 4: Basis function model for the data-rich domain (k = 1). The parametric Gompertz component (13) is the dashed line and the posterior mean of the semi-parametric model (14), including splines, is the solid line.

Figure 6: Posterior predictive distribution $p(y^*_k \mid x^*_k, x_k, y_k)$: the mean and three-sigma deviation for K independent regression models.

Figure 7: Posterior predictive distribution $p(y^*_k \mid x^*_k, \{x_k, y_k\}_{k=1}^{K})$: the mean and three-sigma deviation for multitask learning with mixed effects.

Figure 8: Variance reduction in the posterior distribution of the intercept parameters $\alpha_1^{(k)}$.

Figure 9: Variance reduction in the posterior distribution of the slope parameters $\alpha_2^{(k)}$.

Figure 12: Log hazard data for turbochargers in the truck fleet. Training and testing markers are • and • respectively. Colours correspond to sub-fleet labels. (Appendix G post-publication note.)

Figure 15 is insightful since it informs which correlations in the hierarchy transfer or share information.
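A sketch of this kind of inspection, using synthetic joint posterior samples for three hypothetical tasks (the sampled values are made up, not drawn from the fitted fleet model): the Pearson correlation matrix of the samples reveals which task-parameters co-vary, and hence which assets share information for that effect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic joint "posterior samples" of one slope parameter for three
# hypothetical tasks; tasks 1 and 2 share a common latent component.
n_samples = 4000
base = rng.normal(size=n_samples)
samples = np.column_stack([
    base + 0.3 * rng.normal(size=n_samples),  # task 1
    base + 0.3 * rng.normal(size=n_samples),  # task 2, correlated with task 1
    rng.normal(size=n_samples),               # task 3, near-independent
])

# Pearson correlation of the sampled parameters (columns = tasks),
# as visualised in the correlation heat maps.
corr = np.corrcoef(samples, rowvar=False)
print(np.round(corr, 2))
```

High off-diagonal entries indicate pairs of tasks whose parameter estimates are informing one another; near-zero entries suggest little transfer for that effect.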

Figure 14: Posterior predictive distribution, the mean and three-sigma deviation for: (top) K independent regression models of turbocharger hazard $p(y^*_{kl} \mid x^*_{kl}, x_{kl}, y_{kl})$; (bottom) multitask learning with mixed effects for all turbocharger and alternator tasks $p(y^*_{kl} \mid x^*_{kl}, \{\{x_{kl}, y_{kl}\}_{k=1}^{K_l}\}_{l=1}^{L})$.

Figure 15: Pearson correlation coefficient of the conditional posterior distribution for the linear coefficients $\alpha_k$ (slopes and intercepts). Purple lines separate the alternator tasks (up to 8) and the turbocharger tasks (up to 6).

Figure 16: Power-curve data from three k ∈ {1, 2, 3} wind turbines of the same make and model. Relationships correspond to normal l = 1 and ideal l = 2 operation.

Observation counts per task, with fleet totals $\sum_{k=1}^{K_l}$:

                      k = 1   k = 2   k = 3   total
  normal (l = 1)       1075    1869    5845    8789
  curtailed (l = 2)       -     637    1155    1792

The proportions of training data are listed below. The observations remain ordered to test generalisation to measurements from later operational life.

Figure 19 presents insights relating to the maximum power estimates $P_m$. The tied parameter for the normal maximum $P_m^{(k,1)}$ moves toward the data-rich tasks.

Figure 18: Changes in the posterior distribution: the cut-in speed p, initiation of curtailment q, rated speed r, and linear slope m1. Independent models (hollow) compared to population-level modelling (shaded). When a parameter is tied (or fixed) the distributions are black.

Figure 20: Pearson correlation coefficient of the conditional posterior distribution for the rated wind speed q; tick labels correspond to (k, l). Purple lines separate the normal (l = 1) from the curtailed (l = 2) task-parameters.

Figure 21: Samples from the posterior predictive distribution of normalised power, at an arbitrary input of normalised wind speed = 0.5.

Figure 22: The prior distribution of normalised wind speed.

Figure 24: Decision-event tree representation of the value-of-information analysis.

Figure 25: Expected utilities associated with hypothesised wind speed measurements. The expected VoPI is shown as the difference between the expected utilities with ($E[u_{\text{pre-post}}]$) and without ($E[u_{\text{prior}}]$) the wind speed data.
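The underlying calculation can be illustrated with a self-contained toy problem (a two-state, two-action decision with made-up probabilities and utilities, not the paper's wind-farm analysis): the expected value of perfect information is the gap between the expected utility of deciding after observing the uncertain state and that of committing to an action beforehand.

```python
import numpy as np

# Hypothetical two-state, two-action decision sketch of the value of
# perfect information (VoPI). States might represent low/high wind.
p_state = np.array([0.6, 0.4])        # belief over the uncertain states
utility = np.array([[5.0, 1.0],       # action 0: utility in each state
                    [0.0, 8.0]])      # action 1: utility in each state

# E[u_prior]: choose the single best action under current uncertainty.
prior_value = (utility @ p_state).max()

# E[u_pre-post]: observe the state first, then pick the best action per state.
prepost_value = utility.max(axis=0) @ p_state

vopi = prepost_value - prior_value    # value of (perfect) information
print(prior_value, prepost_value, vopi)
```

The VoPI is always non-negative, and comparing it with the cost of data collection indicates whether the hypothesised measurement is worth acquiring; the paper propagates the fleet model's multilevel uncertainty through exactly this kind of comparison.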

Table 4: Financial outcomes of the decision analysis.