Forced vital capacity trajectories in patients with idiopathic pulmonary fibrosis: a secondary analysis of a multicentre, prospective, observational cohort

]), followed by cluster 2 (4·74 years [3·96–5·73]), and was longest in cluster 4 (5·56 years [5·18–6·62]). Baseline FEV1 to FVC ratio and concentrations of the biomarker SP-D were significantly higher in clusters 1 and 3. Similar lung function clusters with some shared anthropometric features were identified in the replication cohort.


Introduction
Idiopathic pulmonary fibrosis is a chronic respiratory disease, characterised by progressive lung scarring and loss of lung function. 1 The prognosis is poor, with a median survival of 3-5 years. 2 However, the progression of disease is variable, with some patients showing stable lung function over time, whereas others progress rapidly or experience episodes of acute deterioration. 2,3 Change in forced vital capacity (FVC) is an accepted marker of disease progression in patients with idiopathic pulmonary fibrosis. 4–7 Identifying and characterising pulmonary function trajectories soon after diagnosis 3 is crucial for establishing prognosis, making clinical management decisions, 2,3,6,7 and interpreting results from interventional clinical trials. 2,6,7
Evaluation of disease progression in clinical trials and observational studies in patients with idiopathic pulmonary fibrosis is often hampered by missing data on lung function, 6–9 affecting the power and accuracy of statistical models for assessing decline of lung function. 4,5,10,11 As idiopathic pulmonary fibrosis progresses, missed spirometry visits promote survivor bias by raising the mean FVC because missing values are associated with exacerbation of the condition or mortality among patients. 6–8 To mitigate this bias, previous studies have used various methods to adjust for data loss. 4–7,10,11 However, these approaches can introduce alternative biases, making it difficult to accurately measure and model the disease trajectory of idiopathic pulmonary fibrosis over extended time periods. 6–9,12,13 Machine learning algorithms can overcome some assumptions and might mitigate biases induced by other imputation methods. 14,15 Missing data remain an issue for machine learning tools; however, additional mathematical techniques can estimate numerous possible outcomes by resampling the underlying distributions thousands of times 14,15 to generate enhanced synthetic datasets, which can be used to train machine learning algorithms 16 to operate as an imputation tool. 17,18
We aimed to enhance the power of a longitudinal cohort of patients with incident idiopathic pulmonary fibrosis through imputation of data on lung function to estimate FVC loss due to disease progression. Subsequently, we applied unsupervised self-organising maps (SOMs) to identify distinct clusters of disease trajectories among patients with idiopathic pulmonary fibrosis, which could inform disease management and improve the efficacy of clinical trials.

Study design
We did a secondary analysis of longitudinal data on FVC collected from a cohort of patients with idiopathic pulmonary fibrosis from the PROFILE study, 19 a multicentre, prospective, observational cohort study. We also performed a replication of the analysis on an independent dataset (ie, the replication cohort), obtained from the Chicago Consortium, which included longitudinal FVC measures obtained from the UUS study (in the USA, the UK, and Spain) collected by the University of Chicago (Chicago, IL, USA). 20 The PROFILE study and the replication cohort have been described previously (appendix p 1). 19,20

Data analysis
Imputation methods were chosen on the basis of the data being continuous rather than categorical, and following literature review. 6,7,13–15,18 Methods included simple

Research in context
Evidence before this study
We searched PubMed without date or language restrictions from date of database inception to Dec 9, 2021, using the following search terms: "idiopathic pulmonary fibrosis", "lung function", and "clustering" or "imputation". We identified three studies that had performed a cluster analysis in retrospective or registry collections to identify clusters of patients with interstitial lung disease on the basis of various clinical features, including lung function. No previous study had combined data-driven and self-organising algorithms to understand the nature of long-term lung function trajectories in patients with idiopathic pulmonary fibrosis. Among studies that had imputed missing data on lung function, these data were often attributable to a linear change or population average, and no studies had assessed several methods for imputing missing data. No previous study used unsupervised self-organising maps to model and interpret long-term lung function to evaluate the presence of distinct lung function trajectories and how they associate with clinical outcomes.

Added value of this study
To our knowledge, this is the first study to identify and validate lung function trajectories with a two-stage machine learning approach, including both supervised and unsupervised approaches, in a long-term prospective observational cohort of patients with incident idiopathic pulmonary fibrosis. Using a Markov Chain Monte Carlo simulation approach, we were able to overcome the challenges associated with low statistical power due to missing data, often in cases where disease severity was a barrier to lung function testing. We performed an extensive series of internal sensitivity and validity analyses, as well as external replication, to provide robust conclusions. Our analyses showed that a model-based cluster analysis was able to find four discrete trajectories of longitudinal lung function in patients with idiopathic pulmonary fibrosis. These clusters were associated with distinct clinical and biochemical features that might have important implications for clinical management. Our machine learning analysis showed that two-thirds of patients followed a typically observed disease trajectory comprising a steady initial decline in lung function; however, a third of patients showed alternative trajectories with either improved or stable lung function over time, complicating the interpretation of widely used endpoints in patients with idiopathic pulmonary fibrosis. Importantly, these clusters were associated with an improved prognosis in the cohort of treatment-naive patients with pulmonary fibrosis.

Implications of all the available evidence
Our findings on the different lung function trajectories in patients with idiopathic pulmonary fibrosis could have major implications for research and patient care. Our imputation models provide valuable comparisons that can support evaluation of endpoints from data with non-random missingness. Stratification of patients by lung function cluster would support the design of clinical trials and effective randomisation to support the assessment of treatment-related effects and to minimise confounding of natural disease trajectories. Similarly, understanding the natural history of lung function in treatment-naive patients might help to inform their prognosis in the medium and long term on the basis of short-term changes in lung function.
interpolation of missing values, including conventional linear regression, last observation carried forward, and 10% annual reduction in percentage predicted FVC (-10% decline per year), 6,7,13 as well as machine learning approaches, including random forest 15 and k-nearest neighbours 14 classifiers 16 capable of dealing with non-linear data and data that are not normally distributed. 14,15,18 Due to the longitudinal connectivity between the spirometric visits related to a patient, all imputations were performed as consecutive chained equations. 18
For testing imputation methods, we used the complete dataset, consisting of 82 patients who completed all six spirometric visits (appendix p 13), split into learning datasets (57 [70%]) and test datasets (25 [30%]). Internal tenfold cross-validation was used to optimise machine learning models. Synthetic simulation of missing data was conducted by removing data randomly from the test dataset, in proportion to the distribution of the occurrence of missing spirometric appointments in the whole PROFILE cohort. The lowest normalised root mean squared deviation (NRMSD) from separate models was used to assess the reliability of imputation. This index measures the differences between values predicted by a model and observed values. The NRMSD represents the square root of the mean squared difference between predicted and observed values, divided by the SD of the observed values. 14,15
To minimise survival bias and to increase statistical power, we did an analysis that included imputed values at all timepoints, regardless of the reason for missingness, including death. Based on the results of the complete dataset, we built a continuous autoregressive model. 17 Integrating this model into Markov Chain Monte Carlo (MCMC) allowed incorporation of stochastic volatility over time, simulating events not experienced by patients in the complete dataset, such as abrupt FVC decreases or below mean FVC values preceding patient death. 11,17 We termed the resulting dataset the naive dataset. To mitigate against residual survival bias in this naive dataset, we generated a further theoretical dataset (10 000 simulations each). In this dataset, we substituted 41·7% dummy values (including FVC=0) into the naive dataset and distributed these values proportionally to the mortality rate observed from the first year to the third year in the PROFILE study. We assessed the sensitivity of these imputation approaches by comparing NRMSD values across all spirometric visits.
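The NRMSD comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's R workflow: the function names `nrmsd` and `locf`, the choice of the population SD, and the FVC values are all assumptions made for the example. It masks some visits in a complete series, fills them by last observation carried forward, and scores the imputed values against the observed ones.

```python
import math
from statistics import mean, pstdev


def nrmsd(observed, predicted):
    """Root mean squared deviation divided by the SD of the observed values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences of at least two values")
    mse = mean([(p - o) ** 2 for o, p in zip(observed, predicted)])
    sd = pstdev(observed)  # assumption: population SD; the source does not say which SD
    if sd == 0:
        raise ValueError("observed values have no spread")
    return math.sqrt(mse) / sd


def locf(series):
    """Last observation carried forward; None marks a missed visit."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical percentage predicted FVC at six visits for one patient.
observed = [80.0, 78.0, 75.0, 73.0, 70.0, 68.0]
# The same series with three visits removed to simulate missingness.
masked = [80.0, 78.0, None, 73.0, None, None]

filled = locf(masked)
gaps = [i for i, value in enumerate(masked) if value is None]
# Score only the visits that were actually imputed.
score = nrmsd([observed[i] for i in gaps], [filled[i] for i in gaps])
print(filled, round(score, 3))
```

A lower score means the imputed values sit closer to the held-out observations, so competing imputation methods can be ranked by running each on the same masked data.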
We performed the unsupervised cluster analysis using SOMs. As a preprocessing step, we normalised the data by centralisation and scaling, which transformed the data into scale-free values. We performed hyperparameter optimisation before clustering. The SOM network was trained for the corresponding dataset for 200 iterations to minimise quantisation error. The learning rate started from 1·00 and was set to 0·90 (ordering) and to 0·02 (tuning), and the neighbourhood distance was set at 1·00 with hexagonal topology. 21 Due to algorithmic similarities between k-means and SOMs, we used the Elbow method to identify the optimal number of clusters in our datasets. The validity (or stability) of each cluster was assessed by Jaccard indices after the sensitivity analysis. The minimum threshold for cluster stability by Jaccard indices was set at 50%. 21
We performed three additional sets of sensitivity analyses on the generated clusters. First, clusters were generated by use of 3 years of spirometry data from the following datasets: the complete PROFILE dataset, the complete PROFILE dataset excluding patients with data missing due to death, and data from patients who completed all spirometric visits without imputation. The second sensitivity test analysed the clusters generated by use of spirometry data from baseline to the first year, baseline to the second year, baseline to the third year, and from patients who completed all six spirometric visits. These analyses were performed in the same way in the replication cohort. The final sensitivity test included the cluster generation by k-means on the PROFILE dataset.
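Two of the steps above, centring and scaling before clustering and the Jaccard check of cluster stability, can be illustrated as follows. This is a sketch under stated assumptions rather than the study's code: `zscore_columns` and `jaccard` are hypothetical helper names, the patient identifiers and FVC values are invented, and the SOM training itself is not reproduced here.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Centre and scale each column (one column per visit) to mean 0 and SD 1."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    if any(sd == 0 for sd in sds):
        raise ValueError("a column with no spread cannot be scaled")
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def jaccard(members_a, members_b):
    """Size of the intersection divided by the size of the union of two member sets."""
    a, b = set(members_a), set(members_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical FVC values: three patients (rows) at two visits (columns).
fvc = [[80.0, 70.0], [90.0, 60.0], [100.0, 50.0]]
scaled = zscore_columns(fvc)

# Hypothetical patient IDs assigned to one cluster in the original run
# and to its best-matching cluster in a resampled run.
original_cluster = {1, 2, 3, 4, 5, 6}
resampled_cluster = {2, 3, 4, 5, 6, 7}

index = jaccard(original_cluster, resampled_cluster)
stable = index >= 0.5  # the 50% threshold quoted in the text
print(round(index, 3), stable)
```

Scaling each visit separately stops visits with larger absolute values from dominating the distance calculation, which is the usual reason for this preprocessing step before distance-based clustering.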
Serum biomarkers were measured from samples that were prospectively collected at baseline and analysed as previously described (appendix pp 1-2). 19

Statistical analysis
We implemented a workflow using open-source packages from the R project (version 4.1.1). Scripts are deposited online. To evaluate associations between lung function and disease trajectory between clusters, we applied a mixed-effects linear model with repeated measures analysis of annual rate of change in FVC. We performed the mortality risk assessment between clusters using hazard ratios (based on the Cox proportional hazards model), Kaplan-Meier plots, and log-rank tests. Survival probability at any particular timepoint was calculated by the formula: (number of participants living at the start − number of participants who died) / number of participants living at the start.
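The survival probability formula above, chained across consecutive intervals, gives a Kaplan-Meier style curve. The sketch below is illustrative only: the function names are hypothetical, censoring is ignored (a simplifying assumption), and the yearly death counts are those reported for the PROFILE cohort in the Results, so the output is not the study's reported survival estimate.

```python
def interval_survival(n_at_start, n_died):
    """(living at start - died) / living at start, for a single interval."""
    if n_at_start <= 0 or not 0 <= n_died <= n_at_start:
        raise ValueError("counts must satisfy 0 <= n_died <= n_at_start and n_at_start > 0")
    return (n_at_start - n_died) / n_at_start


def cumulative_survival(intervals):
    """Multiply interval survival probabilities in time order."""
    survival, curve = 1.0, []
    for n_at_start, n_died in intervals:
        survival *= interval_survival(n_at_start, n_died)
        curve.append(survival)
    return curve


# 415 participants; 48, 68, and 57 deaths in years 1, 2, and 3.
# Assumption: nobody is censored, so each year starts with the previous survivors.
deaths_per_year = [48, 68, 57]
intervals, alive = [], 415
for died in deaths_per_year:
    intervals.append((alive, died))
    alive -= died

curve = cumulative_survival(intervals)
print([round(s, 3) for s in curve])
```

With no censoring, the final value reduces to the fraction of the cohort still alive at 3 years; with censored participants, the number at risk at the start of each interval would shrink by more than the deaths alone.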
Estimates for the Cox proportional hazards model and mixed-effects linear model tests were adjusted for covariance and limited to baseline percentage predicted FVC in all analyses. Wilcoxon's signed-rank test was used for continuous variables, and Fisher's exact test was applied for categorical variables.
All comparisons among clusters were adjusted with the Bonferroni correction method. Data are median (95% CI), unless otherwise indicated. All statistical tests were two-sided, and p<0·05 was considered to be significant.
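As a brief illustration of the Bonferroni adjustment mentioned above, the sketch below multiplies each p value by the number of comparisons and caps the result at 1. The function name and the p values are hypothetical, and the assumption that all six pairwise comparisons among four clusters form one family is made for the example only.

```python
from math import comb


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


n_clusters = 4
n_pairs = comb(n_clusters, 2)  # number of pairwise comparisons among the clusters

# Hypothetical unadjusted p values, one per pair of clusters.
raw = [0.0005, 0.004, 0.02, 0.03, 0.2, 0.6]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
print(n_pairs, significant)
```

The correction is conservative: a comparison that is nominally significant before adjustment can lose significance once the number of tests is taken into account.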

Role of the funding source
The funder of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report.

Results
415 (71%) of 581 participants recruited into the PROFILE study were eligible for this secondary analysis, while 180 (40%) of 455 participants from the independent dataset were eligible for inclusion in the replication cohort (figure 1). Mean baseline FVC was 80·1% (SD 18·9). 321 (77%) participants were men, 94 (23%) were women, and mean age among participants was 70·6 years (SD 7·8; appendix p 13). Data on complete lung function were available in 82 (20%) participants. Data were missing due to death in 173 (42%) participants, of whom 48 (12%) died during the first year, 68 (16%) in the second year, and 57 (14%) in the third year. These missing data values meant that 488 (29·4%) of 1660 data points required imputation for the full analysis of lung function. A further 196 (11·8%) data points were missing for unknown reasons. Overall, from the complete PROFILE dataset of 415 patients, the dataset excluding patients with data missing due to death comprised 242 patients, and all data points were available from 82 patients who completed all spirometric visits.
For scripts see https://github.com/MTWGroup/TrajectoriesIPF1
In the replication cohort, comprising 180 individuals who qualified for imputation, the optimal number of clusters was also four (appendix p 8). SOM analysis showed similar cluster architecture with regard to the size of each cluster and the nature of lung function trajectories, with 74 (41%) participants in cluster 1, 38 (21%) in cluster 2, 42 (23%) in cluster 3, and 26 (14%) in cluster 4 (table 2; appendix p 8). Furthermore, the four clusters generated by SOMs in the PROFILE dataset were reproduced with the k-means clustering algorithm. These k-means clusters had identical architecture and similar membership allocation to those generated by SOMs (appendix p 12).
Participants in cluster 1 followed a linear decline in lung function, and this represented the most common phenotype in both the PROFILE cohort and the replication cohort (figure 3; table 1; appendix p 8). These patients had similar median survival with and without adjustment for baseline FVC in both cohorts (2·87 years [IQR 2·29-3·40]; figure 4; appendix p 10). Participants in cluster 1 were generally younger and more often never smokers, although the association between smoking status and disease trajectory was not significant (p=0·084). Biochemically, cluster 1 was associated with the highest concentrations of serum surfactant protein-D (SP-D; figure 5).
Cluster 2 was the third most common cluster in both the PROFILE cohort and the replication cohort, and had a low number of never smokers in both cohorts (table 1). This cluster was associated with older age and a history of ever smoking. Concentrations of SP-D, as well as the FEV1 to FVC ratio, were significantly lower in cluster 2 than in clusters 1 and 3 (figure 5A, D). figure 4). The unadjusted median survival of participants in cluster 2 did not differ significantly from that of participants in cluster 1 or cluster 3 in the replication cohort (appendix p 9), but was similar to that of participants in cluster 2 in the PROFILE cohort when adjusted for baseline lung function (appendix p 10).
Participants in cluster 3 showed an initial decline in lung function with subsequent stabilisation, and this cluster was the second most common cluster in both cohorts (figure 3E, F; table 1). This cluster was associated with high mortality (figure 4; table 1), high FEV1 to FVC ratio (figure 5), and high concentrations of PROC28 (figure 5C). Similarly, high mortality was observed among participants in cluster 3 in the replication cohort (appendix p 10).
Cluster 4 represented the smallest group of patients in both cohorts and reflected stable lung function over 3 years (table 1) and had low concentrations of SP-D, but a tendency for high concentrations of REC1M and the lowest FEV1 to FVC ratio (figure 5). Cluster 4 had the highest number of ever smokers in the PROFILE cohort, although this association was not significant (table 1). This cluster had the longest median survival of the PROFILE cohort (5·56 years [95% CI 5·18-6·62]), differing significantly from that of cluster 1 (p<0·0001), cluster 2 (p=0·03), and cluster 3 (p<0·0001; figure 4). Although cluster architecture was similar between the PROFILE cohort and the replication cohort, mortality was higher in the replication cohort, even after adjusting for baseline FVC (appendix p 10). The genetic analysis of common variants of idiopathic pulmonary fibrosis between the four clusters showed some nominal associations, but nothing of significance. Furthermore, no cluster was found to be associated with frequency of the at-risk MUC5B minor allele (table 1; appendix p 15).

Discussion
This study used machine learning methods to analyse lung function trajectories in two cohorts of patients with idiopathic pulmonary fibrosis. Using a random forest MCMC approach, we overcame the challenges associated with missing data and low statistical power in simple interpolation methods, such as last observation carried forward and simple linear regression. 4,6,8,12,13 Conventional linear regression was acceptable for imputing data in the first year, similar to previous studies, including 12-month daily home spirometry studies. 22,23 However, in these studies, a degree of heterogeneity exists that is not observed by regression to the mean, even in home spirometry studies that are often limited to short durations. 22,23 We applied a model-based cluster analysis that, following a series of internal sensitivity and validity analyses, showed four discrete clusters of lung function trajectory. These clusters were associated with distinct anthropometric features with important implications for clinical management and future clinical trial design.
Cluster analysis in interstitial lung disease is an emerging concept. At least three studies have performed such analyses using registry cohorts, integrating various clinical features (including comorbidities) in an attempt to identify distinct phenotypes. 24–26 However, these studies did not seek to identify discrete patterns of disease behaviour in patients with idiopathic pulmonary fibrosis. Our analyses identified four distinct FVC trajectories, which challenge the current understanding of the natural history of idiopathic pulmonary fibrosis. 3,6,7,10–12,27 Patients in clusters 1 and 3 showed disease trajectories that followed the expected decline in lung function over the first year, and this continued throughout the duration of illness for patients in cluster 1, but stabilised for patients in cluster 3. Patients in clusters 1 and 3 were, unsurprisingly, more likely to have data missing due to death. More surprisingly, a third of patients in the overall cohort followed an alternative trajectory (ie, clusters 2 and 4) and showed either improved or stable lung function in the first year followed by a conventional trajectory (cluster 2), or remained stable throughout the duration of the study (cluster 4). Clusters 2 and 4 were associated with a better prognosis in patients with incident idiopathic pulmonary fibrosis than were clusters 1 and 3. Similar findings were reported in a post hoc analysis of the INPULSIS studies, which investigated the efficacy and safety of nintedanib in patients with idiopathic pulmonary fibrosis. However, this analysis was performed without imputation, which might underestimate the effect in patients receiving placebo and lead to immortal time bias in favour of therapy, thus reinforcing the need to undertake imputation in such analyses. 4 The reasons behind the improvement in lung function among participants in cluster 2 are unclear, but there are several possible explanations. 
These patients might have had acute, or infective, exacerbations at enrolment into both studies that improved before the typically observed deterioration in lung function occurred. 28 Nevertheless, this potential reason is unlikely given that it would require over 20% of patients with idiopathic pulmonary fibrosis to have an acute or infective exacerbation within a 6-month period of enrolment into both studies. Although such exacerbations are common, most estimates of incidence of acute exacerbations are lower than 20% within 1 year and are associated with poor prognosis. 3,28 Another explanation could be that cluster 2 included patients with concomitant chronic obstructive pulmonary disease who showed labile results on spirometry. 29 Compared with the other three clusters, cluster 2 contained more ever smokers and the FEV1 to FVC ratio was lower; however, this cluster was not associated with a lower diffusion capacity for carbon monoxide, suggesting that these patients did not have substantial emphysema, the form of chronic obstructive pulmonary disease most commonly associated with idiopathic pulmonary fibrosis. 30 The disease trajectory in cluster 2 might reflect response to antifibrotic or immunosuppressive therapy, although patients in both studies were not receiving antifibrotic therapy at the time of recruitment, and treatment in idiopathic pulmonary fibrosis slows disease progression, rather than improves lung function. 5,6 Furthermore, it is possible that individual variation in FVC values might have resulted in unusual patterns of lung function following cluster analysis; however, this is unlikely given the large number of patients in cluster 2 and that the nature and size of the cluster were replicated in the replication cohort. 
Although the reasons for the observed increase in FVC over the first year in cluster 2 are yet to be elucidated, it is important to recognise its occurrence in a substantial proportion of patients with idiopathic pulmonary fibrosis. Failure to recognise this occurrence could mislead interpretation of clinical response if unequal randomisation occurs in trials. Importantly, in the CAPACITY1 study and the CAPACITY2 study, 6 the groups receiving placebo showed a different disease trajectory to those receiving pirfenidone, which ultimately delayed the regulatory approvals and introduction of this treatment into clinical practice. It is possible that this finding might have been due to the inclusion of patients showing cluster 2 or 4 trajectories in the placebo groups, who, combined, made up about a third of the patient cohort with idiopathic pulmonary fibrosis in both the PROFILE cohort and the replication cohort. Identifying patients who are likely to show the disease trajectories of clusters 2 and 4 could have practical implications for clinical management. The notion of a therapeutic trial would be misleading for patients in cluster 2, who are likely to show an improvement in FVC despite, rather than because of, therapy. Furthermore, the risk-benefit ratio of any given therapy might be altered among patients in cluster 4, particularly those recently diagnosed with an FVC of more than 80%. However, further studies to prospectively test the predictive power of such models are needed.
There are various strengths to the approach used in this study. We used several validity and sensitivity analyses to identify optimal imputation methods over both the short and long term. Importantly, when trying to define natural history, we were able to analyse data from a prospective cohort of largely untreated patients to generate the imputation models and to identify the clusters of lung function trajectory. Additionally, we were able to replicate the clusters' architecture in an external cohort of patients with idiopathic pulmonary fibrosis, and both cohorts shared some common anthropometric features.
However, our study also has several limitations. Due to the extent of missing data, only a small number of patients with idiopathic pulmonary fibrosis were available to train the imputation algorithm, which might reduce the model's ability to identify further smaller clusters. 21 Missing data are a challenge for all studies of idiopathic pulmonary fibrosis, both reducing statistical power and promoting survival bias. 9,18 We observed that a random forest-based imputation method had the lowest NRMSD, particularly at later timepoints 2 years or more from baseline, suggesting that machine learning approaches were most appropriate for these studies. However, we acknowledge that any imputation algorithm might not reflect the accurate decline in FVC, especially over longer periods of time. We also recognise that this approach might introduce potential imputation biases, affecting cluster formation and subsequent interpretation of results. However, we believe that the advantages of data imputation with machine learning over standard interpolation models, as well as the extended sensitivity, validity, and replication analyses performed, substantially outweigh these limitations. Heterogeneity of individual lung function trajectories exists within clusters, which is unsurprising given the nature of lung function decline in patients with idiopathic pulmonary fibrosis. 5,11 Nevertheless, this heterogeneity does not detract from strategies to stratify patients in clinical studies, given that, until now, all lung function trajectories were considered to be a uniform cluster. A further limitation to the current study is the absence of unadjusted replication of the relationship between cluster and mortality signal between the PROFILE cohort and the replication cohort. 
This difference might reflect the different nature of the two cohorts; PROFILE was a prospective cohort of patients with incident idiopathic pulmonary fibrosis, whereas the replication cohort was obtained from a registry cohort of patients with prevalent idiopathic pulmonary fibrosis and substantially less lung function at entry into the cohort. The small number of patients in cluster 4 of the replication cohort might have amplified the mortality signal. However, following adjustment for baseline FVC, clusters 1, 2, and 3 had similar survival in both cohorts.
This study identifies distinct trajectories of lung function in patients with idiopathic pulmonary fibrosis and has important implications for the development of clinical trials and clinical practice. Further improvement in collection of patient registry data and cluster methodology, as well as collaboration between research groups, will increase the accuracy of imputation and granularity of cluster analysis, thus facilitating further understanding of unique clusters of patients with pulmonary fibrosis, including those with pulmonary fibrosis of known cause. Development of these approaches could help to treat each patient with the correct treatment at the correct time.