A National Initiative in Data Science for Health: An Evaluation of the UK Farr Institute

Objective: To evaluate the extent to which the inter-institutional, inter-disciplinary mobilisation of data and skills in the Farr Institute contributed to establishing the emerging field of data science for health in the UK.

Design and Outcome measures: We evaluated evidence of six domains characterising a new field of science: defining central scientific challenges, demonstrating how the central challenges might be solved, creating novel interactions among groups of scientists, training new types of experts, re-organising universities, demonstrating impacts in society. We carried out citation, network and time trend analyses of publications, and a narrative review of infrastructure, methods and tools.

Setting: Four UK centres in London, North England, Scotland and Wales (23 university partners), 2013-2018.

Population: Subsets of the UK’s 65 million population with health records accessible for research.

Results: The Farr Institute published a research corpus around a central scientific challenge, demonstrating insights from electronic health record (EHR) and administrative data at each stage of the translational cycle in 593 papers with at least one Farr Institute author affiliation on PubMed. Sample sizes showed some evidence of increase but remained less than 10% of the UK population in primary care-hospital care linked studies. The Farr Institute established the first four ISO27001 certified trusted research environments in the UK, approved more than 1000 research users, published on 102 unique EHR and administrative data sources and established open platforms for the scalable re-use of EHR phenotyping (>70 diseases, CALIBER). The co-author publication network expanded from 944 unique co-authors (based on 67 publications in the first 30 months) to 3839 unique co-authors (545 papers in the final 30 months). 4/5 Centres established 27 new faculty (tenured) positions, 3 new university institutes, and 3 new master’s courses, training >400 people at master’s, short-course and leadership level and 48 PhD students. There were over 2300 citations for the 10 most cited papers and Farr research informed eight practice-changing clinical guidelines and policies relevant to the health of millions of UK citizens.

Conclusion: The Farr Institute played a major role in establishing and growing the field of data science for health in the UK, with some initial evidence of benefits for health and healthcare. The Farr Institute has now expanded into Health Data Research UK but a key challenge is to network such activities internationally.

What is already known:

• National research initiatives in data science for health are under way in several countries seeking to harness insights from electronic health record (EHR) and administrative data at regional and national scale for patient and public benefit.

• One approach to grow this emerging field, adopted by the UK, is to establish a dedicated national research institute, the Farr Institute.

• We do not know how effective such initiatives are: multi-centre, inter-disciplinary research initiatives are common, but there is a lack of research evaluating such initiatives (in general) and national research institutes (in particular).

• The Farr Institute ran from 2013 until 2018 when its larger-scale successor, Health Data Research UK, was established.

What this study adds:

• We provide a framework of six domains relevant for evaluating new inter-institutional, inter-disciplinary initiatives seeking to establish and grow an emerging field of science: defining central scientific challenges, demonstrating how the central challenges might be solved, creating novel interactions among groups of scientists, training new types of experts, re-organising universities, demonstrating impacts in society.

• We show that the Farr Institute created new activities in and across each of the six domains for developing a distinctive research field.

• The Farr Institute demonstrated the ability of multiple UK health research funders and multiple universities to partner in mobilising data, methodology and expertise across disciplines, organisations and information governance domains, resulting in a larger scale of research and improved methodology.

• We have demonstrated globally relevant challenges and opportunities for developing data science for health across institutional and disciplinary barriers, consistent with the need for big investigation not simply big data.

We carried out citation, network and time trend analyses of peer-reviewed research publications, and a narrative review of infrastructure, methods, tools, teaching and impacts.

Identification and analysis of platforms, methods, tools and impacts
We identified clinical guidelines and policy documents citing Farr Institute publications through annual reports to funders, automated software used by funders to capture a range of outputs and impacts (Researchfish) and by contacting investigators.

Identification, annotation and statistical analysis of peer-reviewed publications
We carried out citation, network and time trend analyses of Farr Institute publications identified from a PubMed search, with manual annotation in a subset. We identified publications by a search of PubMed (search term: Farr Institute [AD]; 18 September 2018). We manually annotated another subset of Farr publications identified by the four Centres of the Farr Institute as reflecting their output (25 per centre, 100 total). QL and HH manually annotated the full text publications, extracting information on attribution to the Farr Institute, university, NHS and other affiliations, sources of EHR and administrative data, methods, research theme, disease area, departmental affiliation. We defined attribution to the Farr Institute as at least one author who: listed Farr Institute as an author affiliation in PubMed, acknowledged funding for the Farr Institute, or was in receipt of funding from the Farr Institute. To be eligible each publication reported the use of one or more sources of EHR or administrative data or methods directly relevant to data science. Data sources were classified as primary care, hospital discharge data (Hospital Episode Statistics for England, Patient Episode Database for Wales, Scottish Morbidity Record), detailed hospital data, disease and procedure registries, mortality, other health, and socio-economic and other non-health data. Consented studies without use of such EHR or administrative data were not eligible. We extracted for each publication the number of people providing the denominator population (sample size) classified as healthy (general population sample) or based on specific disease or procedure. We classified research themes as: citizen driven health, discovery science, quality and outcomes, trials and public health. We analysed the publications in terms of scale, number and type of data sources reported, cross-centre activity, citation number. 
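The attribution and eligibility rules above can be expressed as a small classifier. The following is a minimal sketch only, not the annotation procedure actually used by QL and HH: the `Publication` record, its field names and the category labels are hypothetical, chosen to mirror the three attribution routes and the data-source eligibility criterion described in the text.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical annotation record for one full-text publication."""
    farr_affiliation: bool = False        # an author lists the Farr Institute in PubMed
    farr_acknowledgement: bool = False    # Farr Institute funding is acknowledged
    farr_funded_author: bool = False      # an author was in receipt of Farr funding
    uses_ehr_or_admin_data: bool = False  # reports at least one EHR/administrative source
    uses_data_science_methods: bool = False


def is_attributed(pub: Publication) -> bool:
    """Attribution holds if any one of the three routes applies."""
    return pub.farr_affiliation or pub.farr_acknowledgement or pub.farr_funded_author


def is_eligible(pub: Publication) -> bool:
    """Eligibility requires EHR/administrative data or relevant methods."""
    return pub.uses_ehr_or_admin_data or pub.uses_data_science_methods


def attribution_category(pub: Publication) -> str:
    """Assign one mutually exclusive attribution category (labels are assumptions)."""
    if pub.farr_affiliation and pub.farr_acknowledgement:
        return "affiliation and acknowledgement"
    if pub.farr_affiliation:
        return "affiliation only"
    if pub.farr_acknowledgement:
        return "acknowledgement only"
    if pub.farr_funded_author:
        return "funded investigator only"
    return "not attributed"


# Toy example: a consented study with no EHR data and no Farr affiliation,
# authored by a Farr-funded investigator.
example = Publication(farr_funded_author=True)
print(attribution_category(example), is_eligible(example))
```

Checking the affiliation-and-acknowledgement case first keeps the categories mutually exclusive, so each annotated publication is counted once.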
We visualised the change over time in scientist network behaviour in the Farr Institute, with co-author publication networks based on the 593 unique publications, using Cytoscape. 13 We used the halfway time point, comparing the first vs the final 30 months of the Farr Institute.
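The comparison of co-author networks across the two halves of the funding period can be illustrated with a short sketch. This is not the Cytoscape workflow used in the study: it only shows, under stated assumptions, how unique co-authors (nodes) and co-author pairs (edges) might be counted before and after a midpoint. The record layout, the start and end dates and the author labels below are all hypothetical.

```python
from datetime import date
from itertools import combinations


def coauthor_network(author_lists):
    """Build an undirected co-author network from lists of author names.

    Returns (nodes, edges): nodes are unique authors, edges are unordered
    pairs of authors who share at least one publication.
    """
    nodes, edges = set(), set()
    for authors in author_lists:
        unique = sorted(set(authors))          # sort so each pair has one canonical order
        nodes.update(unique)
        edges.update(combinations(unique, 2))  # every pair of co-authors on this paper
    return nodes, edges


def split_at_midpoint(records, start, end):
    """Split publication records into those dated on/before and after the midpoint."""
    midpoint = start + (end - start) / 2
    early = [r["authors"] for r in records if r["date"] <= midpoint]
    late = [r["authors"] for r in records if r["date"] > midpoint]
    return early, late


# Toy records; the dates and author labels are placeholders, not study data.
records = [
    {"date": date(2014, 3, 1), "authors": ["A", "B", "C"]},
    {"date": date(2015, 1, 1), "authors": ["A", "B"]},
    {"date": date(2017, 6, 1), "authors": ["A", "D", "E", "F"]},
    {"date": date(2018, 2, 1), "authors": ["B", "D", "G"]},
]
early, late = split_at_midpoint(records, date(2013, 4, 1), date(2018, 4, 1))
early_nodes, early_edges = coauthor_network(early)
late_nodes, late_edges = coauthor_network(late)
print(len(early_nodes), len(early_edges), len(late_nodes), len(late_edges))
```

In practice author names would need disambiguation before counting; the sketch treats each distinct string as one author.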
Electronic copy available at: https://ssrn.com/abstract=3312791

Table 1), with a total of 2466 citations. Four of these highly-cited papers illustrate the higher resolution of using linked EHR. 14,15,16,17 These different research themes were applied across different clinical areas, including cardiometabolic, maternity and child health, mental health, cancer, renal and respiratory (Figure 1 bottom panel).

Central scientific challenge
There was some evidence of a modest increase in the scale (number of people analysed in the denominator population) of research over time in these publications (Figure 2), based on linked primary-secondary care data in adults. But by 2018 this represented only 6.15% of the UK population. 18 There was just one paper that used the whole of England's hospitalisation data: Freemantle and colleagues 19 analysing weekend mortality effects using 14.8 million admissions, and several using all England's deaths data.

How the challenge may be addressed: access to research-ready data
In 2013 there were no independently accredited Trusted Research Environments (TRE) for NHS data: by 2018 there were four (one in each centre) ISO 27001:2013 certified data safe havens (Table 2a). The TREs provide secure remote access, a safe environment for the analysis of sensitive patient identifiable data, a prerequisite for receiving unconsented, individual level health data for research use. We found evidence that Farr activity enabled other scientific fields: with over 1000 approved users on these 4 data safe havens working on over 300 research projects (the majority being external, having no Farr Institute funding). The Farr enabled the research use of diverse anonymised patient records, linked across primary care and secondary care, including NHS imaging data, blood laboratory values and reimbursed prescriptions (Table 2b). There was a cumulative total of 102 unique data sources reported in these publications (Figure 3), with 13% from primary care, 17% limited coded hospital data, 8% detailed hospital data, 19% registries of disease and procedures, 26% socio-economic and environment, 6% death data and 12% other health data. The setting and names of each unique data source reported in these publications are shown in Suppl Table 2.

How the challenge may be addressed: phenotyping
In 2013 there were no openly accessible portals for defining diseases and health-related conditions using electronic health record data (EHR 'phenotyping'). The Farr Institute supported several initiatives in disease phenotyping (Table 2c): these included CALIBER, an open platform 20 of re-usable EHR phenotypes (code lists + logic + validations) for over 70 diseases which have been re-used in more than 50 publications with more than 80 ongoing projects. 21 In addition there were several publications of EHR phenotypes in Wales and Scotland 22 and a clinical code repository. 23 Methods of surfacing the entire structured and unstructured data in a hospital have now been demonstrated in three hospitals with CogStack and SemEHR. 24

Novel interactions among scientists sharing common interests
Based on author affiliation, the search [Farr Institute[AD]] on PubMed returned 594 unique publications (from inception to 18 September 2018). Figure 4 shows that there was a large expansion of co-author publication networks comparing the first 30 months (67 publications with 944 unique co-authors) and the final 30 months (545 papers and 3839 unique co-authors). Suppl Figure 5 shows that overall across the 100 publications, 28% included Farr Institute as both author affiliation and funder acknowledgement, 14% as author affiliation only, 11% as funder acknowledgement only and 42% as Farr-funded investigator only, as confirmed by the centres. There was some evidence that over time both author affiliation and funder acknowledgement increased. Based on the departmental affiliation of co-authors there was some evidence of greater inter-disciplinarity in the last 30 months of the Farr Institute compared to the initial 30 months (Suppl Figure 6). We identified 17% of publications involving universities from across two or more Farr

Training new types of skills
In total 432 people were trained at master's, short-course and leadership level; most of these opportunities did not exist in 2013 (Table 3). There were 48 PhD students.

potentially affecting the type or duration of drug treatment), and implementation of genomic medicine. The change in practice recommendations potentially affects more than a million UK citizens.

University organisation
More widely Farr informatics research informed changes in government strategy from a centralised to a decentralised approach to integrating place-based health and administrative data for multiple analytic purposes. 26 This work also generated a £20m pilot of problem-based data integration, pulling through data by addressing care pathway blockages and research questions of importance to the local community in regions of 3-5m population. 27 This became the blueprint for England's Local Health and Care Record Exemplars. 28

DISCUSSION
Clinicians, patients and policy makers have growing expectations of the use of data to provide research insights with the potential to improve health and care outcomes. 29 New scientific fields tend to have high-priority defining characteristics; we provide evidence in six recognised domains suggesting the Farr Institute played a major role in establishing and growing the field of data science for health in the UK. The experience of the Farr Institute has informed the design of Health Data Research UK (HDR UK), and this evaluation is relevant to the inter-institutional, inter-disciplinary challenges of scaling up health science around big data in many parts of the world.

Evolution of UK national research institute in policy context
A substantial achievement of the Farr Institute and its funders was the founding of its larger successor HDR UK. The key differences and similarities of the two organisations are shown in Suppl Table 1.

Rationale for national research initiatives in data science for health
Countries differ in their approaches to advancing data science for health. Currently, as far as the authors are aware, other countries have not established a national research institute dedicated to data science for health directly analogous to the Farr Institute or HDR UK. The challenges facing the Farr Institute, and now HDR UK, are common to any research initiative based on catalysing inter-institutional and inter-disciplinary collaboration. Previous policy reports have recommended the need for intra-national methodological developments in data science for health as an important basis for international collaboration. 30

Central scientific challenge: scale
Providing a 'more powerful telescope' by enabling EHR and administrative data at greater scale (larger sample sizes) is part of the central scientific challenge. Although nationwide primary care data exist in the Institute paved the way for federation of research data queries and distributed analytics across regional data aggregations.

Central scientific challenge: across the translational cycle
Most biomedical research disciplines are focused on a particular phase of the translational cycle: the Farr Institute demonstrated that a distinctive contribution of data science for health is that EHR and other sources of data 'in the wild' can link investigators across all phases of the translational research cycle. The Farr Institute made a start in the UK: the ambition, which HDR UK has taken on, is to constructively disrupt current models of evidence-based medicine, clinical practice and translational research, including the way that research is organised and funded.

Central scientific challenge: record linkages
The original funding call emphasised the importance of novel and sustainable record linkages. In Wales (SAIL), Scotland (IDRIS) and some English regions (Connected Health Cities) there are data linkage and trustworthy research environments that have fuelled numerous research outputs. For example, in Wales primary care data (including narrative) are linked to hospital admissions data, dispensed prescribing, blood laboratory values and a wide range of socio-economic data. This breadth and depth of linkages, and their sustained accessibility by researchers, have not emerged across larger populations such as England. In England the opportunities for developing a growing, sustainable environment for record linkages were severely curtailed by care.data and have only recovered in the regional devolved approaches such as the NHS England Local Health and Care Record Exemplars. In annotating Farr Institute research publications, we found variable clarity on reporting of record linkage and were not able to easily identify how many linkages had been reported which were new and which might be readily accessible to future researchers.

Demonstrating methods for tackling central scientific challenge
The Farr Institute transformed the UK's ability to bring non-consented individual-level health data into trustworthy environments and make them available for other researchers, based on specifically approved projects. Nonetheless, there remain many different data governance environments and processes for data access for research, with much room for harmonisation and streamlining. We demonstrate here how the Farr Institute published on over 100 EHR and administrative data sources; in some situations these were the first research use of these data. Despite the undoubted progress reported here, the EHR data sources reported represent a tiny proportion of available data. The Farr Institute made a start with establishing methods and platforms for EHR phenotyping, and there remains scope to develop a national online facility to integrate data, methods and investigators.

Novel interactions among scientists
We visualised an 'explosion' of co-author networks. This reflects the willingness of investigators to self-identify with the Farr Institute, as there was no monitoring of this practice at centre or national level, as well as extensive collaborations between those with and without Farr Institute funding. The top ten most cited

Training new skills
At master's, short-course and leadership levels the Farr Institute had a substantial effect on teaching and capacity development. A goal was to train 'hybrids' with new combinations of expertise at the intersection of traditional health and biomedical disciplines, computer science and analytics and software engineering. The Farr Institute directly funded course directors and lecturers on many of these courses, and some were directly marketed under the Farr Institute brand (e.g. some of the short courses). The unmet need for training and capacity development is hard to gauge. An example of relevant UK context is that the proportion of all NHS consultants (n=45,000) who are clinical academics is not only low (4%) but fell each year between 2000 and 2016. 31 The emerging field of data science for health might seek to reverse this trend.

Impact on health and healthcare
The Farr Institute carried out research underpinning policies and recommendations to change clinical and public health practice, and shaping government policy in health data management and digital health innovation. We provide here examples of specific research findings and their relation to changes in policy and recommendations. However, the Farr Institute had no central mechanism for identifying such influence (Table 4 is likely incomplete), nor for prospectively following research through policy recommendations to measure changes in health. In some cases Farr research may have impacts in later years; HDR UK might usefully establish a more systematic approach.

Impact on public engagement and public trust
In 2013 there were no national campaigns involving patients and the public in research on patient data. By ways' case studies, explaining to patients and public examples of the benefits of Farr Institute research (13,000 followers and subscribers), and the #datasaveslives campaign, which has generated more than This illustrates the challenge of transparent and publicly accessible attribution to a national institute.

'Data science of data science': towards a framework for evaluation
There are important limitations to this evaluation. First, there is a need for a framework by which HDR UK and other national research initiatives might be more rigorously evaluated; this could include a 'data science of data science' with more thorough impact analysis than our simple evaluation of peer-reviewed publications. Increasingly, key elements of the research process are computable: the Farr Institute helped develop the Research Object paradigm, making research assets (data, software, tools) discoverable mapping use with identifiers. 33,34 Such work was adopted in the US BD2K programme and the international FAIR (Findable, Accessible, Interoperable, Reusable) principles 35 and applied in research. 36 However, the Farr Institute had no prospective evaluation and no national curation of elements that are easily

Conclusion
In the UK, the Farr Institute played a significant role in beginning to grow the field of data science for health.
In 2013 there was little UK-wide co-ordination or visibility of research capabilities, including methods development or training in data science for health, and by 2018 this had been transformed. The importance of a national research institute in this field is evidenced by the UK's expanded commitment to HDR UK.

Farr researchers were core to the design and development of eMedLab, but it was funded by a separate MRC award.

Ms
The CLIMB infrastructure was designed and is managed by a Farr investigator (Thompson). eMedLab is a joint project with 6 institutions: UCL, QMUL, Crick, Sanger, KCL, EBI. The CLIMB project (Cloud Infrastructure for Microbial Bioinformatics) is a collaboration between Warwick, Birmingham, Cardiff, Swansea, Bath and Leicester Universities and The Quadram Institute Bioscience to develop and deploy a world-leading cyber-infrastructure for microbial bioinformatics, providing free cloud-based compute, storage, and analysis tools for academic microbiologists in the UK.

JISC SAFE SHARE
How to authorise researchers to remotely access and share data safely and securely, from their own project site with information governance rules, varying with sensitivity of the data.
A federated identity management system and a high assurance network overlay encrypted to National Cyber Security Centre (NCSC) standards.
https://www.jisc.ac.uk/safeshare 53
Farr researchers initiated the project with JISC and were involved in all stages of the project from requirements to evaluation of the service.
From the success of the Safe Share pilot project, JISC have added Safe Share to their service catalogue, providing service to ADRN and PSN users.

No
An architectural solution and Irdmp prototype for handling big imaging data within a safe haven were designed and implemented. Currently being extended to run on the National Safe Haven hosted by Edinburgh Performance Computing Centre (EPCC) and to provide an anonymised extract from SMI for a small exemplar research project.
To date ~1.5 million cases with an estimated total size of 81TB have been transferred. Solution has been used to provide data for 4 different consented research projects linking phenotypic data with routinely collected imaging data within the SH in HIC.

Chief Medical Officer Annual Report 'Generation Genome' the government has established the National Genomics Board chaired by the health minister, to implement recommendations