First large-scale quantification study of DNA preservation in insects from natural history collections using genome-wide sequencing

1. Insect declines are a global issue with significant ecological and economic ramifications. Yet, we have a poor understanding of the genomic impact these losses can have. Genome-wide data from historical specimens have the potential to provide baselines of population genetic measures to study population change, with natural history collections representing large repositories of such specimens. However, an initial challenge in conducting historical DNA data analyses is to understand how molecular preservation varies between specimens. 2. Here, we highlight how Next-Generation Sequencing methods developed for studying archaeological samples can be applied to determine DNA preservation from only a single leg taken from entomological museum specimens, some of which are more than a century old. An analysis of genome-wide data from a set of 113 red-tailed bumblebee Bombus lapidarius specimens, from five British museum collections, was used to quantify DNA preservation over time. Additionally, to improve our analysis and further enable future research, we generated a novel assembly of the red-tailed bumblebee genome. 3. Our approach shows that museum entomological specimens are comprised of short DNA fragments with mean lengths below 100 base pairs (BP), suggesting


| INTRODUC TI ON
Determining changes in genome diversity is a central component of biodiversity monitoring (Jensen et al., 2022).Quantifying such change can reveal past and ongoing demographic processes contributing to our understanding of species and population genetic health and resilience (Morin et al., 2021).Furthermore, it can help determine how populations respond to environmental pressures (Pinsky et al., 2021), assess species adaptive potential and identify signatures of selection representing key adaptations (Colgan et al., 2022).
However, to accurately determine the extent and rate of changes in genetic diversity, we must first establish historical baselines (Díezdel-Molino et al., 2018), which we lack for many taxonomic groups.
Yet, to date we have a poor understanding of how genetic diversity has changed for the vast majority of wild insect pollinator populations over the past century, a period of both dramatic, large-scale land use change that has reduced pollinator diversity in the UK (Ollerton et al., 2014) and estimated global genetic diversity loss for wild species (Leigh et al., 2019).One opportunity to explore changes over this time period is to create genetic baselines from historical collections.
Museum collections are increasingly being used to provide information relevant to the conservation of biodiversity, above and beyond their function in systematic biology (Raxworthy & Smith, 2021).A challenge, however, is the DNA preservation within pinned and open-dried specimens, as post-mortem degradation reduces the quantity and quality of DNA retained in collections of historic specimens (Heintzman et al., 2014).A possible solution is to utilise Next-Generation Sequencing (NGS) approaches developed for retrieving and analysing ancient DNA (aDNA).These techniques have the potential to be an invaluable tool in determining historical genetic baselines for insect species, and subsequently provide an understanding of evolutionary and demographic processes that would not otherwise be possible.However, to best utilise and safeguard these historical collections, we must understand the extent to which age and/or storage conditions impact the rate of DNA degradation.
While there are no absolute standards for the measure of DNA preservation in museum specimens, several commonly used metrics to summarise the length, amount and authenticity of the sequence reads obtained: 1. Endogenous DNA content-the proportion of the sequenced reads that align to the reference genome.Here we refer to the endogenous percentage that aligns to the reference post quality filtering.This is an important factor in determining the required sequencing depth in a collections-based genomics project, and the ability to predict this in advance will inform the overall financial cost of the project.It is important to note that this metric is also impacted by the reference genome used, with increasing evolutionary distance between the target and the reference reducing the likelihood of mapping sequenced reads.
2. DNA fragmentation-mean read length is used as a proxy for the mean length of DNA molecules in the sample and a common method to determine the extent of fragmentation.Genome sequencing runs are typically described in terms of the number of base pairs sequenced, assuming that each read is the maximum length that can be sequenced.Therefore, the shortfall in information content per read is also an important consideration in determining project cost.Additionally, mean read length is an overestimate of the true mean fragment length, as the shortest DNA fragments are less likely to be recovered during DNA extraction and short sequencing reads are not aligned to the reference genome to prevent spurious alignment.However, as DNA fragmentation is a random, time-dependant process following an exponential decay model (Lindahl & Andersson, 1972; Lindahl a rapid and large-scale post-mortem reduction in DNA fragment size.After this initial decline, however, we find a relatively consistent rate of DNA decay in our dataset, and estimate a mean reduction in fragment length of 1.9 bp per decade.
The proportion of quality filtered reads mapping to our assembled reference genome was around 50%, and decreased by 1.1% per decade.
4. We demonstrate that historical insects have significant potential to act as sources of DNA to create valuable genetic baselines.The relatively consistent rate of DNA degradation, both across collections and through time, mean that population-level analyses-for example for conservation or evolutionary studies-are entirely feasible, as long as the degraded nature of DNA is accounted for.
aDNA, Bombus, collection genomics, DNA degradation, entomological collections, historical DNA, museum specimen, pollinators & Nyberg, 1972), the distribution of fragment length frequencies in the sample can be fitted as an exponential curve (Adler et al., 2011;Deagle et al., 2006;Schwarz et al., 2009).Here, we also explore whether the 'lambda' parameter (λ), which describes the shape of the exponential distribution, provides a less biased estimate of the mean fragment length.
3. DNA damage-the extent of chemical modification of the DNA molecules in the sample, with the most common form being hydrolytic deamination of cytosine molecules to form uracil, which is subsequently sequenced as thymine (Binladen et al., 2006;Brotherton et al., 2007;Gilbert, Hansen, et al., 2003;Gilbert, Willerslev, et al., 2003;Hansen et al., 2001;Sawyer et al., 2012;Schwarz et al., 2009).This common process occurs primarily towards 5′-ends of fragmented DNA molecules, and the impact on sequence accuracy is minimised (as in this dataset) by treatment with uracil-DNA glycosylase (UDG) to cleave the DNA at uracils (Hofreiter et al., 2001).This process leaves a proportion of deaminated sites at 3′-and 5′-terminal positions (Briggs et al., 2010), which can be measured to infer general patterns, and explore the extent to which this treatment is required.
4. Molecular preservation-previous studies have been able to detect the residual impact of nucleosome packaging on the degradation pattern of DNA extracted from historical and archaeological samples, due to the observation of periodic 'spikes' in the quantity of DNA at specific lengths (Kistler et al., 2017;Pedersen et al., 2014).In this paper, we use the strength of this pattern, the histone periodicity index, as an alternative measure of DNA preservation.
5. Complexity-the proportion of reads within the dataset that represent novel sequence information, and are not simply PCRamplification duplicates of other reads.Absolute measurement of complexity is made difficult by variation of library construction efficiency, due to batch variation in laboratory procedures and amplification efficiency (Head et al., 2014).However, a relative estimate can be made where a set of samples have been prepared in the same way.Complexity can be quantified either by measuring coverage for a given sequencing effort, to provide an estimate of the number of times unique reads capture the whole genome, or by estimating the required sequencing depth which would exhaust the information content of a library.If the molecular complexity of a given sequencing library is exhausted before the required depth-of-coverage is achieved, new libraries or extracts will be required, again increasing the project cost.
Here, we use these different measures of molecular preservation to explore DNA decay within a time series dataset of museum collection specimens of the red-tailed bumblebee Bombus lapidarius, collected in the UK over the last 130 years.This large dataset enables us to overcome some of the issues inherent in previous studies, as this is a large dataset generated in a consistent manner, by a single laboratory, element and species.This dataset, coupled with a novel genome assembly for B. lapidarius, allows us to consider differences in age, and storage location (museum collection), while investigating variation in endogenous DNA content (1), DNA damage and preservation (2-4) and sequence complexity (5).

| Museum specimen DNA extraction and library preparation
A version of the extraction protocol described in Dabney et al. (2013) was performed on each leg separately, with the following modifications.In the lysis stage, 180 μl Qiagen ATL Buffer for tissue lysis and 20 μl Proteinase K were added to each leg and heated at 56°C for 24 hr.DNA purification followed Dabney et al. (2013) with the modification stated in Brace et al. (2019), replacing the Zymo-Spin V column binding apparatus with the extender assembly from the High Pure Viral Nucleic Acid Large Volume Kit (Roche).DNA extracts were treated with USER enzyme; 20 or 30 μl of extract with 5 μl of USER enzyme for 3 hr at 37°C.Double indexed double stranded libraries were built following Meyer and Kircher (2010) with AmpliTaq Gold polymerase for the PCR amplification with PCR cycles varying from 13 to 20.All sequencing was performed on an Illumina NextSeq 500 (75 bp PE) at the NHM London sequencing centre.

| Quantification of endogenous DNA content
The endogenous percentage of total sequencing reads was calculated from the collapsed reads alignment after filtering for duplicates and mapping quality.Additionally, we calculated the 'raw' endogenous percentage of total sequencing reads prior to any filtering.

| DNA fragmentation: Read length distribution
For the original collapsed alignments, the read length distributions were extracted with SAMtools (v1.12), from which the mean and the median read length were calculated.Additionally, the insert size distributions of non-collapsed read alignments were extracted (SAMtools stats) to visualise how many endogenous reads are potentially removed at the collapsing stage during adapter trimming.A cut-off of 200 bp maximum insert size was applied to remove spurious alignment results.Finally, to achieve a more accurate measure of DNA fragmentation, we combined the distributions of the read length (collapsed alignment) and insert size (non-collapsed alignment), referred to here as the combined length.The mean combined length was also calculated using this combined length distribution.

| DNA fragmentation: λ parameter estimation
Read length distributions can be quite different to the true fragment length distributions of DNA present in the sample, due to limitations in DNA extraction library construction efficiency, and post-sequencing data processing (Kistler et al., 2017).In a standard aDNA pipeline, longer reads that do not overlap cannot be merged and those shorter than 25 bp are not aligned, resulting in artefactually low frequencies in the distribution.To determine whether these processes impact all samples equally, and to better estimate the mean fragment size in the libraries we applied a method developed by Kistler et al. (2017).This method attempts to mitigate these artefacts by estimating the λ from the portion of the distribution that best fits an exponential distribution, via maximum likelihood.For each sample, the read length density distributions, collapsed alignment only, were used to estimate the λ parameter using the protocol developed by Kistler et al. (2017), using the 'lambdaCalc.pl'and 'readLengthLambda.R' scripts supplied in their Supplementary Information.First, the distributions are checked for peaks at the highest read lengths, which would indicate reads longer than the maximum readable.Then the λ parameter is estimated for every size range going backwards from the longest read, selecting the value with the highest likelihood.The λ parameter was then used to estimate the expected 'true' mean fragment length (μ) for each distribution, using the formula μ = 1/λ.

| DNA damage quantification
The DNA extracts in this study were enzymatically treated to reduce the impact of deamination on the resulting data.The proportions of 5' C-T and 3′ G-A transitions in the first base position were estimated using mapDamage 2.0 (Jónsson et al., 2013).

| Molecular preservation: Histone periodicity index
As a result of the intermittent protection of lengths of DNA from fragmentation by histones, the resulting read length density distribution would show multiple local peaks as some fragment lengths would be more common than others.To visualise this, read length and combined length distributions were plotted.The protocol developed by Kistler et al. (2017), using the 'histoneCalc.R' script supplied in their Supplementary Information, was used to estimate the intensity of any identified local peaks in the distribution of read lengths from the collapsed alignment only, the histone periodicity index.S3).Each individual is represented by one library built with 20 μl of DNA extract, though PCR cycles vary (Table S3).BAM files were filtered for mapping quality (q 30), with no filtering for duplicates, and to mitigate variation in genome coverage, subsampled for 1 million reads using SAMtools (v1.12).Preseq failed to run for five samples due to lack of sufficient variation within the subsampled BAM.The proportion of the total expected available unique reads that were sequenced from the sample was then estimated by dividing the number of unique endogenous reads by the (preseq) estimated maximum number of unique reads in the library.
Model comparison was achieved via likelihood ratio tests using the lrtest function in the lmtest package (Kuznetsova et al., 2017).Parameters explored by multiple regression analysis include: endogenous DNA %; DNA fragmentation via the mean combined length and λ parameter of the collapsed alignment; DNA damage via the proportion of 5′ first base C-T transitions; molecular preservation via the histone periodicity index; and library complexity via the preseq-estimated maximum number of unique reads in the library.Multiple linear regression models used year of specimen collection and museum collection as explanatory variables, allowing the model intercept to vary by museum to detect variation between museum collections while accounting for variation in time since specimen death.An interaction between museum and year of collection was not investigated to test for a temporal pattern across specimens.The Natural History Museum London (NHM) was used as the reference group for intercept comparisons as they constituted the greatest sample size (n = 40) and specimen age range .Temporal analyses assumed linear relationships with sample age as DNA deamination and fragmentation by depurination occur at constant rates (Briggs et al., 2007;Lindahl & Nyberg, 1974;Lindahl & Andersson, 1972;Lindahl & Nyberg, 1972).As environmental variables are hypothesised to affect the rate of damage and fragmentation in ways which are poorly understood, further models were constructed with an exponential relationship by log-transforming each parameter (Table S4).

| Reference genome assembly
Approximately 15× coverage of Hifi reads were generated (Table S2).
Following the assessment of the primary assembly with Kraken2, blobtools and MitoFinder, 157 contaminant and mitochondrial contigs (5.7 Mb) were removed.The resulting assembly spans 1,473 contigs, has a contig N50 of 2.3 Mb and a longest contig of 14 Mb.The assembly represents 97.1% of the hymenoptera_odb10 BUSCO set complete and single copy, with 0.4% duplicated, 0.4% fragmented and 2.1% missing.

| Endogenous DNA content
Endogenous DNA recovery varied across the 113 B. lapidarius specimens (Figure 1).However, the majority of specimens contained relatively high levels of endogenous DNA post filtering, with a dataset median of 48.20% and a range of 0.12-60.68%(Table S3).Only two specimens contained <5% endogenous DNA.As the sequencing effort for three samples was relatively low, the 110 samples with >1 million read pairs sequenced were included in further analyses.
Endogenous percent positively correlates with year of collection (adj. R 2 = 0.10; p < 0.01).For every year a specimen aged, the percentage of endogenous DNA decreased by 0.11%, equating to a 12.43% reduction over the 113 years the study spans (Figure S4).

| DNA fragmentation
To examine variation in DNA fragmentation, we first derived the combined density distribution of read length (collapsed alignment; Figure S3) and insert (non-collapsed alignment) lengths for each sample (Figure 2a).We then compared the variation in the mean combined length (range 44-87 bp; Table 1) through time and across different collections (Figure 2b).Mean combined length positively correlates with year of collection (adj.R 2 = 0.84; p < 0.01).Every year since collection, mean combined length decreased by 0.19 bp, equating to a decrease of 21 bp over the 113 years.The model intercept was consistent across museums (p > 0.05) except for the World Museum, which had a 9.41 bp lower intercept than the reference, NHM (p < 0.01; Table S4).
We derived the λ parameter of the read length distribution (collapsed alignment only, Figure S3) and used this to calculate the mean value for the fitted distribution, equivalent to the estimated true mean fragment length.Estimated mean fragment length positively correlated with mean combined length (adj.R 2 = 0.85; p < 0.01; Figure 3a; Table S4).λ was negatively correlated with collection year (Adj.R 2 = 0.61; p < 0.01), indicating an increase in fragmentation with time since collection (Figure 3b).λ increased by 0.00046 per year after collection, with an expected increase of 0.052 over the 113 years the study spans, equivalent to a 19.28 bp decrease in estimated mean fragment size.

| DNA deamination
The proportion of 5' C-T deaminated sites in the first base position was estimated for 110 samples and plotted against specimen age (Figure 4), which showed no relationship (p = 0.961).However, the model did demonstrate significant variation between museums after accounting for year of collection (F[4,104] = 5.41; p < 0.01; adj.R 2 = 0.13; Table S4).

| Molecular preservation
We measured the extent of periodicity in our dataset (Figure 5), detecting histone fragmentation bias in the majority of samples (102/110).Histone periodicity index was not significantly influenced by collection year (p = 0.07).The model did, however, demonstrate significant variation between museums after accounting for year of collection (F[4,104] = 4.566; p < 0.01; adj.R 2 = 0.18), with the NHM demonstrating a significantly higher intercept than the other museums (Table S4).

| Library complexity
The maximum number of unique reads in the library was successfully estimated using preseq from a one million read subsample in 69 specimens (Figure 6).The maximum number of unique reads is positively correlated with collection year (estimate = 619,700; p = 6.84 × 10 −3 ), although there is a high degree of unexplained variability in the datasets (adj.R 2 = 0.06).

| DISCUSS ION
Here we highlight an aDNA methodological approach to enable sufficient DNA extraction to conduct NGS shotgun sequencing from just a single insect [bumblebee] leg for specimens as old as 113 years.
We demonstrate the potential to use museum specimens to investigate genome-level changes over time, and our findings suggest even older specimens could be studied.By studying a large number of individuals, and using a species-specific reference genome, we were able to quantify the rate and extent of change in several important parameters, including decreases in endogenous DNA content, DNA fragmentation and the estimated total unique reads as a proxy for complexity.We also fitted an exponential function to the read length distribution to estimate the mean fragment length, and  Most of the DNA fragments we observe in this study are below 100 bp, even in the more recent samples from 2004.This suggests that DNA fragmentation must occur rapidly after death in insect specimens that are part of museum collections (which tend to be dry-pinned).DNA fragmentation after this initial rapid reduction, however, appears to occur at a much more gradual rate, and this was broadly consistent across different collections.This highlights that specimen age remains an important factor in the extent of DNA degradation, and by extension the availability of information in museum specimens.Similar fragmentation patterns have been demonstrated for museum beetle specimens (Heintzman et al., 2014) and explain why PCR amplification approaches with long insert sizes between primers have struggled to amplify DNA from historical insect specimens (van Houdt et al., 2010;Ugelvig et al., 2011;Andersen & Mills, 2012).We find that the estimated true DNA fragment length and mean combined length are highly correlated, demonstrating that deriving λ provides an accurate estimate for the true mean fragment length even when only collapsed reads are used, as standard in aDNA bioinformatic pipelines.
Base deamination, especially C-T transitions at 5′ fragment ends, accumulates post mortem and is a common diagnostic of aDNA (Briggs et al., 2007;Sawyer et al., 2012).As a result, UDG is commonly used in aDNA studies to aid mapping of damaged reads (Briggs et al., 2010;Hofreiter et al., 2001) and was applied to samples in this study.This removes uracils from the sample leaving only a small proportion (Briggs et al., 2010), which explains the limited evidence of post mortem deaminated sites in this study.We note that it is possible that estimations of fragmentation were impacted by the action of UDG cleaving deaminated sites, affecting read and insert length distributions.However, our specimens are relatively young to have accumulated base deamination, and deamination is more common at strand ends (Sawyer et al., 2012), where UDG would have a negligible impact on read length.It is therefore unlikely that this would have a significant effect on estimated fragmentation rates, and so the trends identified here are likely due to post mortem fragmentation.
Overall, the results demonstrate consistency and predictability of DNA preservation across the five museums.This indicates current curatorial practises are favourable for degraded DNA survival, with most variation in endogenous DNA and fragmentation explained by differences in specimen age.Variation between individuals and museum collections may be accounted for by differences in curatorial techniques both during the specimens' time within the collection (museum-specific signatures) and prior to museum collection deposition.For example, variation in histone signal intensity and deamination was not significant over time, but between collections, suggesting that storage conditions may impact the extent of histone-associated fragmentation bias.Kistler et al. (2017) suggested that the histone periodicity index is temperature dependent.
The presence of sample outliers in all parameters demonstrates the unique conditions experienced by individual specimens impacts DNA degradation with variable effects on damage parameters.However, the consistency between museums in the retrieval of endogenous DNA demonstrates the feasibility of using specimens from multiple collections to allow large-scale genomic studies utilising NGS sequencing.Furthermore, of a practical note, these data F I G U R E 5 Periodicity varies between different museums but not with specimen age.Histone periodicity index is a measure of the intensity of local peaks in the read length distribution (Figure S3) due to periodic protection of DNA from fragmentation due to association with histones (Kistler et al., 2017).Boxes span the first and third quartiles; inner lines = median values; red crosses = mean values.F I G U R E 6 Estimated total number of unique reads decreases with time after death.(a) Preseq-estimated total number of unique reads from 69 Bombus lapidarius samples against year of collection.Best fit line calculated from multiple regression of estimated total number of unique reads as a dependent of collection year and museum with the equation y ~ 6.20 × 10 5 × − 1.13 × 10 9 .Model intercept calculated for the reference group (NHM).Data generated from specific sequencing runs (1-4) are indicated; samples sequenced on specific runs were processed identically (Table S3).If we are to conserve insect abundance and diversity, and the ecosystem services that they provide, in the face of ongoing environmental changes, we need to understand past processes to predict (and hopefully mitigate) future declines.The erosion of genetic diversity within populations is especially worrying given that this diversity will be fundamental to the ability of insects to adapt to climate change in particular.State-of-the-art aDNA sequencing methods have the potential to reveal insights into past genetic changes that were inconceivable just two decades ago and we hope that in the future they become a standard part of the toolkit for conservation biology.
This study used an individual leg per specimen for 113 pinned B. lapidarius drones (haploid males) collected from 1891 to 2004 housed in the collections of five British institutions; Natural History Museum (NHM London), Oxford Museum of Natural History, Tullie House Museum and Art Gallery (Carlisle), World Museum (Liverpool) and National Museums Scotland (Edinburgh).All specimens were treated as degraded DNA specimens with all pre-PCR laboratory work taking place in the aDNA laboratory in the NHM London.
Figure S1) collected on 23 August 2019 at the Earlham Institute (TG179075; Lat: 52.622282, Long: 1.2190789).The specimen was snap frozen on dry ice following collection and stored at −80°C until DNA extraction.Extractions were conducted using the Qiagen MagAttract HMW DNA kit, with modifications (see Supplementary Information).DNA concentration was measured using the Qubit HS kit and the Nanodrop was used to measure extraction purity.The distribution of HMW DNA fragment sizes was measured using the Agilent FEMTO Pulse instrument.HiFi library preparation and Pacific Biosciences sequencing were carried out by Genomics Pipelines at the Earlham Institute (see Supplementary Information).
2041210x, 2023, 2, Downloaded from https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13945by Test, Wiley Online Library on [22/02/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons LicenseHifi reads were extracted from the raw Pacific Biosciences output by the Earlham Institute core bioinformatics group using the Pacific Biosciences SMRTlink pipeline (v10.1.0.119588).Prior to assembly, Hifi reads were trimmed for adapter sequences using Cutadapt The model intercept was consistent across museums (p > 0.40) except for the World Museum, which had an intercept significantly lower than the NHM London reference by 8.17% (p < 0.01).Assuming this relationship remains constant, the model formula y ~ 0.11x − 156.63 suggests that it may be possible to sequence endogenous DNA from pinned insects collected in any year after 1488, which includes all known collections.

F I G U R E 2
Mean combined length decreases with time after death.(a) Combined length (read length + insert size) density distributions for all 110 Bombus lapidarius samples.Length is given in base pairs (bp).Colour represents the year of collection.(b) Mean combined length against year of specimen collection.Best fit line calculated from multiple regression of mean combined length (Table S4) as a dependent of collection year and museum with the equation y ~ 0.19x − 302.58.Model intercept calculated for the reference group (NHM).
Downloaded from https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13945by Test, Wiley Online Library on [22/02/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License The range and median values across all samples across the dataset for the three fragmentation parameters calculated for each specimen.Mean fragment length estimate = 1/λ.All values given in base pairs (bp) and rounded to the nearest integer F I G U R E 3 λ decreases with time since death, and the corresponding estimated mean fragment length is highly correlated with mean read length.(a) Estimated mean fragment length against mean combined length.Best fit line calculated from linear regression (TableS4) of estimated mean fragment length (μ) as a dependent of mean combined length, with the equation y ~ 0.72x − 26.12.Lengths are in base pairs.(b) Variation in the λ parameter of the read length distribution (FigureS3) within and between different museums.Boxes span the first and third quartiles; inner lines = median values; red crosses = mean values.

F
Deamination varies between different museums but not with time since death.Data show variation in the proportion of first base position sites with 5′ C to T deamination within and between different museums.Boxes span the first and third quartiles; inner lines = median values; red crosses = mean values.Downloaded from https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13945by Test, Wiley Online Library on [22/02/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License provide a useful guide to museum curators when assessing the likely success of projects requiring destructive sampling of specimens.
AdapterRemovalv2.2.4 (Schubert et al., 2016)was implemented to trim adapter sequences, collapse overlapping read pairs with a minimum 11 bp overlap, filter for a minimum read length of 25 bp and trim Ns and low-quality bases.Trimmed collapsed reads were aligned to the Bombus_lapidarius_EIv1 genome, using bwa aln (v0.7.12-r1039,Li et al., 2009; -l 1024 -n 0.01 -o 2 -q 15 2041210x, 2023, 2, Downloaded from https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13945byTest, Wiley Online Library on [22/02/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)onWiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License2.10| Library complexity The preseq (v3.1.2,Daley & Smith, 2013) function 'lc_extrap' (-B) was used to (1) estimate the maximum number of unique reads available in the library by extrapolating the number of unique aligned reads at greater sequencing depths and (2) estimate the expected number of unique reads per 10 million aligned reads.For this analysis, a reduced dataset of 74 individuals was curated to reduce confounding factors such as number of libraries sequenced per sample, quantity of DNA input into library and number of PCR cycles (Table Downloaded from https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13945by Test, Wiley Online Library on [22/02/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License y ~ 0.11x − 156.63.Model intercept calculated for the reference group (NHM).