Intra-exon motif correlations as a proxy measure for mean per-tile sequence quality data in RNA-Seq
File(s)
RNA_Seq_IVT.pdf (587.52 KB)
Working paper
Author(s)
Alnasir, Jamie
Shanahan, Hugh
Type
Working Paper
Abstract
Given the wide variability in the quality of NGS data submitted to public repositories, it is essential to identify methods that can perform quality control on these datasets when additional quality control data, such as mean tile data, is missing. This is particularly important because such datasets are routinely deposited in public archives that now store data at an unprecedented scale. In this paper, we show that correlations between counts of reads corresponding to pairs of motifs separated by specific distances on individual exons correspond to mean tile data in the datasets we analysed, and can therefore be used when mean tile data is not available. As test datasets we use the H. sapiens IVT (in vitro transcribed) dataset of Lahens et al., and a D. melanogaster dataset comprising wild-type and mutant samples from Aerts et al. The intra-exon motif correlations as a function of both GC content parameters are much higher in the IVT-Plasmids sample (the control, which underwent no mRNA selection) than in the other RNA-Seq samples, which did undergo mRNA selection: either ribosomal depletion (IVT-Only) or PolyA selection (IVT-polyA, wild-type, and mutant). There is considerable degradation of similar correlations in the mutant samples from the D. melanogaster dataset. This matches the available mean tile data gathered for these datasets. We observe that extremely low correlations are indicative of bias of technical origin, such as flow cell errors.
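The idea of correlating read counts at pairs of motifs on the same exon can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (each exon as a sequence plus a per-base read-count list), the function names find_motif, motif_pair_counts and pearson, the example motifs, and the choice of Pearson correlation are all assumptions made for illustration only.

```python
import math


def find_motif(seq, motif):
    """Return start positions of every (possibly overlapping) occurrence of motif."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def motif_pair_counts(exons, motif_a, motif_b, spacing):
    """Collect (count at motif_a, count at motif_b) for motif pairs on the same exon.

    exons: iterable of (sequence, depth) tuples, where depth[i] is a
    hypothetical read count at base i of that exon.
    spacing: distance in bases between the two motif start positions.
    """
    pairs = []
    for seq, depth in exons:
        b_starts = set(find_motif(seq, motif_b))
        for a in find_motif(seq, motif_a):
            b = a + spacing
            if b in b_starts:
                pairs.append((depth[a], depth[b]))
    return pairs


def pearson(pairs):
    """Pearson correlation of paired counts; NaN if it is undefined."""
    n = len(pairs)
    if n < 2:
        return float("nan")
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Toy data (made up): three exons sharing one sequence, with different depths.
seq = "ACGTTTTTGGCA"  # "ACGT" starts at 0, "GGCA" starts at 8
exons = [
    (seq, [10] + [0] * 7 + [5] + [0] * 3),
    (seq, [20] + [0] * 7 + [10] + [0] * 3),
    (seq, [30] + [0] * 7 + [15] + [0] * 3),
]
pairs = motif_pair_counts(exons, "ACGT", "GGCA", spacing=8)
r = pearson(pairs)
```

Under these assumptions, a value of r near 1 means that read counts at the two motifs rise and fall together across exons, while values near 0 would correspond to the degraded correlations the abstract associates with technical bias.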
Date Issued
2020-08-24
Citation
2020
Publisher
bioRxiv
Copyright Statement
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.