If I have to report the data quality of RNA-seq and ChIP-seq experiments in the form of a table or figure, what are my options? I have seen http://www.cell.com/cms/attachment/2021769037/2041633798/mmc1.pdf but it reports coverage, which I find a confusing term. I can always state that XXX raw reads were sequenced, XXXXX reads aligned, XXXX reads aligned uniquely, and the quality score is XXX. Is there a better way to present parameters that back up the claim that our data is of good quality, summarized as a table or figure? I also could not find many publications showing such quality metrics.
Here's a list of some metrics that you might want to consider:
For RNA-seq and ChIP-seq:
Duplication. What percentage of your reads are duplicates?
Percent mapped. Number mapped divided by number sequenced.
Coverage. On average, how many reads cover one base in the genome?
Median mapping quality. Each mapped read has a mapping quality score (MAPQ) describing how confident the aligner is that the read was placed at the correct position in the reference. It might be informative to compare median mapping quality across multiple libraries.
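As a rough illustration of how the general metrics above could be computed from simple counts, here is a minimal Python sketch. The function name `summarize_library` and all input values are hypothetical; in practice the counts would come from your aligner or from a tool such as Picard. It assumes duplication is reported as a percentage of mapped reads and that mean coverage is approximated as mapped bases divided by genome size.

```python
from statistics import median


def summarize_library(total_reads, mapped_reads, duplicate_reads,
                      read_length, genome_size, mapping_qualities):
    """Compute a few per-library QC metrics from raw counts.

    Assumptions (not from any specific tool):
    - duplicates are counted among mapped reads
    - all reads have the same length
    """
    return {
        # Number mapped divided by number sequenced, as a percentage.
        "percent_mapped": 100.0 * mapped_reads / total_reads,
        # Share of mapped reads flagged as duplicates.
        "percent_duplicates": 100.0 * duplicate_reads / mapped_reads,
        # Average number of reads covering one base of the genome.
        "mean_coverage": mapped_reads * read_length / genome_size,
        # Median of the per-read mapping quality scores.
        "median_mapq": median(mapping_qualities),
    }


# Toy numbers, for illustration only.
metrics = summarize_library(
    total_reads=1000,
    mapped_reads=800,
    duplicate_reads=200,
    read_length=100,
    genome_size=40000,
    mapping_qualities=[0, 30, 60, 60, 60],
)
print(metrics)
```

One row of such values per library makes a compact supplementary table.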
For RNA-seq:
Number of genes detected with 1 or more reads.
Number of genes detected in 2 or more libraries.
5' or 3' bias. Where do reads map along the length of the gene? There tends to be bias.
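The two gene detection counts above can be derived from a table of read counts per gene and library. The sketch below assumes a hypothetical dictionary `counts` mapping each gene name to a list of counts, one per library; the gene names and numbers are invented for illustration.

```python
def genes_detected_per_library(counts, min_reads=1):
    """For each library, count genes with at least `min_reads` reads."""
    n_libraries = len(next(iter(counts.values())))
    return [
        sum(1 for per_gene in counts.values() if per_gene[i] >= min_reads)
        for i in range(n_libraries)
    ]


def genes_detected_in_n_libraries(counts, min_libraries=2, min_reads=1):
    """Count genes detected in at least `min_libraries` libraries."""
    return sum(
        1
        for per_gene in counts.values()
        if sum(c >= min_reads for c in per_gene) >= min_libraries
    )


# Hypothetical count table: gene -> reads in library 1, 2, 3.
counts = {
    "geneA": [10, 0, 3],
    "geneB": [0, 0, 0],
    "geneC": [0, 5, 0],
    "geneD": [1, 2, 0],
}

print(genes_detected_per_library(counts))     # genes with >= 1 read, per library
print(genes_detected_in_n_libraries(counts))  # genes seen in >= 2 libraries
```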
For ChIP-seq:
Percent within peaks. After calling peaks with your preferred method, determine how many reads are overlapping the peaks.
Coverage within peaks. On average, how many reads cover one base in the peak regions?
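To make the two ChIP-seq metrics above concrete, here is a small Python sketch that takes reads and peaks as `(chromosome, start, end)` intervals. The function name `peak_metrics` and the toy intervals are hypothetical. It assumes half-open coordinates and non-overlapping peaks, and it uses a simple nested loop, so it is only suitable for small examples; for real data an interval tool would be the better choice.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def peak_metrics(reads, peaks):
    """Percent of reads overlapping peaks and mean coverage within peaks.

    Assumes peaks do not overlap each other.
    """
    reads_in_peaks = 0
    covered_bases = 0
    for chrom, start, end in reads:
        bases = sum(
            overlap(start, end, p_start, p_end)
            for p_chrom, p_start, p_end in peaks
            if p_chrom == chrom
        )
        if bases > 0:
            reads_in_peaks += 1
        covered_bases += bases
    total_peak_length = sum(end - start for _, start, end in peaks)
    return {
        "percent_in_peaks": 100.0 * reads_in_peaks / len(reads),
        "mean_peak_coverage": covered_bases / total_peak_length,
    }


# Toy intervals, for illustration only.
peaks = [("chr1", 100, 200)]
reads = [
    ("chr1", 90, 140),   # overlaps the peak by 40 bases
    ("chr1", 150, 200),  # lies fully inside the peak
    ("chr1", 300, 350),  # outside the peak
    ("chr2", 100, 150),  # different chromosome
]
print(peak_metrics(reads, peaks))
```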
You can get started with the following tools:
Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.
picardmetrics is a Bash script that runs up to 12 Picard tools on a BAM file and collates all of the output files into a single table with up to 90 different metrics. It also creates the two Picard files required for CollectRnaSeqMetrics.
RNA-SeQC is a Java program which computes a series of quality control metrics for RNA-seq data.
RSeQC is a Python package that provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data.
updated 5.0 years ago by Ram • written 9.1 years ago by Kamil