Are there any tools/software (I have already used FASTQC) to determine RNASeq library complexity or duplication levels? I am also trying to use Picard EstimateLibraryComplexity for the same purpose.
Concerning duplication levels, you can check them with a bash one-liner like this:
f=file.fastq; awk 'NR%4 == 2 { print $1 }' "$f" | sort | uniq -c | sort -gr > "dupl_statistics_${f}" &
This will create a file in which each row shows a sequence preceded by the number of its occurrences. You can then visualise how complex your library is: how many sequences are unique, present twice, or present a thousand times (which you cannot infer from the FastQC report, as it puts all sequences present more than 10 times into one column).
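If you want to turn that file into a duplication-level histogram, a minimal Python sketch is below. It assumes the `count sequence` format produced by `uniq -c` above; the function name and file path are just illustrative.

```python
from collections import Counter

def duplication_histogram(path):
    """Return {occurrence_count: number_of_distinct_sequences}
    from a file of 'count sequence' rows produced by uniq -c."""
    hist = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:                      # skip any blank lines
                hist[int(fields[0])] += 1   # first field is the uniq -c count
    return dict(hist)

# e.g. hist = duplication_histogram("dupl_statistics_file.fastq")
# hist[1] is the number of unique sequences, hist[2] those seen twice, etc.
```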
If you have bioawk installed, you might use
bioawk -c fastx '{print $seq}'
instead of
awk 'NR%4 == 2 { print $1 }'  # assumes that four rows correspond to one read (header, sequence, delimiter, quality information)
David DeLuca's RNA-SeQC calculates the duplication rate. If you are comparing libraries, you should downsample them to the same number of reads. It is also helpful to remove rRNA reads first when comparing multiple libraries: rRNA removal rates can vary by library, and almost all rRNA reads are duplicates (such short molecules with so many reads), so they skew your results a little.
For library complexity, I like to use rarefaction curves. I usually run my own, but if you have read counts per gene you can use my program Scotty and it will run them for you. I think it expects multiple samples, so you might have to hack it a little: make a column for "Gene" and then two columns of reads per gene. Email me if you have trouble.
There is also code in my MATLAB repository on GitHub.
The math is easy, though. You take your final read counts per gene. Then, to estimate what you would have seen with fewer reads, do a binomial sampling for each gene with probability equal to the fraction of reads kept. So if you have 10 million reads, do a binomial sampling with a probability of 0.1 for each gene to simulate 1 million reads, and count how many genes have at least one read. Then repeat for 2 million reads, and so on, and connect all your points.
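The binomial subsampling described above can be sketched like this in Python (function and variable names are illustrative, not from Scotty itself):

```python
import numpy as np

def rarefaction_curve(counts, total_reads, depths, seed=0):
    """Expected number of detected genes at each subsampled depth.

    counts      -- final read counts per gene
    total_reads -- total reads in the full library
    depths      -- subsampling depths to evaluate (each <= total_reads)
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    detected = []
    for depth in depths:
        p = depth / total_reads              # fraction of reads kept
        sub = rng.binomial(counts, p)        # binomial sample per gene
        detected.append(int((sub > 0).sum()))  # genes with >= 1 read
    return detected

# e.g. with 10M total reads, evaluate at 1M, 2M, ... and plot
# depths against the detected-gene counts to get the curve.
```

Plotting `depths` against the returned values gives the rarefaction curve; if it is still rising steeply at full depth, the library has complexity you have not yet sampled.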
This is Scotty. http://euler.bc.edu/marthlab/scotty/scotty.php
You can see a rarefaction curve in the help section, the sixth picture down.
We look at library complexity a lot. I'm not entirely sure we have it sorted out yet, but those are some of the ideas I have.
FastQC is quite popular and can be easily incorporated into a pipeline. It also works great for other types of NGS data.
Thanks. I will give it a try and let you know if I face any problems.