Rnaseq Library Quantification
4
1
Entering edit mode
11.2 years ago
komal.rathi ★ 4.1k

Are there any tools/software (I have already used FASTQC) to determine RNASeq library complexity or duplication levels? I am also trying to use Picard EstimateLibraryComplexity for the same purpose.

rnaseq library • 4.4k views
ADD COMMENT
2
Entering edit mode
11.2 years ago

Concerning duplication levels, you can check them with bash script like this:

f=file.fastq; cat $f | awk ' NR%4 == 2 { print $1 }' | sort | uniq -c | sort -gr >>dupl_statistics_${f} &

This will create file where in every row you will see sequence preceded by number of its occurencies. You can then visualise how complex your library is - how many sequences are unique or present two times or thousand times.. (which you cannot infer from FASTQC report, as it puts all sequences present more than 10 times into one column)

If you have bioawk installed, you might use

bioawk -c fastx '{print $seq}'

instead of

awk ' NR%4 == 2 { print $1 }' #assumes that four rows correspond to one read (header, sequence, delimiter, quality information)
ADD COMMENT
2
Entering edit mode
6.8 years ago
h.mon 35k

I know this thread is quite old, but dupRadar is missing from the answers, and it is really helpful for quality control of PCR duplication on RNAseq.

ADD COMMENT
1
Entering edit mode
11.2 years ago
Michele Busby ★ 2.2k

David Deluca's RNA Seq QC calculates duplication rate. If you are comparing libraries you should downsample your libraries to the same number of read. It is helpful to remove rRNA reads first, too, if you are comparing multiple libraries. rRNA molecule removal rates can vary by library and almost all rRNA reads are duplicates (so short with so many reads) so they skew your results a little.

For library complexity, I like to use rarefaction curves. I usually run my own but if you have read counts per gene you can use my program Scotty and it will run them for you. I think it expects multiple samples so you might have to hack it a little so you have a column for "Gene" and then put two columns for reads per gene. email me if you have trouble.

There is also code in my matlab repository thing on github.

The math is easy though. You get your final read counts per gene. Then, to get what it would have been with a smaller number of reads you can do a binomial sampling for each gene with the probability being whatever fraction of reads you had. So if you have 10 million reads, do a binomial sampling with a probability of 0.1 for each gene at 1 million reads and see how many genes have at least one read. Then two million reads. Then connect all your points.

This is Scotty. http://euler.bc.edu/marthlab/scotty/scotty.php

You can see a rarefaction curve in the help section, the sixth picture down.

We look a library complexity a lot. I'm not entirely sure we have it sorted yet but those are some of the ideas I have.

ADD COMMENT
0
Entering edit mode

Thanks. I will give it a try and let you know if I face any problems.

ADD REPLY
0
Entering edit mode
11.2 years ago

FastQC is quite popular and can be easily incorporated into a pipeline. It's also works great for other types of NGS data.

ADD COMMENT
0
Entering edit mode

I have used FastQC but I want to compare the results with other tools available. I should have mentioned it in the question.

ADD REPLY

Login before adding your answer.

Traffic: 2414 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6