Are there any tools/software (I have already used FASTQC) to determine RNASeq library complexity or duplication levels? I am also trying to use Picard EstimateLibraryComplexity for the same purpose.
Concerning duplication levels, you can check them with a bash one-liner like this:
f=file.fastq; awk 'NR%4 == 2 { print $1 }' "$f" | sort | uniq -c | sort -gr > "dupl_statistics_${f}" &
This will create a file in which each row shows a sequence preceded by the number of its occurrences. You can then visualise how complex your library is: how many sequences are unique, present twice, or present a thousand times (which you cannot infer from the FastQC report, as it puts all sequences present more than 10 times into one column).
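If you want to turn that file into a duplication-level histogram, a minimal Python sketch is below. It assumes the `count sequence` format produced by `uniq -c` above; the function name and file path are just illustrative.

```python
from collections import Counter

def duplication_histogram(path):
    """Return {occurrence_count: number_of_distinct_sequences}
    from a file of 'count sequence' rows produced by uniq -c."""
    hist = Counter()
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields:                      # skip any blank lines
                hist[int(fields[0])] += 1   # first field is the uniq -c count
    return dict(hist)

# e.g. hist = duplication_histogram("dupl_statistics_file.fastq")
# hist[1] is the number of unique sequences, hist[2] those seen twice, etc.
```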
If you have bioawk installed, you might use
bioawk -c fastx '{print $seq}'
instead of
awk 'NR%4 == 2 { print $1 }'  # assumes that four rows correspond to one read (header, sequence, delimiter, quality information)
David DeLuca's RNA-SeQC calculates the duplication rate. If you are comparing libraries, you should downsample them to the same number of reads. It is also helpful to remove rRNA reads first when comparing multiple libraries: rRNA removal rates can vary by library, and almost all rRNA reads are duplicates (such short molecules with so many reads), so they skew your results a little.
For library complexity, I like to use rarefaction curves. I usually run my own, but if you have read counts per gene you can use my program Scotty and it will run them for you. I think it expects multiple samples, so you might have to hack it a little: make a column for "Gene" and then two columns of reads per gene. Email me if you have trouble.
There is also code in my MATLAB repository on GitHub.
The math is easy, though. You take your final read counts per gene. Then, to estimate what you would have seen with fewer reads, do a binomial sampling for each gene with probability equal to the fraction of reads kept. So if you have 10 million reads, do a binomial sampling with a probability of 0.1 for each gene to simulate 1 million reads, and count how many genes have at least one read. Then repeat for 2 million reads, and so on, and connect all your points.
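The binomial subsampling described above can be sketched like this in Python (function and variable names are illustrative, not from Scotty itself):

```python
import numpy as np

def rarefaction_curve(counts, total_reads, depths, seed=0):
    """Expected number of detected genes at each subsampled depth.

    counts      -- final read counts per gene
    total_reads -- total reads in the full library
    depths      -- subsampling depths to evaluate (each <= total_reads)
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    detected = []
    for depth in depths:
        p = depth / total_reads              # fraction of reads kept
        sub = rng.binomial(counts, p)        # binomial sample per gene
        detected.append(int((sub > 0).sum()))  # genes with >= 1 read
    return detected

# e.g. with 10M total reads, evaluate at 1M, 2M, ... and plot
# depths against the detected-gene counts to get the curve.
```

Plotting `depths` against the returned values gives the rarefaction curve; if it is still rising steeply at full depth, the library has complexity you have not yet sampled.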
This is Scotty. http://euler.bc.edu/marthlab/scotty/scotty.php
You can see a rarefaction curve in the help section, the sixth picture down.
We look at library complexity a lot. I'm not entirely sure we have it sorted out yet, but those are some of the ideas I have.
FastQC is quite popular and can be easily incorporated into a pipeline. It also works great for other types of NGS data.
Thanks. I will give it a try and let you know if I face any problems.