11.9 years ago
Pals
I am working on an RNA-Seq dataset. FastQC shows that the data has a duplication level of >65%. Can anyone tell me how I can extract the duplicated reads and quantify them? I would also like to rank those duplicated reads.
Thanks!
Why do you want to extract and quantify them? They are already quantified by FastQC, right? Are you worried about the big number? Then you should read the thread "Duplicated Reads In Rna-Seq Experiment", where it is explained that in RNA-seq many of the reads come from ribosomal RNA and highly expressed genes, resulting in a high duplication level in your reads.
Just curious to see the actual reads that are duplicated thousands of times.
gunzip -dc yourFastq.gz | awk '{if(NR%4==2)print $0}' | sort | uniq -c
Thank you very much, Irsan. I had to slightly modify your trick: awk '{if((NR-2)%4==0)print $0}' :-)
Cool, I changed the comment
Unfortunately, the sorting did not go well. I want the sequences ranked by how often they are repeated, whereas this command sorts them alphabetically.
print the reads and then | sort | uniq -c
Thanks again, it's finally done: gunzip -dc yourFastq.gz | awk '{if(NR%4==2)print $1}' | sort | uniq -c | sort -g
Glad you got it working.
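For anyone landing here later, the final pipeline can be wrapped up roughly as below. This is only a sketch under a few assumptions: the FASTQ has exactly four lines per record (no wrapped sequences), the function name rank_duplicates and the toy file are made up for illustration, and "duplicated" means an identical sequence string seen more than once. It sorts by count in descending order so the most repeated read comes first, and drops reads that occur only once.

```shell
# Hypothetical toy input so the sketch runs on its own; use your real file instead.
toy="${TMPDIR:-/tmp}/toy_duplicates.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n@r5\nTTTT\n+\nIIII\n@r6\nGGGG\n+\nIIII\n' \
    | gzip -c > "$toy"

# rank_duplicates is an illustrative helper name, not part of any tool.
# Assumes strict 4-line FASTQ records, so the sequence is line 2 of every 4.
rank_duplicates() {
    gunzip -c "$1" \
        | awk 'NR % 4 == 2' \
        | sort \
        | uniq -c \
        | sort -k1,1nr \
        | awk '$1 > 1 { print $1 "\t" $2 }'   # keep only sequences seen more than once
}

# Prints "count<TAB>sequence", most duplicated first.
rank_duplicates "$toy"
```

On a real file you would call it as rank_duplicates yourFastq.gz | head -n 20 to see the top 20. Note that sorting all sequences of a large library can need a lot of temporary disk space.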