Question

How To Extract And Quantify Duplicated RNA-Seq Reads?

1

12.4 years ago

Pals ★ 1.3k

I am working on RNA-Seq data. FastQC shows that the data has a duplication level of >65%. Can anyone tell me how to extract the duplicated reads and quantify them? I would also like to rank the duplicated reads by how often they occur.

Thanks!

rna-seq • 4.4k views

ADD COMMENT • link updated 12.4 years ago by Zev.Kronenberg 12k • written 12.4 years ago by Pals ★ 1.3k

0

Why do you want to extract and quantify them? They are already quantified by FastQC, right? Are you worried about the big number? Then you should read "Duplicated Reads In Rna-Seq Experiment", where it is explained that in RNA-seq many of the reads come from ribosomal RNA and highly expressed genes, resulting in a high duplication level in your reads.

ADD REPLY • link 12.4 years ago by Irsan ★ 7.8k

0

Just curious to see the actual reads that are duplicated thousands of times.

ADD REPLY • link 12.4 years ago by Pals ★ 1.3k

1

gunzip -dc yourFastq.gz | awk '{if(NR%4==2)print $0}' | sort | uniq -c

ADD REPLY • link 12.4 years ago by Irsan ★ 7.8k

0

Thank you very much, Irsan. I had to slightly modify your trick: awk '{if((NR-2)%4==0)print $0}' :-)

ADD REPLY • link 12.4 years ago by Pals ★ 1.3k

1

Cool, I changed the comment accordingly.

ADD REPLY • link 12.4 years ago by Irsan ★ 7.8k

0

Unfortunately, the sorting did not go well. I want the sequences ranked by how often they are repeated, whereas this command sorts them alphabetically.

ADD REPLY • link 12.4 years ago by Pals ★ 1.3k

1

Print the reads and then pipe them through: | sort | uniq -c

ADD REPLY • link 12.4 years ago by Zev.Kronenberg 12k

1

Thanks again, it's finally done: gunzip -dc yourFastq.gz | awk '{if(NR%4==2)print $1}' | sort | uniq -c | sort -g

ADD REPLY • link 12.4 years ago by Pals ★ 1.3k
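For anyone landing here later, the pipeline above can be wrapped up so that the most duplicated sequence comes first and singletons are dropped. This is only a sketch: the function name rank_duplicate_reads is made up for illustration, it assumes a gzipped FASTQ with exactly four lines per record (no wrapped sequences), and it uses a reverse numeric sort on the count column instead of sort -g.

```shell
#!/usr/bin/env bash
# rank_duplicate_reads: hypothetical helper name, not from the thread.
# Assumes a gzipped FASTQ with exactly 4 lines per record.
# Prints "count<TAB>sequence" for every sequence seen more than once,
# most frequent first.
rank_duplicate_reads() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2' |                    # keep the sequence line of each record
        sort |                                 # bring identical sequences together
        uniq -c |                              # prefix each distinct sequence with its count
        awk '$1 > 1 { print $1 "\t" $2 }' |    # drop sequences that occur only once
        sort -k1,1nr -k2,2                     # highest count first, ties alphabetical
}
```

Something like rank_duplicate_reads yourFastq.gz | head -n 20 would then list the 20 most duplicated sequences. Note that this counts exact sequence matches only, which is not necessarily how FastQC estimates its duplication level.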

0

Glad you got it working.

ADD REPLY • link 12.4 years ago by Zev.Kronenberg 12k

score 5 · Accepted Answer · 2013-02-14

5

12.4 years ago

Zev.Kronenberg 12k

After marking duplicates (with your favorite program), you can filter on the SAM flag with samtools to pull out the reads you want.

Just set the include flag, samtools view -f 0x400, where the flag bit is defined as:

0x400 PCR or optical duplicate

ADD COMMENT • link 12.4 years ago by Zev.Kronenberg 12k
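To make it concrete what -f 0x400 selects, here is a rough awk equivalent that works on SAM text: it keeps alignment records whose FLAG (column 2) has bit 0x400 (decimal 1024) set. This is a sketch for illustration only; the function name duplicate_sam_records is made up, and in practice samtools view -f 0x400 on the duplicate-marked file is the simpler route.

```shell
#!/usr/bin/env bash
# duplicate_sam_records: hypothetical helper name, not from the thread.
# Reads SAM text on stdin and prints the alignment records whose FLAG
# (column 2) has the 0x400 bit (decimal 1024, "PCR or optical duplicate") set.
# Header lines, which start with "@", are skipped.
duplicate_sam_records() {
    awk -F '\t' '!/^@/ && int($2 / 1024) % 2 == 1'
}
```

Assuming a duplicate-marked file called marked.bam (a placeholder name), samtools view -h marked.bam | duplicate_sam_records | cut -f10 | sort | uniq -c | sort -k1,1nr would rank the sequences of the flagged reads, and samtools view -c -f 0x400 marked.bam would simply count them.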