Hello All,
I am working on paired-end RNA-Seq data (C. elegans). I wanted to extract the unmapped reads, so I mapped the reads to the reference genome using TopHat2 and the STAR aligner. I then merged the two BAM files and ran Picard to remove duplicates. Then I extracted the paired-end reads using bamToFastq. Now, when I check the read IDs with grep | sort | uniq -c, I still find each of them multiple times rather than just once. Both the forward and reverse reads are included in the extracted R1 file and R2 file.
e.g.
<<<<<<<<<<<<<<<<<@SRR504339.999983/1>>>>>>>>>>>>>>
@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A
@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A
@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC
@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC
<<<<<<<@SRR504339.999983/2>>>>>>>>>>>>
@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A
@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A
@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC
@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC
I am not sure how I should remove the duplicates and retain only a single entry for each read.
Any suggestions?
Thanks!
So many questions here:
Why did you use TopHat2? The authors of the TopHat paper recommend not using the software anymore.
Why did you use two aligners? Isn't one enough?
Would you please show us the Picard command line you used?
Hi @Bastien,
I used more than one aligner to compare them and extract as many unmapped reads as possible!
I used the following picard command:
That's a very strange way to do it :) I would rather use STAR only, setting some hard filters through the STAR options (scoring, alignment, seeding...) if you want something specific.
To answer your question: did you check the flags of your reads in your BAM files? If they are unmapped, maybe this is the answer: https://github.com/broadinstitute/picard/pull/1018
Your duplicate reads are not only duplicates, they also have the same names (like a copy/paste, due to your BAM merge). I don't know how Picard deals with this kind of read.
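If the goal is just to keep one copy of each identical record in the extracted FASTQ files, something like the following could work as a stopgap. This is only a rough sketch and not what Picard does: it assumes plain 4-line FASTQ records, treats two records as duplicates when both the header line and the sequence match, and the helper names (`read_fastq`, `unique_records`) and the demo reads are made up for illustration.

```python
def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield tuple(block)
            block = []


def unique_records(records):
    """Keep the first copy of each (header, sequence) pair, preserving input order."""
    seen = set()
    for rec in records:
        key = (rec[0], rec[1])
        if key not in seen:
            seen.add(key)
            yield rec


# Hypothetical toy input: one read name, two distinct sequences, each present twice.
demo = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/1", "TTGA", "+", "HHHH",
    "@read1/1", "TTGA", "+", "HHHH",
]

kept = list(unique_records(read_fastq(demo)))
for rec in kept:
    print("\n".join(rec))
```

Note that this only removes exact copies. In the example from the question, each read name would still be left with two different sequences, since both mates ended up in each file; that has to be sorted out separately. The `seen` set also holds every key in memory, which may be a problem for very large files.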
Yes, I checked the flags; they are 133, 69, 141 and 77!
Well, use this tool: http://www.samformat.info/sam-format-flag
And you'll see that your reads are unmapped.
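For anyone who prefers to decode the flags without the web tool, here is a small sketch. The bit values follow the FLAG field of the SAM specification; the function name `decode_flag` and the short labels are just my own choices for illustration.

```python
# Bit meanings of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The four flag values reported in this thread.
for flag in (133, 69, 141, 77):
    print(flag, decode_flag(flag))
```

All four values have the unmapped bit (0x4) set, and none of them has the duplicate bit (0x400) set. 141 and 77 additionally have the mate-unmapped bit, so both mates of those pairs are unmapped.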
Yes, I know they are unmapped reads, I checked it before as well. But I am looking for an answer to why Picard remove duplicates isn't removing the duplicates in this case.
What happens to duplicate mapped reads?
You can check this link: https://github.com/alvaralmstedt/Tutorials/wiki/Separating-mapped-and-unmapped-reads-from-libraries
Hey bioinfo89, you should have posted this as a comment to your other answer. By posting this as a new answer, it breaks the flow of the thread and makes it somewhat confusing for people arriving here. Leave it for now, though. Another Moderator may see my comment and then move this answer to the top-level.
Did that, Kevin.
Thanks for sharing the link, @bioinfo89
Hello bioinfo89,
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!