How to extract unique paired end reads from a merged BAM file?
1
0
Entering edit mode
6.6 years ago
bioinfo89 ▴ 60

Hello All,

I am working on RNA-Seq paired end data (C.elegans). I wanted to extract unmapped reads, so i mapped the reads to reference genome using tophat2 and STAR aligner. I then merged the two BAM files and used picard remove duplicates. Then I extracted the paired end reads using bamToFastq. Now when check the read IDs, I still find them in multiples rather then just once when I do grep | sort | uniq -c. Both the forward and reverse reads are included in extracted R1 file and R2 file.

eg.

 <<<<<<<<<<<<<<<<<@SRR504339.999983/1>>>>>>>>>>>>>>

@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/1
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

@SRR504339.999983/1
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

<<<<<<<@SRR504339.999983/2>>>>>>>>>>>>

@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/2
TCAGATCCGAGAGGTCTCGATCGCTCCACCGCGGCGTGCGCCGTGTAAGCATAGGGCGCAACTTCGACCAGTATCAGGGCCCGTGATT
+
HEEG9C@?FFGFHHFBDDF<FFHIGBGG>@BBEF<@=B8@5@BD@B9:>CAC>CCB9>5-5&+:>?<?BDB@@DCA@CB8?>>9@0?A

@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

@SRR504339.999983/2
GGTTTAAGTCTAGGACCGGACAAACGGTACGATTACGTCTTAGCATCAAGTGGTGCCTGATCTCGTGATAGACAACGTGAGAGTATTT
+
FHFHICACHFHBGGGHEFH:?@GGGFF7=@BHIG;@EHGHD@DE@DCD@CECCDDCCDC<C@@@BDD5>:ACCDCB@?B8?@A@3>BC

I am not sure how should I remove the duplicates and retain only single entry for reach read.

Any suggestions?

Thanks!

RNA-Seq • 3.1k views
ADD COMMENT
1
Entering edit mode

So many questions here :

Why did you use Tophat2 ? Authors of Tophat paper recommend not to use the software anymore.

Why did you use 2 aligners ? One is not enought ?

Would you please show us the picard command line you used ?

ADD REPLY
0
Entering edit mode

Hi @Bastien,

I used more than one aligner to compare and extract maximum unmapped reads!

I used the following picard command:

$picard_dup I=unmapped_merged.sorted.bam O=unmapped_merged.sorted.rmdup.bam M=unmapped_merged_rmdup REMOVE_DUPLICATES=true AS=true VALIDATION_STRINGENCY=LENIENT
ADD REPLY
0
Entering edit mode

That's a very strange way to do :) I would rather use STAR only, with setting up some hard filter on STAR options (scoring, alignment & seeding...), if you want something specific.

To answer your question, did you check in your bam files at your reads's flags ? if there are unmapped maybe this is the answer : https://github.com/broadinstitute/picard/pull/1018

Your duplicate reads are actually not only duplicate but have the same names (like a copy/paste, due to your bam merge). I don't know how picard deal with this kind of reads.

ADD REPLY
0
Entering edit mode

Yes I checked the flags, they are 133, 69, 141 and 77!

ADD REPLY
0
Entering edit mode

Well, use this tool : http://www.samformat.info/sam-format-flag

And you'll see that your reads are unmapped

ADD REPLY
0
Entering edit mode

Yes, I know they are unmapped reads, I checked it before as well. But I am looking for an answer to why Picard remove duplicates isn't removing the duplicates in this case.

ADD REPLY
0
Entering edit mode

What append to duplicate mapped reads ?

ADD REPLY
2
Entering edit mode

Hey bioinfo89, you should have posted this as a comment to your other answer. By posting this as a new answer, it breaks the flow of the thread and makes it somewhat confusing for people arriving here. Leave it for now, though. Another Moderator may see my comment and then move this answer to the top-level.

ADD REPLY
1
Entering edit mode

Did that Kevin.

ADD REPLY
1
Entering edit mode

Thanks for sharing the link @ bioinfo89

ADD REPLY
1
Entering edit mode

Hello bioinfo89,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY
1
Entering edit mode
6.6 years ago
bioinfo89 ▴ 60

A found a simple solution! Used seqtk subseq extract all the unique unmapped reads!

ADD COMMENT
1
Entering edit mode

could you please post seqtk solution for future reference?

ADD REPLY
2
Entering edit mode

summary of the solution is:

  1. Use samtools to get mapped and unmapped sams using -F and -f tags
  2. Get read names from the first column of sams
  3. Use read names (from above step) to extract reads from fastq files, for mapped and unmapped. For this seqtk subseq can be used.
ADD REPLY
0
Entering edit mode

You may find seqkit useful ..

refer the becnhmarking - https://bioinf.shenwei.me/seqkit/#benchmark

ADD REPLY

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6