Question

Is it possible to get a merged fasta/fastq reads from a bam file?

0

Entering edit mode

9.1 years ago

a.s.c • 0

I have a paired end sequencing data of 300 bp long reads with insert size of about 550 bp. I mapped them against the reference obtaining a bam file. Is there an easy way to retrieve a single 550 bp long read for each of the pairs from a bam file?

bam read merge • 2.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by a.s.c • 0

2

Entering edit mode

I think it will help a lot if you explain what your problem is and what you want to achieve in the end, not how you can do something you think is what you want/the solution. See XY-problem. <-- learnt that from John

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by Irsan ★ 7.8k

0

Entering edit mode

I will try to be as explicit as possible.

Given:

reference: 10 fragments of genes of one family, all 550 - 570 bp long
sequencing data: 300 bp paired end sequences of PCR product 550-570 bp long - depending on which gene was sequenced.
mapping: 1 bam file

Expected:

one fasta file with 550 - 570 single end read sequences merged based on the information from bam file locations.

Graphical representation of the problem:

reference:   ATGGTAGGCTCCAGTCAGTCAGTCCGGTAGCAGTAGCAAC
read 1:      ATGGTAaGCTCaAGTCAGTCAGT
read2:       -----------------GTCAGTCCGtTAGCAGTtGCAAC
desired:     ATGGTAaGCTCaAGTCAGTCAGTCCGtTAGCAGTtGCAAC

The problem with FLASH is that it merges reads in this way if the ends of the read are quality trimmed:

ATGGTAaGCTCaAGTCCGtTAGCAGTtGCAAC

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by a.s.c • 0

0

Entering edit mode

What do you want to happen for read pairs in which read1 and read2 map to different members of your gene family of interest?

ADD REPLY • link 7.9 years ago by d-cameron ★ 2.9k

Ram · Answer 1 · 2015-10-13

First, when you select for a certain insert size, what you get is a distribution of insert sizes, centered around the size you aimed - so, some pairs will not merge because insert size was larger than 600, and the merged pairs will range from ~400-575 in range, typically.

There may be some simple way of retrieving single merged reads directly from a bam file, but I am not aware of any.

If you have the original fastq files, you may merge overlapping pairs with FLASH or BBMerge (part of BBTools suite of packages), for example. If you do not have the original fastq files, you may use Reformat (also from BBTools) to convert the bam into fastqs, then again use FLASH or BBMerge to merge overlapping pairs.

Ram · Answer 2 · 2015-10-13

I am not sure if you can sequences of library size. But you can get sequences of read size. For getting reads as fastq, one can run below code (copy/pasted from here)?

Example code:

samtools view -uf64 x.bam |samtools bam2fq - |gzip >x_1.fq.gz
samtools view -uf128 x.bam |samtools bam2fq - |gzip >x_2.fq.gz

or this (copy/pasted from here)

htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz