I have a paired end sequencing data of 300 bp long reads with insert size of about 550 bp. I mapped them against the reference obtaining a bam file. Is there an easy way to retrieve a single 550 bp long read for each of the pairs from a bam file?
I have a paired end sequencing data of 300 bp long reads with insert size of about 550 bp. I mapped them against the reference obtaining a bam file. Is there an easy way to retrieve a single 550 bp long read for each of the pairs from a bam file?
First, when you select for a certain insert size, what you get is a distribution of insert sizes, centered around the size you aimed - so, some pairs will not merge because insert size was larger than 600, and the merged pairs will range from ~400-575 in range, typically.
There may be some simple way of retrieving single merged reads directly from a bam file, but I am not aware of any.
If you have the original fastq files, you may merge overlapping pairs with FLASH or BBMerge (part of BBTools suite of packages), for example. If you do not have the original fastq files, you may use Reformat (also from BBTools) to convert the bam into fastqs, then again use FLASH or BBMerge to merge overlapping pairs.
I am not sure if you can sequences of library size. But you can get sequences of read size. For getting reads as fastq, one can run below code (copy/pasted from here)?
Example code:
samtools view -uf64 x.bam |samtools bam2fq - |gzip >x_1.fq.gz
samtools view -uf128 x.bam |samtools bam2fq - |gzip >x_2.fq.gz
or this (copy/pasted from here)
htscmd bamshuf -uOn 128 aln_reads.bam tmp | htscmd bam2fq -a - | gzip > interleaved_reads.fq.gz
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I think it will help a lot if you explain what your problem is and what you want to achieve in the end, not how you can do something you think is what you want/the solution. See XY-problem. <-- learnt that from John
I will try to be as explicit as possible.
Given:
Expected:
Graphical representation of the problem:
The problem with FLASH is that it merges reads in this way if the ends of the read are quality trimmed:
What do you want to happen for read pairs in which read1 and read2 map to different members of your gene family of interest?