Hi all,
I am trying to realign a whole genome BAM file from one reference genome to another. The reason for this is that I am interested in HLA regions, and the original reference genome does not include these regions. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.
I seem to be losing reads when converting from BAM to fastq. I have tried a number of ways to do this, including:
samtools fastq -1 < file1.fq > -2 < file2.fq > < input.bam >
bamToFastq -i < input.bam > -fq < file1.fq > -fq2 < file2.fq >
- Following the process here:
In each case the number of reads in my output fastq file (counted using wc -l <file> / 4
) is slightly less than the original BAM file (counted using samtools flagstat
).
When using bamToFastq
I get several errors like this:
*****WARNING: Query 6:1219:30638:3260 is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
I suspect this is the cause of my read loss. Most of these seem to be in chromosome 6, which is my region of interest. I have tried using samtools fixmate
, but still get this same error.
Any ideas would be greatly appreciated!
Many thanks
Did you check some of the problematic reads on the original bam file? Something like:
It could be useful to add
-n
to grep to check the line number, specially for the name-ordered files.Good thought. The output is below for one of the coordinates which gives an error, for a sorted file. I am not great at interpreting these data, but perhaps the problem here is that there are an odd number of reads mapping to these coordinates so they cannot be appropriately paired? Do you know how I could fix this?
I don't know exactly how
samtools flagstat
works, but if it is reporting supplementary alignments (the reads with2145
flag) on its total number of reads, then it is correct to have a smaller number of reads on your final fastq files.Do you know if these are DNAseq or RNAseq reads? How were they aligned?
I think you have cracked it - the supplementary alignments are being lost.
However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome. As I understand it, supplementary reads may indicate structural variation; these are cancer samples so I would expect some structural variation. I don't want to lose this information on realigning my sample. The fact that a lot of these supplementary reads are in HLA regions suggest they could have a significant impact on my analysis.
Is it possible to extract to fastq and include these reads? I don't unfortunately have access to the original unaligned fastq files.
As long as you are recovering all unique read identifiers (including their origin R1/R2) that are present in your BAM file there is not much more you can do.
Could it be that you have secondary alignments in your
lib_002_map_map.bam
file ? This could mess with the whole thing. You can check for secondary alignments usingsamtools flagstat.
Thanks but I don't think this is it - there are no secondary alignments identified by samtools flagstat
Can you try
reformat.sh
from BBMap suite instead of bam2fastq?Something like:
reformat.sh in=lib_002_mapped.sort.bam out1=lib_002_mapped.1.fastq out2=lib_002_mapped.2.fastq verifypaired=t primaryonly=t
Additional options you may want to try with original files: