Converting BAM to Fastq - losing reads
3
5
6.7 years ago

Hi all,

I am trying to realign a whole genome BAM file from one reference genome to another. The reason for this is that I am interested in HLA regions, and the original reference genome does not include these regions. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.

I seem to be losing reads when converting from BAM to fastq. I have tried a number of ways to do this, including:

  • samtools fastq -1 <file1.fq> -2 <file2.fq> <input.bam>
  • bamToFastq -i <input.bam> -fq <file1.fq> -fq2 <file2.fq>
  • Following the process here:

In each case the number of reads in my output fastq file (counted using wc -l <file> / 4) is slightly less than the original BAM file (counted using samtools flagstat).

When using bamToFastq I get several errors like this:

*****WARNING: Query 6:1219:30638:3260 is marked as paired, but its mate does not occur next to it in your BAM file.  Skipping.

I suspect this is the cause of my read loss. Most of these seem to be in chromosome 6, which is my region of interest. I have tried using samtools fixmate, but still get this same error.

Any ideas would be greatly appreciated!

Many thanks

alignment • 11k views
0

Did you check some of the problematic reads on the original bam file? Something like:

samtools view file.bam | grep "6:1219:30638:3260"

It could be useful to add -n to grep to show the line number, especially for name-sorted files.

0

Good thought. The output below is for one of the read names that gives an error, from a sorted file. I am not great at interpreting these data, but perhaps the problem is that there is an odd number of records with this read name, so they cannot be paired properly? Do you know how I could fix this?

samtools view file.bam | grep -n "6:2224:32617:20858"
1748391:6:2224:32617:20858  2145    5   177974622   0   15H23M113H  22  16606471    0   CACCCACCCACACCCCCCCACAC FAFFKKA,,7A,77,,F<<<F,A AS:i:23 XS:i:23 SA:Z:16,86736939,+,52S26M73S,9,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0;  ci:i:1775658    MD:Z:23 NM:i:0  RG:Z:1-C42D3D1
1748392:6:2224:32617:20858  2145    12  130616201   1   31H20M100H  22  16606471    0   CCCACACCCACAACCCCACC    F<<<F,AK7A7,A,AFFA<<    AS:i:20 XS:i:19 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;   ci:i:1775658    MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
1748393:6:2224:32617:20858  2145    16  3539896 0   4H21M126H   22  16606471    0   ACCCCCCCCCCCACCCACCCA   <AAFAFAF<<AFAFFKKA,,7   AS:i:21 XS:i:21 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;12,130616201,+,31S20M100S,1,0;    ci:i:1775658    MD:Z:21 NM:i:0  RG:Z:1-C42D3D1
1748394:6:2224:32617:20858  97  16  86736939    9   52S26M73S   22  16606471    0   AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACCACCCCCACACACCCACACACCCACACAACTGGAGCCCAGCAAGCACCACCCGCCCGACCGCGAAGACAAGCCGAGGAGCAGAGCAGACACGAAAGAAGGG AA<A<AAFAFAF<<AFAFFKKA,,7A,77,,F<<<F,AK7A7,A,AFFA<<,F7,<(,,AAFF7FFK,F,7<,<,7,,,,,,,,777FF,,,,,77,,,,,((((((,,,((,(,,,,7,,,,7<7(,,,,,7,,,,,,,,,,,,,,,,,, AS:i:26 XS:i:23 SA:Z:5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0; ci:i:1775658    MQ:i:0  MS:i:637    MC:i:16606545   MD:Z:26 NM:i:0  RG:Z:1-C42D3D1
1748395:6:2224:32617:20858  145 22  16606471    0   76S20M55S   16  86736939    0   GTTATCAATCACACCCCATCGCCAGATCACCATTCTCAAACTATCCGTCTCCCAGTCTCTAATACATTGGCGTGGGTGCTGCTGCGTTCTGGGTGTCGCCTCTTTCTTGTTCTGCGCTGGGGGCCGCGTGTGATGTTTGGCGTGTTCCGGG ,,,,,,,,,,,,,,,,,,,(,,,,,,,,,,,,,,,A<7,,77,,,,,A,,,,,77,7,,,,,7,,,7,(((,7(7,,,,,(,7A,,,,,A,,,,(,(,,,,,,,,77F7,,(,,(((((,(,(,(,7,7,,,7,,7,,(,,,,7<,,,,,, AS:i:20 XS:i:20 ci:i:1775658    MQ:i:9  MS:i:2157   MC:i:86736887   MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
0

I don't know exactly how samtools flagstat works, but if it includes supplementary alignments (the records with flag 2145) in its total number of reads, then it is expected that your final fastq files contain fewer reads.
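
To make the comparison concrete, here is a minimal sketch that counts only primary records, i.e. those with neither the secondary (0x100) nor the supplementary (0x800) flag bit set. The function name count_primary is made up for illustration, and it assumes headerless SAM text on stdin, as produced by samtools view.

```shell
# Hypothetical helper: count primary records in SAM text read from stdin.
# A record is primary when neither bit 0x100 (secondary) nor
# bit 0x800 (supplementary) is set in the FLAG column (field 2).
count_primary() {
  awk -F'\t' '
    !/^@/ {
      flag = $2
      secondary     = int(flag / 256)  % 2
      supplementary = int(flag / 2048) % 2
      if (secondary == 0 && supplementary == 0) n++
    }
    END { print n + 0 }
  '
}

# Usage (assumes samtools is installed and input.bam exists):
#   samtools view input.bam | count_primary
# samtools can also do this filtering itself:
#   samtools view -c -F 0x900 input.bam
```

For paired data this number should be compared with the sum of the read counts in both fastq files (plus any singleton file), not with a single file.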

Do you know if these are DNAseq or RNAseq reads? How were they aligned?

0

I think you have cracked it - the supplementary alignments are being lost.

However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome. As I understand it, supplementary reads may indicate structural variation; these are cancer samples, so I would expect some structural variation. I don't want to lose this information on realigning my sample. The fact that a lot of these supplementary reads are in HLA regions suggests they could have a significant impact on my analysis.

Is it possible to extract to fastq and include these reads? I don't unfortunately have access to the original unaligned fastq files.

1

As long as you are recovering all unique read identifiers (including their origin R1/R2) that are present in your BAM file there is not much more you can do.

0

Could it be that you have secondary alignments in your lib_002_map_map.bam file? This could mess with the whole thing. You can check for secondary alignments using samtools flagstat.

0

Thanks but I don't think this is it - there are no secondary alignments identified by samtools flagstat

0

Can you try reformat.sh from BBMap suite instead of bam2fastq?

Something like: reformat.sh in=lib_002_mapped.sort.bam out1=lib_002_mapped.1.fastq out2=lib_002_mapped.2.fastq verifypaired=t primaryonly=t

Additional options you may want to try with original files:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f           Toss secondary alignments.  Set this to true for sam to fastq conversion.
4
6.7 years ago
h.mon 35k

I will summarize the discussion above as an answer:

Looking at the problematic reads (samtools view file.bam | grep -n "6:2224:32617:20858") revealed they are supplementary alignments.

"I think you have cracked it - the supplementary alignments are being lost. However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome."

You are right that supplementary alignments may represent structural variants, but a supplementary alignment is a (partial) copy of the primary read, at least in the example you selected. If you look at the records you grepped, all three supplementary alignments (flag 2145) are contained within the corresponding primary alignment (flag 97), so you are recovering all original reads with your procedure. The primary alignment is soft-clipped, so the full read sequence is present in its SAM record. Supplementary alignments are hard-clipped, so only a fragment of the original read is present (but I think there are BWA flags that may change this behaviour).
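
The containment can be checked by hand with a plain substring test. The sketch below uses the first 51 bases of the primary record (flag 97) and the three supplementary sequences from the grep output above; the function name is_contained is made up for illustration. It only works as-is because all four records here are on the forward strand. For a supplementary alignment on the opposite strand, the sequence would need to be reverse-complemented first.

```shell
# Hypothetical helper: succeed if string $2 occurs inside string $1.
is_contained() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# First 51 bases of the primary record (flag 97, CIGAR 52S26M73S).
primary51="AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACC"

# Hard-clipped sequences of the three supplementary records (flag 2145).
for supp in \
  "CACCCACCCACACCCCCCCACAC" \
  "CCCACACCCACAACCCCACC" \
  "ACCCCCCCCCCCACCCACCCA"
do
  if is_contained "$primary51" "$supp"; then
    echo "contained:     $supp"
  else
    echo "NOT contained: $supp"
  fi
done
```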

There is a discussion on the samtools GitHub issues page about what supplementary reads are.

1
6.7 years ago

A flag of 2145 indicates a supplementary (not secondary) alignment; I'd filter those out first.
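
For anyone who wants to see why, here is a small sketch that splits a SAM flag into its bits. The function name decode_flag is made up for illustration; the bit values and their order follow the SAM specification, and the labels are the usual short names for them.

```shell
# Hypothetical helper: print the names of the bits set in a SAM FLAG value.
# Bits are listed from 0x1 upwards, as defined in the SAM specification.
decode_flag() {
  flag=$1
  out=""
  bit=1
  for name in PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE \
              READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY; do
    if [ $(( flag & bit )) -ne 0 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$(( bit * 2 ))
  done
  printf '%s\n' "$out"
}

decode_flag 2145   # 2145 = 2048 + 64 + 32 + 1
decode_flag 97     # 97   = 64 + 32 + 1
```

So 2145 is the same as 97 plus the supplementary bit (2048): paired, mate on the reverse strand, first in pair. If samtools is installed, samtools flags 2145 should give an equivalent breakdown.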

0
6.7 years ago

At least with samtools fastq you seem to be forgetting -s, which is where the missing reads would be.

0

Thanks Devon, but that's not it. When I add the -s option it returns an empty file. And this doesn't explain the behaviour of bamToFastq.
