Question

Converting BAM to Fastq - losing reads

5

6.7 years ago

adampennycuick ▴ 140

Hi all,

I am trying to realign a whole genome BAM file from one reference genome to another. The reason for this is that I am interested in HLA regions, and the original reference genome does not include these regions. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.

I seem to be losing reads when converting from BAM to fastq. I have tried a number of ways to do this, including:

samtools fastq -1 <file1.fq> -2 <file2.fq> <input.bam>
bamToFastq -i <input.bam> -fq <file1.fq> -fq2 <file2.fq>
Following the process here:

In each case the number of reads in my output fastq file (counted using wc -l <file> / 4) is slightly lower than the number reported for the original BAM file (counted using samtools flagstat).

When using bamToFastq I get several errors like this:

*****WARNING: Query 6:1219:30638:3260 is marked as paired, but its mate does not occur next to it in your BAM file.  Skipping.

I suspect this is the cause of my read loss. Most of these seem to be in chromosome 6, which is my region of interest. I have tried using samtools fixmate, but still get this same error.

Any ideas would be greatly appreciated!

Many thanks

alignment • 11k views

ADD COMMENT • link updated 6.7 years ago by h.mon 35k • written 6.7 years ago by adampennycuick ▴ 140

0

Did you check some of the problematic reads on the original bam file? Something like:

samtools view file.bam | grep "6:1219:30638:3260"

It could be useful to add -n to grep to show the line number, especially for name-sorted files.

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Good thought. The output is below for one of the coordinates which gives an error, for a sorted file. I am not great at interpreting these data, but perhaps the problem here is that there are an odd number of reads mapping to these coordinates so they cannot be appropriately paired? Do you know how I could fix this?

samtools view file.bam | grep -n "6:2224:32617:20858"
1748391:6:2224:32617:20858  2145    5   177974622   0   15H23M113H  22  16606471    0   CACCCACCCACACCCCCCCACAC FAFFKKA,,7A,77,,F<<<F,A AS:i:23 XS:i:23 SA:Z:16,86736939,+,52S26M73S,9,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0;  ci:i:1775658    MD:Z:23 NM:i:0  RG:Z:1-C42D3D1
1748392:6:2224:32617:20858  2145    12  130616201   1   31H20M100H  22  16606471    0   CCCACACCCACAACCCCACC    F<<<F,AK7A7,A,AFFA<<    AS:i:20 XS:i:19 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;   ci:i:1775658    MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
1748393:6:2224:32617:20858  2145    16  3539896 0   4H21M126H   22  16606471    0   ACCCCCCCCCCCACCCACCCA   <AAFAFAF<<AFAFFKKA,,7   AS:i:21 XS:i:21 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;12,130616201,+,31S20M100S,1,0;    ci:i:1775658    MD:Z:21 NM:i:0  RG:Z:1-C42D3D1
1748394:6:2224:32617:20858  97  16  86736939    9   52S26M73S   22  16606471    0   AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACCACCCCCACACACCCACACACCCACACAACTGGAGCCCAGCAAGCACCACCCGCCCGACCGCGAAGACAAGCCGAGGAGCAGAGCAGACACGAAAGAAGGG AA<A<AAFAFAF<<AFAFFKKA,,7A,77,,F<<<F,AK7A7,A,AFFA<<,F7,<(,,AAFF7FFK,F,7<,<,7,,,,,,,,777FF,,,,,77,,,,,((((((,,,((,(,,,,7,,,,7<7(,,,,,7,,,,,,,,,,,,,,,,,, AS:i:26 XS:i:23 SA:Z:5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0; ci:i:1775658    MQ:i:0  MS:i:637    MC:i:16606545   MD:Z:26 NM:i:0  RG:Z:1-C42D3D1
1748395:6:2224:32617:20858  145 22  16606471    0   76S20M55S   16  86736939    0   GTTATCAATCACACCCCATCGCCAGATCACCATTCTCAAACTATCCGTCTCCCAGTCTCTAATACATTGGCGTGGGTGCTGCTGCGTTCTGGGTGTCGCCTCTTTCTTGTTCTGCGCTGGGGGCCGCGTGTGATGTTTGGCGTGTTCCGGG ,,,,,,,,,,,,,,,,,,,(,,,,,,,,,,,,,,,A<7,,77,,,,,A,,,,,77,7,,,,,7,,,7,(((,7(7,,,,,(,7A,,,,,A,,,,(,(,,,,,,,,77F7,,(,,(((((,(,(,(,7,7,,,7,,7,,(,,,,7<,,,,,, AS:i:20 XS:i:20 ci:i:1775658    MQ:i:9  MS:i:2157   MC:i:86736887   MD:Z:20 NM:i:0  RG:Z:1-C42D3D1

ADD REPLY • link updated 6.7 years ago by GenoMax 147k • written 6.7 years ago by adampennycuick ▴ 140

0

I don't know exactly how samtools flagstat works, but if it counts supplementary alignments (the records with flag 2145) in its total number of reads, then it is correct for your final fastq files to contain fewer reads.

Do you know if these are DNAseq or RNAseq reads? How were they aligned?
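
For anyone unsure how to read these flags, here is a minimal Python sketch that decodes a SAM FLAG value into its individual bits. The bit values are the ones defined in the SAM specification; the function name decode_flag and the short labels are my own, chosen only for illustration.

```python
# SAM FLAG bits as defined in the SAM specification.
# The short labels are illustrative names, not official identifiers.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# The three flag values that appear in the grep output above.
for flag in (2145, 97, 145):
    print(flag, decode_flag(flag))
```

Flag 2145 differs from flag 97 only by the 0x800 (supplementary) bit, which is why those three records belong to the same read 1 as the flag 97 record.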

ADD REPLY • link 6.7 years ago by h.mon 35k

0

I think you have cracked it - the supplementary alignments are being lost.

However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome. As I understand it, supplementary reads may indicate structural variation; these are cancer samples so I would expect some structural variation. I don't want to lose this information on realigning my sample. The fact that a lot of these supplementary reads are in HLA regions suggests they could have a significant impact on my analysis.

Is it possible to extract to fastq and include these reads? I don't unfortunately have access to the original unaligned fastq files.

ADD REPLY • link 6.7 years ago by adampennycuick ▴ 140

1

As long as you are recovering all unique read identifiers (including their origin R1/R2) that are present in your BAM file there is not much more you can do.
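
One way to check this, sketched in Python under the assumption that the data are paired-end: count the unique (read name, R1/R2) combinations among the primary records and compare that with the number of reads in the fastq files. The records list below holds only the names and flags of the five records posted earlier in this thread, and expected_fastq_reads is a hypothetical helper name.

```python
# (read name, FLAG) for the five records shown earlier in the thread.
records = [
    ("6:2224:32617:20858", 2145),
    ("6:2224:32617:20858", 2145),
    ("6:2224:32617:20858", 2145),
    ("6:2224:32617:20858", 97),
    ("6:2224:32617:20858", 145),
]

SECONDARY = 0x100
SUPPLEMENTARY = 0x800
FIRST_IN_PAIR = 0x40


def expected_fastq_reads(records):
    """Collect unique (name, mate) keys, ignoring non-primary records.

    Assumes paired-end data: anything not flagged first-in-pair is
    treated as read 2.
    """
    reads = set()
    for name, flag in records:
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # extra alignment of a read that is already counted
        mate = 1 if flag & FIRST_IN_PAIR else 2
        reads.add((name, mate))
    return reads


reads = expected_fastq_reads(records)
print(len(records), "BAM records ->", len(reads), "fastq reads")
```

If the number of unique keys matches the fastq read count, the difference from the record total comes from the extra alignment records and not from lost reads.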

ADD REPLY • link 6.7 years ago by GenoMax 147k

0

Could it be that you have secondary alignments in your lib_002_map_map.bam file? This could mess with the whole thing. You can check for secondary alignments using samtools flagstat.

ADD REPLY • link 6.7 years ago by Carlo Yague 8.9k

0

Thanks, but I don't think this is it: samtools flagstat reports no secondary alignments.

ADD REPLY • link 6.7 years ago by adampennycuick ▴ 140

0

Can you try reformat.sh from the BBMap suite instead of bam2fastq?

Something like: reformat.sh in=lib_002_mapped.sort.bam out1=lib_002_mapped.1.fastq out2=lib_002_mapped.2.fastq verifypaired=t primaryonly=t

Additional options you may want to try with original files:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f           Toss secondary alignments.  Set this to true for sam to fastq conversion.

ADD REPLY • link 6.7 years ago by GenoMax 147k

score 4 · Answer 1 · 2018-03-15

I will summarize the discussion above as an answer:

Looking at the problematic reads (samtools view file.bam | grep -n "6:2224:32617:20858") revealed they are supplementary alignments.

"I think you have cracked it - the supplementary alignments are being lost. However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome."

You are right that supplementary alignments may indicate structural variants, but each supplementary record is a (partial) copy of the primary read, at least in the example you selected. If you look at the records you grepped, the sequences of all three supplementary alignments (flag 2145) are contained within the corresponding primary read (flag 97), so you are recovering all original reads with your procedure. The primary alignment is soft-clipped, so the complete read is present in the SAM record. Supplementary alignments are hard-clipped, so only a fragment of the original read is represented (but I think there are BWA flags that may change this behaviour).
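
The containment can be checked directly. Below is a rough Python sketch using the sequences and CIGAR strings copied from the grep output earlier in the thread; the helper names leading_clip and contained_in_primary are hypothetical. It assumes all records are on the forward strand (true here, since none of these flags has the 0x10 bit set); a reverse-strand record would need reverse-complementing first.

```python
import re

# SEQ of the primary alignment (flag 97, CIGAR 52S26M73S), copied from the thread.
primary_seq = (
    "AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACCACCCCCACACACCCACACACCC"
    "ACACAACTGGAGCCCAGCAAGCACCACCCGCCCGACCGCGAAGACAAGCCGAGGAGCAGAGCAGACACGAAAG"
    "AAGGG"
)

# (CIGAR, SEQ) of the three supplementary alignments (flag 2145).
supplementary = [
    ("15H23M113H", "CACCCACCCACACCCCCCCACAC"),
    ("31H20M100H", "CCCACACCCACAACCCCACC"),
    ("4H21M126H", "ACCCCCCCCCCCACCCACCCA"),
]


def leading_clip(cigar):
    """Number of bases clipped (hard or soft) at the start of the read."""
    match = re.match(r"(\d+)[HS]", cigar)
    return int(match.group(1)) if match else 0


def contained_in_primary(cigar, seq, primary):
    """True if seq sits in the primary read at the offset implied by the clip.

    Assumes forward-strand records, so no reverse-complementing is done.
    """
    start = leading_clip(cigar)
    return primary[start:start + len(seq)] == seq


for cigar, seq in supplementary:
    print(cigar, contained_in_primary(cigar, seq, primary_seq))
```

The leading hard-clip length in each supplementary CIGAR gives the offset of that fragment within the full read, so each fragment can be compared against the matching slice of the primary sequence.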

There is a discussion on the samtools GitHub issues page about what supplementary reads are.

score 1 · Answer 2 · 2018-03-14

1

6.7 years ago

swbarnes2 14k

A flag of 2145 indicates supplementary (not secondary) alignments. I'd filter those out first.

ADD COMMENT • link 6.7 years ago by swbarnes2 14k

score 0 · Answer 3 · 2018-03-14

0

6.7 years ago

Devon Ryan 104k

At least with samtools fastq you seem to be forgetting -s, which is where the missing reads would be.

ADD COMMENT • link 6.7 years ago by Devon Ryan 104k

0

Thanks Devon, but that's not it. When I add the -s option it returns an empty file. And this doesn't explain the behaviour of bamToFastq.

ADD REPLY • link 6.7 years ago by adampennycuick ▴ 140