Extracting paired FASTQ read data from a BAM mapping file

Question

Converting BAM to Fastq - losing reads

5

Entering edit mode

7.4 years ago

adampennycuick ▴ 140

Hi all,

I am trying to realign a whole genome BAM file from one reference genome to another. The reason for this is that I am interested in HLA regions, and the original reference genome does not include these regions. The process involves converting the name-sorted BAM file to fastq, then realigning the fastq to a new reference.

I seem to be losing reads when converting from BAM to fastq. I have tried a number of ways to do this, including:

samtools fastq -1 < file1.fq > -2 < file2.fq > < input.bam >
bamToFastq -i < input.bam > -fq < file1.fq > -fq2 < file2.fq >
Following the process here:
Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.

Extracting paired FASTQ read data from a BAM mapping file

Sometimes FASTQ data is aligned to a reference and stored as a BAM file, instead of the normal FASTQ read files. This is okay, because it is possible to recreate raw FASTQ files based on the BAM file. The following outlines this process. The useful software samtools and bedtools are both required.

From each bam, we need to extract:

reads that mapped properly as pairs

reads that didn’t map properly as pairs (both didn’t map, or one didn’t map)

For #1, the following command will work. This was taken from this webpage.

samtools view -u -f 1 -F 12 lib_002.sorted.md.bam > lib_002_map_map.bam

The -f and -F filter using flags in column 2 of the BAM file. These aren't always intuitive, and I won't describe them more here, but you can use this handy tool to better understand. Also note that the -u flag creates uncompressed BAM output rather than default compressed BAM output, so the files will be larger. This helps with quicker reading in later steps, but it isn't necessary to include this if you want to save disk space. samtools is super fast either way.

Resolving #2 is more complicated, as there are three ways a read might not have mapped as a proper pair. A. The first read mapped but the paired read did not. B. The first read did not map but the paired read did. C. Neither paired read mapped at all. Again, flags will be used to filter the original BAM file. This information was found at this webpage.

# R1 unmapped, R2 mapped samtools view -u -f 4 -F 264 lib_002.sorted.md.bam > lib_002_unmap_map.bam # R1 mapped, R2 unmapped samtools view -u -f 8 -F 260 lib_002.sorted.md.bam > lib_002_map_unmap.bam # R1 & R2 unmapped samtools view -u -f 12 -F 256 lib_002.sorted.md.bam > lib_002_unmap_unmap.bam

As you might expect, you have to then merge the three files that contain at least one unmapped pair.

samtools merge -u lib_002_unmapped.bam lib_002_unmap_map.bam lib_002_map_unmap.bam lib_002_unmap_unmap.bam

Next, these BAM files must be resorted so that they are ordered by read ID instead of location in the reference.

samtools sort -n lib_002_map_map.bam lib_002_mapped.sort samtools sort -n lib_002_unmapped.bam lib_002_unmapped.sort

At this time, it is a good idea to check that you have the correct number of reads and no redundancy. You can summarize the original BAM file to get an idea of where you started.

samtools flagstat lib_002.sorted.md.bam ## output # 160673944 + 0 in total (QC-passed reads + QC-failed reads) # 44648692 + 0 duplicates # 136861465 + 0 mapped (85.18%:-nan%) # 160673944 + 0 paired in sequencing # 80336972 + 0 read1 # 80336972 + 0 read2 # 123173914 + 0 properly paired (76.66%:-nan%) # 123173914 + 0 with itself and mate mapped # 13687551 + 0 singletons (8.52%:-nan%) # 37581548 + 0 with mate mapped to a different chr # 28422566 + 0 with mate mapped to a different chr (mapQ>=5)

Notice the toal number of input reads that is found on the first line. You want to be sure that the number of unmapped and mapped reads total this number. It is easy to check using the following commands.

samtools view -c lib_002_mapped.sort.bam ## output # 123173914 samtools view -c lib_002_unmapped.sort.bam ## output # 37500030

Note that one paired read is counted as two reads here. If you sum these two numbers, they should equal the number you noted above, as they do here.

If all is good, you can now extract the FASTQ reads into two paired read files, as follows.

bamToFastq -i lib_002_mapped.sort.bam -fq lib_002_mapped.1.fastq -fq2 lib_002_mapped.2.fastq bamToFastq -i lib_002_unmapped.sort.bam -fq lib_002_unmapped.1.fastq -fq2 lib_002_unmapped.2.fastq

And then it also makes sense to combine both the first and paired reads together from the mapped and unmapped files.

cat lib_002_mapped.1.fastq lib_002_unmapped.1.fastq > lib_002.1.fastq cat lib_002_mapped.2.fastq lib_002_unmapped.2.fastq > lib_002.2.fastq

These two files should now have the same number of reads that are exactly as you would have received them if they had come directly from the sequencer as FASTQ.

Please also note that all of the commands above can be piped together in bash using |, which will save on disk space and time. So it is best to combine commands where possible.
view raw extract_fastq_bam.md hosted with ❤ by GitHub

In each case the number of reads in my output fastq file (counted using wc -l <file> / 4) is slightly less than the original BAM file (counted using samtools flagstat).

When using bamToFastq I get several errors like this:

*****WARNING: Query 6:1219:30638:3260 is marked as paired, but its mate does not occur next to it in your BAM file.  Skipping.

I suspect this is the cause of my read loss. Most of these seem to be in chromosome 6, which is my region of interest. I have tried using samtools fixmate, but still get this same error.

Any ideas would be greatly appreciated!

Many thanks

alignment • 13k views

ADD COMMENT • link updated 7.4 years ago by h.mon 35k • written 7.4 years ago by adampennycuick ▴ 140

0

Entering edit mode

Did you check some of the problematic reads on the original bam file? Something like:

samtools view file.bam | grep "6:1219:30638:3260"

It could be useful to add -n to grep to check the line number, specially for the name-ordered files.

ADD REPLY • link 7.4 years ago by h.mon 35k

0

Entering edit mode

Good thought. The output is below for one of the coordinates which gives an error, for a sorted file. I am not great at interpreting these data, but perhaps the problem here is that there are an odd number of reads mapping to these coordinates so they cannot be appropriately paired? Do you know how I could fix this?

samtools view file.bam | grep -n "6:2224:32617:20858"
1748391:6:2224:32617:20858  2145    5   177974622   0   15H23M113H  22  16606471    0   CACCCACCCACACCCCCCCACAC FAFFKKA,,7A,77,,F<<<F,A AS:i:23 XS:i:23 SA:Z:16,86736939,+,52S26M73S,9,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0;  ci:i:1775658    MD:Z:23 NM:i:0  RG:Z:1-C42D3D1
1748392:6:2224:32617:20858  2145    12  130616201   1   31H20M100H  22  16606471    0   CCCACACCCACAACCCCACC    F<<<F,AK7A7,A,AFFA<<    AS:i:20 XS:i:19 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;   ci:i:1775658    MD:Z:20 NM:i:0  RG:Z:1-C42D3D1
1748393:6:2224:32617:20858  2145    16  3539896 0   4H21M126H   22  16606471    0   ACCCCCCCCCCCACCCACCCA   <AAFAFAF<<AFAFFKKA,,7   AS:i:21 XS:i:21 SA:Z:16,86736939,+,52S26M73S,9,0;5,177974622,+,15S23M113S,0,0;12,130616201,+,31S20M100S,1,0;    ci:i:1775658    MD:Z:21 NM:i:0  RG:Z:1-C42D3D1
1748394:6:2224:32617:20858  97  16  86736939    9   52S26M73S   22  16606471    0   AAACACCCCCCCCCCCACCCACCCACACCCCCCCACACCCACAACCCCACCACCCCCACACACCCACACACCCACACAACTGGAGCCCAGCAAGCACCACCCGCCCGACCGCGAAGACAAGCCGAGGAGCAGAGCAGACACGAAAGAAGGG AA<A<AAFAFAF<<AFAFFKKA,,7A,77,,F<<<F,AK7A7,A,AFFA<<,F7,<(,,AAFF7FFK,F,7<,<,7,,,,,,,,777FF,,,,,77,,,,,((((((,,,((,(,,,,7,,,,7<7(,,,,,7,,,,,,,,,,,,,,,,,, AS:i:26 XS:i:23 SA:Z:5,177974622,+,15S23M113S,0,0;16,3539896,+,4S21M126S,0,0;12,130616201,+,31S20M100S,1,0; ci:i:1775658    MQ:i:0  MS:i:637    MC:i:16606545   MD:Z:26 NM:i:0  RG:Z:1-C42D3D1
1748395:6:2224:32617:20858  145 22  16606471    0   76S20M55S   16  86736939    0   GTTATCAATCACACCCCATCGCCAGATCACCATTCTCAAACTATCCGTCTCCCAGTCTCTAATACATTGGCGTGGGTGCTGCTGCGTTCTGGGTGTCGCCTCTTTCTTGTTCTGCGCTGGGGGCCGCGTGTGATGTTTGGCGTGTTCCGGG ,,,,,,,,,,,,,,,,,,,(,,,,,,,,,,,,,,,A<7,,77,,,,,A,,,,,77,7,,,,,7,,,7,(((,7(7,,,,,(,7A,,,,,A,,,,(,(,,,,,,,,77F7,,(,,(((((,(,(,(,7,7,,,7,,7,,(,,,,7<,,,,,, AS:i:20 XS:i:20 ci:i:1775658    MQ:i:9  MS:i:2157   MC:i:86736887   MD:Z:20 NM:i:0  RG:Z:1-C42D3D1

ADD REPLY • link updated 7.4 years ago by GenoMax 153k • written 7.4 years ago by adampennycuick ▴ 140

0

Entering edit mode

I don't know exactly how samtools flagstat works, but if it is reporting supplementary alignments (the reads with 2145 flag) on its total number of reads, then it is correct to have a smaller number of reads on your final fastq files.

Do you know if these are DNAseq or RNAseq reads? How were they aligned?

ADD REPLY • link 7.4 years ago by h.mon 35k

0

Entering edit mode

I think you have cracked it - the supplementary alignments are being lost.

However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome. As I understand it, supplementary reads may indicate structural variation; these are cancer samples so I would expect some structural variation. I don't want to lose this information on realigning my sample. The fact that a lot of these supplementary reads are in HLA regions suggest they could have a significant impact on my analysis.

Is it possible to extract to fastq and include these reads? I don't unfortunately have access to the original unaligned fastq files.

ADD REPLY • link 7.4 years ago by adampennycuick ▴ 140

1

Entering edit mode

As long as you are recovering all unique read identifiers (including their origin R1/R2) that are present in your BAM file there is not much more you can do.

ADD REPLY • link 7.4 years ago by GenoMax 153k

0

Entering edit mode

Could it be that you have secondary alignments in your lib_002_map_map.bam file ? This could mess with the whole thing. You can check for secondary alignments using samtools flagstat.

ADD REPLY • link 7.4 years ago by Carlo Yague 9.0k

0

Entering edit mode

Thanks but I don't think this is it - there are no secondary alignments identified by samtools flagstat

ADD REPLY • link 7.4 years ago by adampennycuick ▴ 140

0

Entering edit mode

Can you try reformat.sh from BBMap suite instead of bam2fastq?

Something like: reformat.sh in=lib_002_mapped.sort.bam out1=lib_002_mapped.1.fastq out2=lib_002_mapped.2.fastq verifypaired=t primaryonly=t

Additional options you may want to try with original files:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f           Toss secondary alignments.  Set this to true for sam to fastq conversion.

ADD REPLY • link 7.4 years ago by GenoMax 153k

score 4 · Answer 1 · 2018-03-15

I will summarize the discussion above as an answer:

Looking at the problematic reads (samtools view file.bam | grep -n "6:2224:32617:20858") revealed they are supplementary alignments.

I think you have cracked it - the supplementary alignments are being lost. However, I don't think this is the behaviour that I want. These are DNAseq reads aligned using bwa mem. I want to extract ALL reads to fastq and realign to a new reference genome.

You are right in that supplementary reads may represent structural variants, but supplementary reads are a (partial) copy of the primary reads - at least in the example you selected. If you look at the reads you grepped, all three supplementary reads (flag 2145) are contained within the corresponding primary read (flag 97) - so you are recovering all original reads with your procedure. The primary reads alignment is soft-clipped, so the read is completely represented at the sam record. Supplementary alignments are hard-clipped, so only a fragment of the original read is represented (but I think there are BWA flags that may change this behaviour).

There is a discussion at the samtools github issues page about what are supplementary reads.

score 1 · Answer 2 · 2018-03-14

1

Entering edit mode

7.4 years ago

swbarnes2 15k

The flag of 2145 indicates supplementary (not secondary) alignments I'd filter those out first.

ADD COMMENT • link 7.4 years ago by swbarnes2 15k

score 0 · Answer 3 · 2018-03-14

0

Entering edit mode

7.4 years ago

Devon Ryan 105k

At least with samtools fastq you seem to be forgetting -s, which is where the missing reads would be.

ADD COMMENT • link 7.4 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks Devon, but that's not it. When I add the -s option it returns an empty file. And this doesn't explain the behaviour of bamToFastq.

ADD REPLY • link 7.4 years ago by adampennycuick ▴ 140