Question

Why do I lose sequences when converting my bam files into fastq?

1

3.5 years ago

DNAngel ▴ 250

Hi all,

I used bamToFastq to convert my BAM files into R1 and R2 FASTQ files. I tried `samtools fastq` first but found it unreliable (many sequences were missing). However, I still have an issue with bamToFastq from bedtools. When I add the read counts from `grep -c "@" R1.fastq` and `grep -c "@" R2.fastq`, the total is always slightly less than the count from `samtools view -c in.bamfile`. Why might this be the case? I haven't found any documentation suggesting that this is normal. The R1 and R2 FASTQ counts should add up to the count in the BAM file, so what am I doing wrong with the conversion?

Thank you.

fastq bam bamToFastq • 2.1k views

ADD COMMENT • link updated 3.5 years ago by swbarnes2 14k • written 3.5 years ago by DNAngel ▴ 250

3

@ is a valid quality encoding character in FASTQ QUAL lines, so you may want to use the simpler line_count / 4 formula to count the number of reads in the FASTQ.
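A minimal sketch of the line-count approach, assuming an uncompressed, well-formed FASTQ with exactly four lines per record. The function name `count_fastq_reads` and the demo reads are made up for illustration.

```shell
# Count FASTQ records as (number of lines) / 4.
# Assumes exactly four lines per record (no wrapped sequence lines).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up demo file: the quality line of the second record contains "@".
demo_fastq=$(mktemp)
printf '%s\n' \
    '@read1/1' 'ACGT' '+' 'IIII' \
    '@read2/1' 'TTGA' '+' '@II@' \
    > "$demo_fastq"

count_fastq_reads "$demo_fastq"   # records by line count
grep -c "@" "$demo_fastq"         # every line that contains "@"
```

On this demo file the line-count method reports 2 records while `grep -c "@"` reports 3. Note that this kind of miscount inflates the FASTQ total, so on its own it would not explain a FASTQ count that is lower than the BAM count.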

ADD REPLY • link 3.5 years ago by Ram 44k

0

Or simply include whatever follows the @ in your grep pattern, e.g. @A00500 (generally the sequencer serial number). That will only count line 1 of each FASTQ record.
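A sketch of the prefix-based count, assuming every read name in the file starts with the same instrument ID. `count_by_prefix` is a hypothetical helper, the demo headers are invented, and `A00500` is just the example ID from above.

```shell
# Count FASTQ header lines by a known read-name prefix.
# "^" anchors the match at the start of the line, so an "@" elsewhere
# in a quality string is not counted.
count_by_prefix() {
    # $1 = read-name prefix without the leading "@", $2 = FASTQ file
    grep -c "^@$1" "$2"
}

# Made-up demo records; the second quality line starts with "@".
demo_fastq=$(mktemp)
printf '%s\n' \
    '@A00500:12:HXXXX:1:1101:1000:2000 1:N:0:1' 'ACGT' '+' 'FFFF' \
    '@A00500:12:HXXXX:1:1101:1001:2000 1:N:0:1' 'TTGA' '+' '@FFF' \
    > "$demo_fastq"

count_by_prefix A00500 "$demo_fastq"
```

The prefix is interpreted as a regular expression, so this is only safe for plain alphanumeric instrument IDs.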

ADD REPLY • link 3.5 years ago by GenoMax 148k

0

Yeah, I tried both ways, same result. There are still fewer reads in my FASTQ files than in my BAM files.

ADD REPLY • link 3.5 years ago by DNAngel ▴ 250

1

3.5 years ago

swbarnes2 14k

Are you absolutely sure that every single read in the BAM exists on one and only one line? That you have zero supplementary or secondary alignments?

ADD COMMENT • link 3.5 years ago by swbarnes2 14k

0

Agreed with this response. I have encountered this in my work; e.g. after STAR alignment, my BAM file contains multimapping reads (a read that aligns to two locations can appear twice or more in the BAM file).

Why don't you try `samtools view in.bamfile | cut -f1 | sort -u | wc -l` to get a count of the number of unique read names in your BAM file?

ADD REPLY • link 3.5 years ago by dsull ★ 7.0k

0

Tried this but it cuts out about half of the sequences. I used sambamba to obtain my BAM files since it is faster than samtools. For the mapped reads I used `sambamba view -f bam -F 'not (unmapped or mate_is_unmapped)' in.bam`, and for the unmapped reads the opposite filter (I just took the `not` out). For example, `samtools view -c in.bam` gives 436840 reads, but your command gives 218264.
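A likely reason for the roughly halved number: in paired-end data both mates share one read name, so counting unique names counts pairs rather than reads (2 × 218264 = 436528, which is close to 436840). The sketch below counts distinct (read name, mate) combinations instead. It assumes SAM text on stdin, as printed by `samtools view in.bam` without a header; the function names and demo records are invented for illustration.

```shell
# Count distinct read names in SAM records read from stdin.
count_unique_names() {
    cut -f1 | sort -u | wc -l | tr -d ' '
}

# Count distinct (read name, mate) combinations.
# FLAG bit 0x40 (64) marks the first mate, 0x80 (128) the second.
count_unique_mates() {
    awk -F '\t' '{
        mate = (int($2 / 64) % 2) ? 1 : ((int($2 / 128) % 2) ? 2 : 0)
        key = $1 "\t" mate
        if (!(key in seen)) { seen[key] = 1; n++ }
    } END { print n + 0 }'
}

# Made-up demo: two read pairs, plus one secondary (FLAG 355)
# and one supplementary (FLAG 2195) record.
demo_sam=$(mktemp)
{
    printf 'read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n'
    printf 'read1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n'
    printf 'read1\t355\tchr2\t500\t0\t4M\t=\t600\t104\tACGT\tIIII\n'
    printf 'read2\t99\tchr1\t300\t60\t4M\t=\t400\t104\tTTGA\tIIII\n'
    printf 'read2\t147\tchr1\t400\t60\t4M\t=\t300\t-104\tTTGA\tIIII\n'
    printf 'read2\t2195\tchr3\t700\t60\t4M\t=\t300\t0\tTTGA\tIIII\n'
} > "$demo_sam"

count_unique_names < "$demo_sam"   # pairs
count_unique_mates < "$demo_sam"   # individual reads
```

With a real file this would be run as `samtools view in.bam | count_unique_mates`. On the six demo records it reports 2 unique names and 4 unique reads; the two extra records are the secondary and supplementary alignments.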

ADD REPLY • link 3.5 years ago by DNAngel ▴ 250

0

I checked, I do have secondary alignments. So I am now curious to know how bamToFastq works. When I check for my reads and exclude secondary alignments in my bamfile, the number makes sense. It matches the number of R1 and R2 fastq reads. How does bamToFastq ignore secondary alignments during conversion? I just want to be sure it is okay and I can continue with my fastq files confident that no secondary alignments were included.

ADD REPLY • link 3.5 years ago by DNAngel ▴ 250

0

It's likely ignoring all reads that are flagged as secondary or supplementary. This wouldn't work correctly if you modified your BAM and removed primary alignments. If you really want to be sure, extract and count the number of unique read names.

ADD REPLY • link 3.5 years ago by swbarnes2 14k

score 2 · Accepted Answer · 2021-06-11

2

3.5 years ago

Pierre Lindenbaum 164k

> it is always slightly less than the count from the bamfile samtools view -c in.bamfile. Why might this be the case?

Your BAM contains supplementary and/or secondary alignments.
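To check this on a real file, `samtools view -c -F 0x900 in.bam` counts only records that are neither secondary (FLAG 0x100) nor supplementary (FLAG 0x800). The sketch below gives a fuller breakdown from SAM text on stdin. `tally_alignment_types` is a hypothetical helper and the demo records are invented.

```shell
# Tally SAM records from stdin by alignment type, using the FLAG field.
# 0x800 (2048) = supplementary, 0x100 (256) = secondary, otherwise primary.
tally_alignment_types() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines, if present
        {
            if (int($2 / 2048) % 2)     supp++
            else if (int($2 / 256) % 2) sec++
            else                        prim++
        }
        END {
            printf "primary=%d secondary=%d supplementary=%d\n", prim, sec, supp
        }'
}

# Made-up demo: four primary records, one secondary, one supplementary.
demo_sam=$(mktemp)
{
    printf 'read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n'
    printf 'read1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n'
    printf 'read1\t355\tchr2\t500\t0\t4M\t=\t600\t104\tACGT\tIIII\n'
    printf 'read2\t99\tchr1\t300\t60\t4M\t=\t400\t104\tTTGA\tIIII\n'
    printf 'read2\t147\tchr1\t400\t60\t4M\t=\t300\t-104\tTTGA\tIIII\n'
    printf 'read2\t2195\tchr3\t700\t60\t4M\t=\t300\t0\tTTGA\tIIII\n'
} > "$demo_sam"

tally_alignment_types < "$demo_sam"
```

With a real file this would be run as `samtools view in.bam | tally_alignment_types`. If the primary count equals the R1 + R2 FASTQ total, the missing records are accounted for by secondary and supplementary alignments, which repeat a read that already has a primary record.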

ADD COMMENT • link 3.5 years ago by Pierre Lindenbaum 164k

0

~~https://www.biostars.org/p/9475071/~~ silly me :-P

wanted to link:

losing reads bam to fastq

ADD REPLY • link 3.5 years ago by ATpoint 86k

0

Is this something that I can safely ignore then? I used sambamba to get my BAM files because samtools was just unbelievably slow.

ADD REPLY • link 3.5 years ago by DNAngel ▴ 250

0

Yes, you are right. There are secondary alignments present. How does bamToFastq know not to convert these reads into FASTQ reads, then?

ADD REPLY • link 3.5 years ago by DNAngel ▴ 250