Question

Missing reads in BAM to Fastq conversion

1

Entering edit mode

8.6 years ago

sergiogalvezrojas ▴ 30

Hi,

I am working with RNA-seq. After mapping a couple of files with paired-end data against a reference genome I have obtained a .bam file from which I have extracted the unmapped reads into another .bam file with:

samtools view -b -f 4 01_BS-S2.bam > unmapped.bam

and, afterwards, I have sorted it with:

samtools sort -n unmapped.bam myout

Then, an inspection of its content with:

samtools fgstat myout.bam

gives the next result:

19052725 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
19052725 + 0 paired in sequencing
9587227 + 0 read1
9465498 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

in which I assume that there are single-end sequences unmapped due to incomplete alignments of some paired-ends (the number of read1 and read2 are different). In order to create a new partial transcriptome I need to convert this myout.bam into fastq files, so I execute:

samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq myout.bam

whose result is (no other output is produced but this):

[M::bam2fq_mainloop_singletontrack] discarded 2125818 singletons
[M::bam2fq_mainloop_singletontrack] processed 19052725 reads

in which the number of processed sequences matches with the output of the flagstat command. However, when I analyze the files p1.fastq and p2.fastq, I find that each of them contains 7920630 sequences, whereas the se.fastq contains the expected 2125818. Therefore, the number of sequences in these three files is: 7920630x2 + 2125818= 17967078 what does not match with the 19052725 processed.

Where are the remaining 1085647 reads (19052725 - 17967078)? Any idea?

RNA-Seq samtools • 5.4k views

ADD COMMENT • link updated 7.5 years ago by wongfay0207 • 0 • written 8.6 years ago by sergiogalvezrojas ▴ 30

1

Entering edit mode

Something is wrong - if you start with 19052725 reads, an odd number, then the number of singletons must also be odd. Not 2125818. Otherwise you'll have an odd number of reads left over to be paired... hm.

You must have 1666597 singletons from READ1 and 1544868 form READ2. That makes 3211465 singletons total, hm.

What happens if you run samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq -f 4 01_BS-S2.bam?

Also, could you run wc -l *.fq ?

ADD REPLY • link 8.6 years ago by John 13k

1

Entering edit mode

I have run out of disk space (sorry for the delay in my replies). To execute

samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq -f 4 01_BS-S2.bam

I have had to sort it previously, and I have checked that the amount of sequences matches in both .bam versions (sorted and unsorted). The number of sequences obtained in the .fq files are the same than in my previous message and the mismatch is the same. However, if I analyse the original .bam file with flagstat, I obtain:

59333204 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
40280479 + 0 mapped (67.89%:-nan%)
59333204 + 0 paired in sequencing
29666602 + 0 read1
29666602 + 0 read2
37069014 + 0 properly paired (62.48%:-nan%)
37069014 + 0 with itself and mate mapped
3211465 + 0 singletons (5.41%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

which tells that there are 3211465 singletons as 5heikki stated; if we rest to this number (3211465) the number of single-end sequences given by the fastq command (2125818): 3211465 - 2125818 = 1085647 which is exactly the number of missing sequences.

Maybe the .bam files has not been generated in the "standard" way? I am using kallisto whose output is actually a pseudobam. However, up to now, I have been able to use many other tools with these bam files (as IGV) and I interpret the word "pseudbam" in the sense that mappings are not real alignments.

ADD REPLY • link 8.6 years ago by sergiogalvezrojas ▴ 30

score 1 · Answer 1 · 2016-05-04

1

Entering edit mode

8.6 years ago

5heikki 11k

Perhaps these reads correspond to the -0 option..

Usage: samtools fastq [options...] <in.bam>
Options:
  -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
  -1 FILE   write paired reads flagged READ1 to FILE
  -2 FILE   write paired reads flagged READ2 to FILE
  -f INT    only include reads with all bits set in INT set in FLAG [0]
  -F INT    only include reads with none of the bits set in INT set in FLAG [0]
  -n        don't append /1 and /2 to the read name
  -O        output quality in the OQ tag if present
  -s FILE   write singleton reads to FILE [assume single-end]
  -t        copy RG, BC and QT tags to the FASTQ header line
  -v INT    default quality score if not given in file [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

ADD COMMENT • link 8.6 years ago by 5heikki 11k

1

Entering edit mode

Yes, it's not very clear how -0 differs from -s :/

Sounds to me like -0 reads are reads which are not singletons, but SE reads (or singletons after FixMateInformation)

ADD REPLY • link 8.6 years ago by John 13k

1

Entering edit mode

Well, I have executed the command with the -0 file (just in case) but I obtain a file whose length is 0 (nothing inside).

ADD REPLY • link 8.6 years ago by sergiogalvezrojas ▴ 30

0

Entering edit mode

Yeah that's what I would have thought, since it sounds to me like its for SE reads. In that case try the other stuff I mentioned above :)

ADD REPLY • link 8.6 years ago by John 13k

0

Entering edit mode

What does -0 do in your samtools? I noticed you're running some old version because fgstat instead of flagstat in the other command (or maybe it's just a typo)..

ADD REPLY • link 8.6 years ago by 5heikki 11k

0

Entering edit mode

Yes, it was a typo. I have tried to use many tools and I track a log of them. Currently, I am using the version 1.3.1 in Windows thru Cygwin, and the version 1.19 in a SLURM cluster when I need more speed. However, the results I have posted correspond conpletely to the 1.3.1 version.

ADD REPLY • link 8.6 years ago by sergiogalvezrojas ▴ 30

score 0 · Answer 2 · 2017-07-07

0

Entering edit mode

7.5 years ago

wongfay0207 • 0

I ran into the same problem. The reads in short ranges from 100M-200M. Seems only 10% reads have been output into fq files though claim has processed the whole bam files. Don't know the criteria how those reads were selected, with all setting s default. What is the solution here?

ADD COMMENT • link 7.5 years ago by wongfay0207 • 0

0

Entering edit mode

Please do not use SUBMIT ANSWER to add new questions to an existing thread. That is reserved for new answers to the original question at the top of this thread.

In addition, your post is light on useful details. You have not said what program was used, what the output refers to and what is "short ranges". Please clarify all those things and ideally post this as a new thread if it is not directly related to this one.

ADD REPLY • link 7.5 years ago by GenoMax 148k