Hi,
I am working with RNA-seq. After mapping a couple of files with paired-end data against a reference genome I have obtained a .bam file from which I have extracted the unmapped reads into another .bam file with:
samtools view -b -f 4 01_BS-S2.bam > unmapped.bam
and, afterwards, I have sorted it with:
samtools sort -n unmapped.bam myout
Then, an inspection of its content with:
samtools fgstat myout.bam
gives the next result:
19052725 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
19052725 + 0 paired in sequencing
9587227 + 0 read1
9465498 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
in which I assume that there are single-end sequences unmapped due to incomplete alignments of some paired-ends (the number of read1 and read2 are different). In order to create a new partial transcriptome I need to convert this myout.bam into fastq files, so I execute:
samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq myout.bam
whose result is (no other output is produced but this):
[M::bam2fq_mainloop_singletontrack] discarded 2125818 singletons
[M::bam2fq_mainloop_singletontrack] processed 19052725 reads
in which the number of processed sequences matches with the output of the flagstat command. However, when I analyze the files p1.fastq and p2.fastq, I find that each of them contains 7920630 sequences, whereas the se.fastq contains the expected 2125818. Therefore, the number of sequences in these three files is: 7920630x2 + 2125818= 17967078 what does not match with the 19052725 processed.
Where are the remaining 1085647 reads (19052725 - 17967078)? Any idea?
Something is wrong - if you start with 19052725 reads, an odd number, then the number of singletons must also be odd. Not 2125818. Otherwise you'll have an odd number of reads left over to be paired... hm.
You must have 1666597 singletons from READ1 and 1544868 form READ2. That makes 3211465 singletons total, hm.
What happens if you run
samtools fastq -1 pe1.fq -2 pe2.fq -s se.fq -f 4 01_BS-S2.bam
?Also, could you run
wc -l *.fq
?I have run out of disk space (sorry for the delay in my replies). To execute
I have had to sort it previously, and I have checked that the amount of sequences matches in both .bam versions (sorted and unsorted). The number of sequences obtained in the .fq files are the same than in my previous message and the mismatch is the same. However, if I analyse the original .bam file with flagstat, I obtain:
which tells that there are 3211465 singletons as 5heikki stated; if we rest to this number (3211465) the number of single-end sequences given by the fastq command (2125818): 3211465 - 2125818 = 1085647 which is exactly the number of missing sequences.
Maybe the .bam files has not been generated in the "standard" way? I am using kallisto whose output is actually a pseudobam. However, up to now, I have been able to use many other tools with these bam files (as IGV) and I interpret the word "pseudbam" in the sense that mappings are not real alignments.