Question

Converting mapped + unmapped BAM files to raw FASTQ (RNA-seq data)

6.2 years ago

rodd ▴ 250

Hi all,

I need to analyse some RNA-seq data with a special aligner for repetitive elements, but the "raw" data from the cohort I am analysing came as aligned BAM files (mapped.bam + unmapped.bam files). I can obtain the raw FASTQ files from a concatenated BAM file, following this tutorial.

However, this still results in reads with a secondary alignment. I was wondering whether it would be OK to keep only read pairs in the BAM file that have primary alignments, thus discarding pairs in which either mate did not align, as well as reads that have additional alignments (otherwise they would be duplicated in the final FASTQ files). I can't see this being an issue... yet... but please let me know if this sounds correct.

I know there are several posts like this in this and other communities, but I did not manage to find a concise way of doing this yet.

Currently, my concatenated BAM file (mapped + unmapped BAM files) looks like the following:

$ samtools flagstat concatenated.bam

80893332 + 28760 in total (QC-passed reads + QC-failed reads)  
5509466 + 0 secondary 
0 + 0 supplementary 
0 + 0 duplicates 
74608978 + 0 mapped (92.23% : 0.00%) 
75383866 + 28760 paired in sequencing 
37950442 + 14107 read1
37433424 + 14653 read2
34757340 + 0 properly paired (46.11% : 0.00%) 
65502368 + 0 with itself and mate mapped
3597144 + 0 singletons (4.77% : 0.00%)
723510 + 0 with mate mapped to a different chr
429114 + 0 with mate mapped to a different chr (mapQ>=5)

BAM RNA-Seq fastq conversion bam2fastq • 4.7k views

updated 6.2 years ago by swbarnes2 14k • written 6.2 years ago by rodd ▴ 250

Hello rodd,

I would just merge the mapped.bam and unmapped.bam, sort the resulting file by read name using samtools sort -n and extract the reads to fastq using samtools fastq merged_name_sorted.bam|bgzip -c > all.fastq.gz.
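
A minimal sketch of those three steps as a shell function, for illustration only. The function name bam_to_fastq, the run helper, the DRY_RUN switch, and the output file names are assumptions and not part of the advice above; it also assumes the two BAM files have compatible headers. The subcommands and options used (samtools merge, samtools sort -n -o, samtools fastq with -F, -1, -2, -0, -s) are the standard ones.

```shell
# Sketch: merge mapped + unmapped BAMs, name-sort, and extract paired FASTQ.
# bam_to_fastq, run, DRY_RUN and the output names are hypothetical choices.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

bam_to_fastq() {
    mapped=$1; unmapped=$2; prefix=$3
    # 1. Combine the two BAM files into one.
    run samtools merge "${prefix}.merged.bam" "$mapped" "$unmapped" || return 1
    # 2. Sort by read name so that mates sit next to each other.
    run samtools sort -n -o "${prefix}.namesorted.bam" "${prefix}.merged.bam" || return 1
    # 3. Write read 1, read 2, reads flagged as both/neither, and singletons
    #    to separate files, skipping secondary (0x100) and supplementary (0x800) records.
    run samtools fastq -F 0x900 \
        -1 "${prefix}_1.fastq" -2 "${prefix}_2.fastq" \
        -0 "${prefix}_other.fastq" -s "${prefix}_singletons.fastq" \
        "${prefix}.namesorted.bam"
}
```

Calling `DRY_RUN=1 bam_to_fastq mapped.bam unmapped.bam sample` only prints the three commands, which is a cheap way to check them before running on a large file.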

See also my issue on samtools github.

fin swimmer

written 6.2 years ago by finswimmer 16k

Hi finswimmer, thank you for your prompt response, and for the link to your post on the samtools GitHub. I will be following your advice (and the advice from our colleague who also responded to the thread).

But just out of curiosity, I am still finding some discrepancies in the number of reads after converting to FASTQ. See below the number of reads in the BAM file and in my FASTQ files:

  $ samtools view -c -F 0x100 merged_sorted.bam        # exclude secondary alignments (0x100)
  75,412,626
  $ samtools fastq -0 merged_sorted_0.fastq -1 merged_sorted_1.fastq -2 merged_sorted_2.fastq -F 0x100 merged_sorted.bam
  [M::bam2fq_mainloop] discarded 0 singletons
  [M::bam2fq_mainloop] processed 75,412,626 reads
  $ wc -l merged_sorted_1.fastq
  148988112

  148988112 / 4 = 37,247,028 reads per file
  37,247,028 * 2 for paired reads = 74,494,056

So I have 918,570 reads missing in the fastq files (and they are not in samtools fastq -0 output or -s singletons_output).
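
The arithmetic above can be double-checked with shell integer arithmetic. This is only a sketch: the variable names are illustrative, the numbers are the ones quoted in this post, and it assumes the read 2 file has the same number of lines as the read 1 file.

```shell
# Numbers copied from the output above; variable names are illustrative.
bam_primary_records=75412626   # samtools view -c -F 0x100
fastq_lines=148988112          # wc -l on the read 1 FASTQ

reads_per_file=$(( fastq_lines / 4 ))    # 4 lines per FASTQ record
paired_reads=$(( reads_per_file * 2 ))   # read 1 file + read 2 file (assumed equal)
missing=$(( bam_primary_records - paired_reads ))

echo "reads per file: $reads_per_file"
echo "paired reads:   $paired_reads"
echo "unaccounted:    $missing"
```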

updated 6.2 years ago by finswimmer 16k • written 6.2 years ago by rodd ▴ 250

Your second command isn't filtering out secondary alignments. Don't you want to do that?

written 6.2 years ago by swbarnes2 14k

Sorry, that was a typo - I did include it in my command, and have updated my previous reply.

I am only outputting the reads I want to the FASTQ files, which is great. But I am still curious as to why ~1 million reads are removed during the BAM-to-FASTQ conversion, compared with the output of samtools view -c -F 0x100 merged_sorted.bam.

written 6.2 years ago by rodd ▴ 250

Count up how often each flag turns up in your BAM. Finswimmer's link suggests that -F 0x900 is turned on whether you want it or not, so maybe that's where your million reads are going.
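
One way to do that tally, as a rough sketch: the FLAG value is the second tab-separated column of SAM text, so it can be counted with standard Unix tools. The helper name count_flags is made up for this example.

```shell
# Tally the FLAG column (field 2) of SAM text read from stdin.
# count_flags is a hypothetical helper name; output is "FLAG<TAB>count".
count_flags() {
    cut -f2 | sort -n | uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Intended use (requires samtools; the file name is the one from this thread):
#   samtools view merged_sorted.bam | count_flags
```

Flags that do not have the read 1 or read 2 bit set as expected, or that carry the 0x100 or 0x800 bits, would be the first places to look.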

written 6.2 years ago by swbarnes2 14k

score 1 · Answer 1 · 2018-10-30

6.2 years ago

swbarnes2 14k

I'd do

samtools fastq -F 256 mydata.bam > mydata.fastq

This will discard secondary alignments, so each read, including unmapped reads, will only turn up once in your fastq.
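
To illustrate why unmapped reads survive this filter, here is a small sketch of the -F test itself: a record is dropped only if its FLAG shares at least one bit with the mask. The function name keeps_record and the two example FLAG values are chosen for illustration.

```shell
# Return success (0) if a record with FLAG $2 survives "-F $1",
# i.e. if none of the excluded bits are set. keeps_record is a hypothetical name.
keeps_record() {
    [ $(( $2 & $1 )) -eq 0 ]
}

# FLAG 77  = paired + unmapped + mate unmapped + read 1
# FLAG 355 = paired + proper pair + mate reverse + read 1 + secondary
for flag in 77 355; do
    if keeps_record 256 "$flag"; then
        echo "$flag kept"
    else
        echo "$flag dropped"
    fi
done
```

The unmapped record (77) has no 0x100 bit and is kept, while the secondary record (355) is dropped.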

written 6.2 years ago by swbarnes2 14k

Thank you! (Just for the record, this gave the same number of reads as when I used -F 0x900 or -F 0x100, as suggested by finswimmer. These flags are so confusing...)

written 6.2 years ago by rodd ▴ 250

0x100 is hexadecimal for 256. 0x900 additionally filters out supplementary alignments (0x800); you probably don't have any of those.
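
For anyone else confused by the two spellings, a short sketch of the conversion using shell arithmetic (the variable names are just for this example):

```shell
# Convert between the hexadecimal and decimal spellings of the flag masks.
secondary=$(( 0x100 ))                 # secondary alignment bit
supplementary=$(( 0x800 ))             # supplementary alignment bit
both=$(( secondary | supplementary ))  # combined exclusion mask

printf 'secondary:     %d\n' "$secondary"
printf 'supplementary: %d\n' "$supplementary"
printf 'both:          %d (0x%X)\n' "$both" "$both"
```

So -F 256 and -F 0x100 are the same filter, and -F 0x900 is the same as -F 2304.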