Question

calculate nucleotides frequencies from mapped reads in a BAM file

0

Entering edit mode

3.5 years ago

xiaoleiusc ▴ 140

Hi, Biostars forum members,

I have a sorted BAM file generated from mRNA-seq experiments (aligned to human genome hg38). I would like to calculate the mono- and dinucleotides frequencies of the mapped reads in the BAM file (e.g. frequencies of As, Cs, Gs, Ts; and frequencies of AAs, ACs, ....). What I did was first convert the BAM file to bed file with bedtools, and then convert the bed file to fasta file with bedtools, and finally calculate the nucleotides frequencies using the generated fasta file from the previous step.

I feel the whole process takes a long time, especially in the step of converting the bed file to the fasta file. I was wondering if there is an easier way to do it? Any input is appreciated.

Thanks ahead,

Xiao

RNA-seq • 2.2k views

ADD COMMENT • link updated 3.5 years ago by Carlo Yague 9.0k • written 3.5 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Try igvtools for single nucleotide counts and for RNAseq data, you need to furnish extra parameter while using IGVtools.

ADD REPLY • link 3.5 years ago by cpad0112 21k

score 1 · Answer 1 · 2021-10-28

1

Entering edit mode

3.5 years ago

Carlo Yague 9.0k

Indeed, there are more direct ways to do it since sequence information is directly available in the bam file. You can try this bam2nuc script. It calculates the (di-)nucleotide coverage of the reads but it uses the genomic sequence rather than the observed sequence in the reads themselves.

If you really need (di-)nucleotide coverage of the reads themselve (not of the genomic reference), you could try to expend these perl/awk one-liners that directly look at the sequence in the bam file: Gc Content From Bam .

ADD COMMENT • link 3.5 years ago by Carlo Yague 9.0k

0

Entering edit mode

Hi, Carlo, thanks for your suggestions with bam2nuc. I run bam2nuc for a sorted bam file (generated by Rsubread and sorted by Rsamtools) aligned to hg38 from mRNA-seq (pair-end). The bam2nuc only calculates the reference genome nucleotides frequencies and then failed with message:

**Determined the file to be single-end

failed to detect valid Bismark FLAG tag: 83**

ADD REPLY • link 3.5 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Looks like the tool thinks your bam file is single-ended, while it is in fact paired-ended according to the flag.

ADD REPLY • link 3.5 years ago by Carlo Yague 9.0k

1

Entering edit mode

Here I paste the developer's response to my problem (from Dr. Felix Krueger.):

I am not entirely certain but I think bam2nuc expects a file that has been aligned with Bismark as input. There have also been issues with the file type auto-detection, which accidentally recognised files as single-end when they actually were paired end (which version were you running?). I suppose it might be able to be an issue that could be 'fixed', but it is really meant to work with Bismark files only...

I thought this could be applied to any sorted bam file with a reference genome but clearly not.

Xiao

ADD REPLY • link 3.5 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Thanks for your feedback on this.

It should be possible to hack bam2nuc to take your bam file as input(for instance by bypassing the file type auto-detection and pre-filtering your bam with the flags bam2nuc expects), but it might be a waste of time considering that you already have a method that works.

ADD REPLY • link 3.5 years ago by Carlo Yague 9.0k