Question

Mismatch number of reads after bedtools bam2fastq on RNA-seq data

0

Entering edit mode

7.0 years ago

aditi.qamra ▴ 270

Hi,

I am trying to generate fastqs ( to realign them with different parameters) from RNA-seq bams aligned using STAR (run with --outSAMunmapped Within flag). The original fastq was paired end and stranded and I don't have access to that.

I used bedtools bam2fastq ( bedtools bamtofastq -i $bam -fq R1.fq -fq R2.fq ) to get the fastqs.

I got 113,214,136 reads for each fastq file. No. of reads in the bam file also matches this as the output from samtools view -c $bam is 226,428,272. (2*113,214,136).

samtools flagstat $bam

226428272 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
223692746 + 0 mapped (98.79%:-nan%)
226428272 + 0 paired in sequencing
113214136 + 0 read1
113214136 + 0 read2
223692746 + 0 properly paired (98.79%:-nan%)
223692746 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

However, this is strange because I have the fastqc files and the STAR log of these runs and both showed no. of input reads as 107,979,993. The STAR *.log.final.out is as --

Number of input reads        107,979,993
Average input read length       150
Uniquely mapped reads number        103,168,845
Uniquely mapped reads %         95.54%
Number of reads mapped to multiple loci          3,443,385
% of reads mapped to multiple loci      3.19%
Number of reads mapped to too many loci          36,808 
% of reads mapped to too many loci      0.03%
% of reads unmapped: too many mismatches        0.00%
% of reads unmapped: too short      1.16%
% of reads unmapped: other      0.07%

I can't figure what I am I missing here that no of input reads is less than the no. of reads in the BAM file and regenerated fastq ?!

Thanks!

RNA-Seq alignment • 2.8k views

ADD COMMENT • link updated 7.0 years ago by lakhujanivijay 5.9k • written 7.0 years ago by aditi.qamra ▴ 270

0

Entering edit mode

Hello aditi.qamra,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

Thank you!

ADD REPLY • link 7.0 years ago by lakhujanivijay 5.9k

0

Entering edit mode

Thanks Vijay! Sorry I dint know about this option before. Will take care

ADD REPLY • link 7.0 years ago by aditi.qamra ▴ 270

score 1 · Answer 1 · 2018-07-07

Probably the extra reads are due to multi-mappers or other secondary alignments being pulled from the bam more than once. If this is undesired, you have to filter the bam to contain only primary alignments, or you may use picard SamToFastq, which by default includes only primary alignments into the output.

Also, did you sort the bam file by name before bedtools bamtofastq? The online documentation states:

When using this option, it is required that the BAM file is sorted/grouped by the read name. This keeps the resulting records in the two output FASTQ files in the same order. One can sort the BAM file by query name with samtools sort -n aln.bam aln.qsort.