Question

Difference in alignment summary from bam/sam and bed

0

Entering edit mode

7.7 years ago

viv3kanand ▴ 10

Hi,

I was trying to align an RNA-Seq data (paired-end) using bwa. For the alignment summary I used,

samtools flagstat alignment.bam
bamtools stats -in alignment.bam

And both gave me similar results (For example number of reads in pair 1).

Result from Samtools
119825 + 0 read1

Result from bamtools
Read 1:            119825

But when I tried to double check the results using converted bed file using awk, I get a different result.

bamToBed -i alignment.bam > alignment.bed

awk -F "\t" '{print $4}' alignment.bed |awk -F "/" '{if($2==1) print $1}'|sort|uniq|wc -l
119532

Similarly all results are different when I check from bed file. What could be the possible reason for this?

RNA-Seq alignment awk • 5.7k views

ADD COMMENT • link 7.7 years ago by viv3kanand ▴ 10

0

Entering edit mode

You have to provide the conversion command from bam to bed.

ADD REPLY • link 7.7 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Added conversion command.

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10

0

Entering edit mode

I have no ready explanation, but you could try to see if your awk pipe is working properly by reconverting the bed file to to bam with bedtools bedtobam:

http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html

After re-conversion, check again with flagstat. If you get the same number in the new bam file as the awk pipeline on the bed file then this means that that pipe is for some reason not working 100% correctly. It might be: sam format is highly informative and information-dense, while bed format is less. Therefore you might be losing some info that samtools is using to determine the mapped reads.

ADD REPLY • link 7.7 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

To convert from bed to bam I need the genome file right? I was using a viral reference fasta file. I'm not sure how I would create a genome file from that. eg:

>DENV_env
AATGGAGGTTCTGCTTCTATGTTGACTGGGCTATCTTTTTCTGTCACAATTGGGTTGACTGTAATCAGGCGACCTAAGAC
ATGTCTTTTTTCCAAATCCATTATCTCAAAAGGGATCTTGCATGGAGAGCCGTCCCCTTCATATTGCACTCTGATAACTA
TTGTTCCATGTTGTGTTTCTGCTATTTCCTTCACAACTTTAAACTTTCCTGTGCACATAGAGTATGACATTCCTTTGAGC
TGTAGCTTGTCCATTCTCAGCCTGCACTTGAGATGTCCTGTGAAGAGTAAGTTTCCTGATGACATTTGGATTTCTGTGGC
CCCTGTAAGTGCTGTGTGCATGGCCCCTTCTTGGGATCCTAAAACAACAACATCCTGTTTCTTCGCATGGGGATTTTTGA
AAGTGACCAATGTCTCTTTCTGTATCCAATTTGACCCTTGTGTGTCCGCTCCGGGCAACCATGGTAACGGCAGGTCTAGG
AACCATTGCCTGTGCACCAG

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10

1

Entering edit mode

I've figured out to how create genome file. Thanks

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10

0

Entering edit mode

As you said, there is something wrong with the summary now,

Samtools flagstat alignment_converted.bam
214107 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
214107 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

And with bamtools

bamtools stats -in alignment_converted.bam
Total reads:       214107
Mapped reads:      214107   (100%)
Forward strand:    99908    (46.6627%)
Reverse strand:    114199   (53.3373%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads:  0    (0%)

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10

0

Entering edit mode

samtools flagstat will always be correct if the BAM file itself is correct, why are you bothering to "double check" that? You just have to ask yourself whether it makes sense to figure out why awk with a BED file is producing the wrong results (you probably have something better to do).

ADD REPLY • link 7.7 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon,

You are correct. It was not just about double checking the results. I was trying myself to traceback read IDs (of both pairs) from the mapped reads. So that I can check with the barcode list and get to know how many reads mapped to genome has barcode in it. I thought bed file would do it. But it turns out to be giving me wrong results. Would you please suggest me a better option to do this using sam or bam file? (Specifically, is there a flag in sam or bam file to detect pair 1 and pair 2)

PS: I wasn't able to send respond yesterday. Forum was redirecting the page to home whenever I tried.

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10

0

Entering edit mode

I guess it's not clear to me exactly what your were trying to do. Usually there's one barcode per sample, so everything has the same one.

Anyway, yes, there's a flag for read 1 (64) and another for read 2 (128).

ADD REPLY • link 7.7 years ago by Devon Ryan 105k

0

Entering edit mode

I was looking for flag. Thank you Devon.

ADD REPLY • link 7.7 years ago by viv3kanand ▴ 10