Question

get a consensus out of a Sorted bam file

1

Entering edit mode

8.3 years ago

lait ▴ 180

I have an SRA file (paired end) with a total number of sequences equal to 4,177,734.

I need to map the reeds to the reference genome.

So what I did was the following:

1-converts SRA to fastq (one for each end)

2- I aligned the paired end reads separately using BWA,

bwa aln ref readset_R1.fq > readset_R1.sai

bwa aln ref readset_R2.fq > readset_R2.sai

3-then combined them

bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam

4- then I created the bam sorted file using the following commands

samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam

samtools sort readset_ref_bwa.bam readset_ref_bwa.sorted.bam

samtools index readset_ref_bwa.sorted.bam

Question 1

if I view the sorted bam file using tview, I would get around 48 reads sorted from longest to shortest. Sice I am new to this domain, is it logical that I got only 48 reads out of the original number of reads that was in the SRA file (4 million +) ?

Question 2

I need to get a consensus out of the sorted bam file. I have read dozens of posts about this but couldnt figure it out.

Many thanks in advance

bam sam NGS BWA • 3.2k views

ADD COMMENT • link 8.3 years ago by lait ▴ 180

0

Entering edit mode

Q1 : count the number of mapped reads : samtools view -c -F 4 in.bam Q2 : but couldnt figure it out. : what's the problem ?

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

thanks for Q1. Regarding Q2, how can this be done ? I wasnt lucky using mpileup. Do you know what is the best way to achieve this ?

ADD REPLY • link 8.3 years ago by lait ▴ 180

0

Entering edit mode

htsbox is marked as experimental by Heng Li (@lh3) but you could try it.

htsbox pileup -f ref.fa -Q20 -q30 -Fs3 sorted.bam

As long as you have the latest samtools and bcftools can you try this?

samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

I tried the second option, is it normal to get something like this at the end of the output file ?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~z~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~x~uuuxx~~~~~~a~~~~~~~}~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~{xxxxuuuuror~~~~~~~~~~~~~~~~~~~~~~x~~~~~~~~~~~~~rK?

ADD REPLY • link 8.3 years ago by lait ▴ 180

0

Entering edit mode

When I try the first option, I get a file that contains just the letter X , like the following:

gi|556550243|ref|NC_022648.1| nnnXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX etc

can anyone figure out the problem please ?

ADD REPLY • link 8.3 years ago by lait ▴ 180

0

Entering edit mode

Obvious question but let me confirm.

Is everything matching in this case as far as the reference? Same fasta genome used for indexing, alignments and subsequent consensus calling? Also using the genbank headers as is may be a bad idea since those pipes in the fasta headers cause issues with various things.

Edit: Can you provide alignment stats (samtools flagstat your.bam)? You are referring to 48 sequences in your original post but not clear what that reference is (48 reads is all you may be seeing in one terminal screen at a time, there has got to be tons more read aligned).

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

yes you are right, way more than 48 sequences

this is what I got:

4177734 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

3841885 + 0 mapped (91.96% : N/A)

4177734 + 0 paired in sequencing

2088867 + 0 read1

2088867 + 0 read2

3804094 + 0 properly paired (91.06% : N/A)

3820194 + 0 with itself and mate mapped

21691 + 0 singletons (0.52% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

and yes, same fasta genome used.

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by lait ▴ 180