I have an SRA file (paired end) with a total number of sequences equal to 4,177,734.
I need to map the reeds to the reference genome.
So what I did was the following:
1-converts SRA to fastq (one for each end)
2- I aligned the paired end reads separately using BWA,
bwa aln ref readset_R1.fq > readset_R1.sai
bwa aln ref readset_R2.fq > readset_R2.sai
3-then combined them
bwa sampe ref readset_R1.sai readset_R2.sai readset_clean_R1.fq readset_clean_R2.fq > readset_ref_bwa.sam
4- then I created the bam sorted file using the following commands
samtools view -S readset_ref_bwa.sam -b -o readset_ref_bwa.bam
samtools sort readset_ref_bwa.bam readset_ref_bwa.sorted.bam
samtools index readset_ref_bwa.sorted.bam
Question 1
if I view the sorted bam file using tview, I would get around 48 reads sorted from longest to shortest. Sice I am new to this domain, is it logical that I got only 48 reads out of the original number of reads that was in the SRA file (4 million +) ?
Question 2
I need to get a consensus out of the sorted bam file. I have read dozens of posts about this but couldnt figure it out.
Many thanks in advance
Q1 : count the number of mapped reads :
samtools view -c -F 4 in.bam
Q2 :but couldnt figure it out.
: what's the problem ?thanks for Q1. Regarding Q2, how can this be done ? I wasnt lucky using mpileup. Do you know what is the best way to achieve this ?
htsbox is marked as experimental by Heng Li (@lh3) but you could try it.
As long as you have the latest samtools and bcftools can you try this?
I tried the second option, is it normal to get something like this at the end of the output file ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~z~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~x~uu
ux
x~~~~~~a~~~~~~~}~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~{xxxxuuuuror~~~~~~~~~~~~~~~~~~~~~~x~~~~~~~~~~~~~rK?When I try the first option, I get a file that contains just the letter X , like the following:
can anyone figure out the problem please ?
Obvious question but let me confirm.
Is everything matching in this case as far as the reference? Same fasta genome used for indexing, alignments and subsequent consensus calling? Also using the genbank headers as is may be a bad idea since those pipes in the fasta headers cause issues with various things.
Edit: Can you provide alignment stats (
samtools flagstat your.bam
)? You are referring to 48 sequences in your original post but not clear what that reference is (48 reads is all you may be seeing in one terminal screen at a time, there has got to be tons more read aligned).yes you are right, way more than 48 sequences
this is what I got:
and yes, same fasta genome used.