bwa-mem, from .sam/.bam to fasta file
1
0
Entering edit mode
6.4 years ago
ieie ▴ 10

Dear all, I am trying to do read mapping to construct cp genomes using a reference genome and performing the mapping with bwa-mem.

I am running this command bwa mem -M -B 4 $REF $R1 $R2 > output.bam

-M to mark shorter split as secondary and -B 4 for mismatch penalty. I have two questions:

1)after I have run bwa and my reference genome fasta file is 160kb and and my fastq files are 2.70 GB should I get a bam file of 5 GB or should it be smaller?

2) when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

I hope my question are not too vague or basic. thanks a lot

bwa samtools • 4.6k views
ADD COMMENT
1
Entering edit mode

The size of the bam depends on how many reads mapped to the reference.

Also, be aware that bwa mem outputs a SAM file by default, not a BAM.

ADD REPLY
0
Entering edit mode

The size of the bam depends on how many reads mapped to the reference

That is not true. All reads will be at the SAM/BAM file, they simply will be flagged as unmapped. Still, default (and only) output of BWA mem is SAM not BAM. Simply count reads in one of the fastq files (wc -l fastq.fastq), divide this by 4, and compare this with samtools flagstat output.sam

ADD REPLY
0
Entering edit mode

thanks for your answer! I guess my main issue is that after obtaining the output.sam files of my series of reads.fastq I would like to end with a fasta multi alignment file containing the constructed cp genomes and I have looked for a pipeline to do that but with no useful solutions. so I after performing the bwa-mem I can't picture the further steps.

ADD REPLY
1
Entering edit mode

Once you obtain the alignments (do the SAM/BAM conversion etc.) you will need to call variants and then generate the consensus. This process is described on this page.

That said if you have access to Geneious licenses then why don't you do this entire thing in Geneious. Normally people won't recommend commercial products but if you are new to command line it may be much easier to get this done in Geneious.

ADD REPLY
0
Entering edit mode

hi I am doing the sorting of the bam file through:

samtools sort 103308_1116_S42_L007_R2_001.bam > 103308_1116_S42_L007_R2_001sorted.bam

but I keep getting this output:

head 103308_1116_S42_L007_R2_001sorted.bam

?BC?EN? ?0???~b????i?????%?:f?R? A??O6{??p??8?&?>?S?Sl?r??ԭ??X?~?0?v????/?u?_96???BCB=?}?d?uW?xw]?K?jʗ?o???뭊g???RTO?]?3f?t??C??c'?(xE?'$??tHA???D?????? EB|#"!?#AP??E> 8?s????V??[]?w?^????w?s?_~?^??^?????Q?7??????hDe6??r???G?gg??:??uP?ʸLfQ??I?G_5????B?wF?_?׫?jr???|????5?,????x?o??V????z?Z-????w?????xy???͏?????uW??G9??q|OM[/_??>??n??.?????G???'?0???^???/??7??s?>y??W?x|?fI???>?Qi? N??a??A?????´ ??}??|? z??w4???=??Fa9 ?(???4L{o??? ??w?4?'g?&L??B??P??C?du??P?G[???D???:e????q?2u.5u?@?o????qףu???R?;I?G?%? ????f?+?)?S?qp?^:??8?ԡwj?}??|GM]?-:dS?)N?????#??@T)k?$|d~????O[?~?????u?5?gi??"?^b?? *?M?Dy@?fT???9>7G???=?pr?H?3?n ??? S?O?÷x?C?2???>A????F?<l r????5y?%??4zr??="" ?^??????,??t?gyth??????jЯe?="" ???="" anybody="" knows="" why?="" i="" though="" you="" need="" to="" sort="" before="" doing="" the="" variant="" calling="" and="" getting="" the="" consensus="" sequence="" but="" this="" look="" really="" weird.<="" p="">

ADD REPLY
1
Entering edit mode

Of course. BAM is a binary (compressed) version of the SAM format. The default output of BWA is SAM.

bwa mem (options...) > out.sam
samtools sort out.sam -o out_sorted.bam

Or more elegant:

bwa mem (...) | samtools sort -o out_sorted.bam
ADD REPLY
0
Entering edit mode

Right! I run the bwa mem and I got the SAM file, then I thought to convert it in BAM before sorting, thanks a lot!

ADD REPLY
1
Entering edit mode

Also, if you wish to visualise the content of a BAM file you can use "samtools view out.bam"

ADD REPLY
0
Entering edit mode

about variant calling, after sorting I followed the pipeline of the page you gave me and I get a (consensus sequence with the reference genome header, don't know why) and when in the end I run :

cat ~/DNA-Mark/Guibourtia_leonensis_Cp.fas | bcftools consensus calls.vcf.gz > consensus.fa

I get this messages for different sites The site MG564755.1:122 overlaps with another variant, skipping...do you know whether is it an error and if yes what kind of error might be?

ADD REPLY
0
Entering edit mode

Yes, you're right (I was, wrongly, thinking about STAR, which produces by default a sam which contains only the aligned read)

ADD REPLY
0
Entering edit mode

Really, STAR does that? Ok good to know, that means I will never use aligned BAM files from STAR to get back the original reads in fastq format. Thanks for letting me know :-D

ADD REPLY
0
Entering edit mode

With STAR you can use the "--outReadsUnmapped" option to get the unmapped reads in a separate bam. I guess that you can get the original fastq by using both the "Aligned.out.sam" and the "unmapped.bam"

ADD REPLY
2
Entering edit mode
6.4 years ago
GenoMax 147k

For #1 there is no way to predict size of the BAM files (file sizes are never a good measure of anything other than a run failure).

when I try to convert the bam file into a fasta using samtools is there a way to have the reference genome on the top and then the reads? how should the fasta file look like in order to then use it to annotate the genomes for example in Geneious?

Fasta format is a single sequence by definition (for what it is worth). What you are referring to would be an aligned fasta format file, which is not something samtools is going to produce. @Pierre has a tool called SAM4WebLogo that will produce aligned fasta files. I am mentioning it here but it is probably not what you have in mind.

There are other ways of calling a consensus sequence using bcftools which will give you a single fasta sequence. That could be used for annotating in downstream analysis.

ADD COMMENT
0
Entering edit mode

thanks a lot for your answer, really useful!

ADD REPLY

Login before adding your answer.

Traffic: 1036 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6