(1) I have been working on a parasite genome assembly using BWA. I used the following command to run the assembly (paired-end Illumina short reads).
(2) I got the output AA_genome_aln-pe.sam, which is around 50 GB. I also tried to convert the sorted file to FASTA format using
samtools bam2fq AA_genome.srt.bam | seqtk seq -A > AA_genome_assembly.fa
However, the final output I got is 20 GB, while my expected assembly size is approximately 50 MB. How can I get the final assembly at the expected size? Is there still something I am missing in the analysis?
Is there still something I am missing in the analysis?
bwa is an NGS read aligner, not a genome assembler. If you want to assemble the data, you are using the wrong program. For a de novo assembly from your sequence data you should be using something like SOAPdenovo or SPAdes. (Do you only have FASTA-format data, or did you convert the FASTQ files?)
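As a rough illustration of the assembler route, a paired-end SPAdes run might look like the sketch below. The read file names and the output directory are placeholders, not taken from the original post, and the defaults may need tuning for your data.

```shell
# Hypothetical file names; replace with your own paired-end FASTQ files.
r1=AA_R1.fastq.gz
r2=AA_R2.fastq.gz
outdir=AA_spades_out

# Only run when SPAdes and both read files are actually present.
if command -v spades.py >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    # -1/-2 give the forward and reverse reads, -o the output directory.
    spades.py -1 "$r1" -2 "$r2" -o "$outdir"
else
    echo "spades.py or input reads not found; nothing was run" >&2
fi
```

SPAdes writes contigs.fasta and scaffolds.fasta into the output directory; those files are the assembly, and their size should be in the range of the genome size.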
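If you are aligning to a reference genome (which seems to be the case above), then the size of the alignment file has nothing to do with the size of the genome or assembly. It simply reflects the alignments found for your reads against the reference: the file holds at least one record per read, so it grows with sequencing depth. Likewise, samtools bam2fq followed by seqtk seq -A just writes those reads back out as FASTA, which is why that file is 20 GB rather than 50 MB.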
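A quick way to check what a FASTA file really contains is to count its sequences and bases. The helper below is only a sketch (the function name fasta_stats is made up for this example): millions of short records point to reads, while a genome assembly should total roughly the genome size.

```shell
# Print the number of records and the total number of bases in a FASTA file.
fasta_stats() {
    awk '/^>/ { n++; next }            # header line: count one record
         { bases += length($0) }       # sequence line: add its length
         END { printf "%d sequences, %d bases\n", n, bases }' "$1"
}

# Example call on the file from the question, if it exists.
if [ -f AA_genome_assembly.fa ]; then
    fasta_stats AA_genome_assembly.fa
fi
```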
You can generate a consensus sequence from the bwa alignment file (the consensus should be close in size to your reference). This thread will help with that: Generating consensus sequence from bam file
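One possible way to do this, which may differ from what the linked thread describes, is the consensus subcommand in recent samtools releases (1.16 or later). The sketch below assumes the sorted BAM from the question; the output file name is a placeholder.

```shell
bam=AA_genome.srt.bam            # sorted BAM from the question
out=AA_genome_consensus.fa       # hypothetical output name

# Only run when samtools and the BAM file are present.
if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # -f fasta selects FASTA output, -o names the output file.
    samtools consensus -f fasta -o "$out" "$bam"
else
    echo "samtools or $bam not found; nothing was run" >&2
fi
```

The result has one sequence per reference contig, so its size tracks the reference rather than the number of reads.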
Hi all,
Thank you for your suggestion. I have the same question as @kamathshreya70 and managed to assemble the genomes using SPAdes. My second question is: after getting the assemblies, how should I check their identity and do the species identification? Since my assembly is a multi-FASTA file (containing more than 30k scaffolds), will BLASTN work? Or are there any bioinformatics tools you would recommend?
I look forward to your reply. Thank you.
Identity against what? There are many options for comparing the similarity between assemblies at genome scale; I would recommend MUMmer. To assess the completeness of your assembly, you can try BUSCO.
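For example, both tools could be run on a SPAdes scaffold file roughly as follows. This is a sketch under assumptions: the assembly path, the reference genome, and the BUSCO lineage are placeholders, and the lineage in particular should be the dataset closest to your organism.

```shell
assembly=AA_spades_out/scaffolds.fasta   # assumed SPAdes output path
reference=related_species.fa             # hypothetical reference genome
lineage=eukaryota_odb10                  # assumed lineage dataset

# Genome-scale comparison against the reference with MUMmer's dnadiff wrapper.
if command -v dnadiff >/dev/null 2>&1 && [ -f "$assembly" ] && [ -f "$reference" ]; then
    dnadiff -p AA_vs_ref "$reference" "$assembly"   # summary in AA_vs_ref.report
else
    echo "dnadiff or input files not found; skipping comparison" >&2
fi

# Completeness estimate with BUSCO in genome mode.
if command -v busco >/dev/null 2>&1 && [ -f "$assembly" ]; then
    busco -i "$assembly" -l "$lineage" -m genome -o AA_busco
else
    echo "busco or assembly not found; skipping completeness check" >&2
fi
```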