Hi,
I am performing reference guided genome assembly of dog genome. I am following a script and I aligned my genome sequenced reads to the canfam 3.1 genome using bowtie2, then converted the sam file to bam file and generated pileup file from this using bcftools. Everything went fine till here.
For generating consensus sequences I used the following script, I don't have the tabix to index the compressed vcf file so I went with bcftools index, do I need to use the tabix to index the vcf file, to generate consensus sequences. Someone please help me with this.
bcftools mpileup -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_alignments_sorted.bam > Chester_Genome_alignments_output.pileup
bcftools call -mv -Ob -o Chester_Genome_variants.bcf Chester_Genome_alignments_output.pileup
bcftools view Chester_Genome_variants.bcf -O v -o Chester_Genome_variants.vcf
gzip Chester_Genome_variants.vcf
bcftools index Chester_Genome_variants.vcf.gz #This step instead of using tabix I used bcftools index
bcftools consensus -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_variants.vcf.gz > Chester_genome_consensus_sequence.fa
I need tabix, Could you please help with how to install the tabix.
Why do you think you have to use tabix? Is there an error message? Also, what you are doing is not assembly but variant calling. The final output of your pipeline is the reference genome with all ALT alleles inserted. Just note that your pipeline doesn't include variant filtration and will produce very noisy data.
What happens when you run
which tabix
?1. Indexing of the reference genome- The canfam 3.1
2. Alignment to reference genome
3. Convert sam file to bam file:
4. Sorting and indexing BAM file with samtools
We can use the “bcftools mpileup” command to collapse the pileup of the reads aligned to the reference genomes in the BAM file and then to write only the variant information into a VCF file. In the following, the variant information is written into a BCF file and then that binary file is converted into a VCF file.
The VCF can be compressed using the Linux compression program “bgzip” and then the compressed file is indexed using tabix program, which is a tool for indexing large bioinformatics text files.
The final step in the reference-guided genome assembly is to create a consensus sequence by transferring the sequence reference genome, which was used to create the original BAM file, to the “bcftools consensus” command that utilizes the indexed variant call file to create a new genome sequence for the individual studied. (From The Book I mentioned)
5. Generating Consensus sequences – to form a reference genome for alignment of reads.
I am following this script to perform reference guided genome assembly, from the book BIOINFORMATICS: A Practical Guide to NGS Data Analysis by Hamid D. Ismail.
The only modification I did was not using tabix for indexing of compressed vcf file instead I used the bcftools to index this file. My output consensus file is empty, I did not get any output.
Is there any other way to perform reference guided genome assembly? Please let me know.
Add Comment
orAdd Reply
as appropriate. I've moved your post to the right location this time, please be more careful in the future.code
option) to present your post better. You can use backticks for inline code (`text` becomestext
), or use one of (a) the option highlighted in the image below/ (b) fenced code blocks for multi-line code. Fenced code blocks are useful in syntax highlighting. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.Respond to people instead of posting unrelated stuff.
I am very sorry. I am new to this page. I will make sure to not repeat this gain. Thanks.
Side note: Most (if not all) tools that write SAM can also write BAM. Combine bowtie and samtools sort into one step like so:
In principle, this should work in that it should produce some output. You might want to check for errors or interrupted processes. The BAM, VCF files or the genome fasta file could be empty or truncated for example. I don't think it has anything to do with tabix vs. bcftools but if you want to test you can install it via BioConda I think.
This remains a questionable pipeline though: it is not an assembly, the file generated by bcftools albeit named _consensus.fasta is not a consensus, and the pipeline is not state-of-the-art either.
Thank you so much for your response Michael. The VCF file is not empty, I can see the data.
Do you have any other script to perform reference guided genome assembly.
That depends on your downstream application and what kind of input data you have. If you only need the variants, then use a state-of-the-art and highly accurate variant calling pipeline. For human data that would be DeepVariant (if you can install it) or the GATK best-practice workflows for anything else (dog).
On the other hand, if you want to do an assembly to get a new genome, using short reads-only assemblies is a waste of time these days. People may have different views, so if you just want to try it out you can use one of the short-read DeBruijn Graph assemblers (Velvet, IDBA, Abyss) to generate a de novo assembly. Then use tools like RagTag or Ragout to patch the genome given the reference. Use Quast and BUSCO to validate your assembly.
Thank you very much Michael. My work includes identifying cancer specific variants specifically linked with lymphoma in canines.
In this case I think you could use GATK Mutect2. Good luck!