Reference Guided genome assembly
0
0
Entering edit mode
10 months ago

Hi,

I am performing reference guided genome assembly of dog genome. I am following a script and I aligned my genome sequenced reads to the canfam 3.1 genome using bowtie2, then converted the sam file to bam file and generated pileup file from this using bcftools. Everything went fine till here.

For generating consensus sequences I used the following script, I don't have the tabix to index the compressed vcf file so I went with bcftools index, do I need to use the tabix to index the vcf file, to generate consensus sequences. Someone please help me with this.

bcftools mpileup -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_alignments_sorted.bam > Chester_Genome_alignments_output.pileup
bcftools call -mv -Ob -o Chester_Genome_variants.bcf Chester_Genome_alignments_output.pileup
bcftools view Chester_Genome_variants.bcf -O v -o Chester_Genome_variants.vcf
gzip Chester_Genome_variants.vcf
bcftools index Chester_Genome_variants.vcf.gz #This step instead of using tabix I used bcftools index
bcftools consensus -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_variants.vcf.gz > Chester_genome_consensus_sequence.fa

I need tabix, Could you please help with how to install the tabix.

tabix • 1.4k views
ADD COMMENT
1
Entering edit mode

Why do you think you have to use tabix? Is there an error message? Also, what you are doing is not assembly but variant calling. The final output of your pipeline is the reference genome with all ALT alleles inserted. Just note that your pipeline doesn't include variant filtration and will produce very noisy data.

ADD REPLY
0
Entering edit mode

What happens when you run which tabix?

ADD REPLY
0
Entering edit mode

1. Indexing of the reference genome- The canfam 3.1

bowtie2-build GCF_000002285.3_CanFam3.1_genomic.fna Dog_Referencegenome_Index

2. Alignment to reference genome

bowtie2 -x Dog_Referencegenome_Index -1 Chester-gDNA paired_S12_L004_R1_001.fastq.gz -2 Chester-gDNA-paired_S12_L004_R2_001.fastq.gz  -S Chester_Genome_alignment.sam

3. Convert sam file to bam file:

samtools view -bS Chester_Genome_alignment.sam -o Chester_Genome_alignment.bam

4. Sorting and indexing BAM file with samtools

samtools sort  Chester_Genome_alignment.bam -o Chester_Genome_alignments_sorted.bam
samtools index Chester_Genome_alignments_sorted.bam

We can use the “bcftools mpileup” command to collapse the pileup of the reads aligned to the reference genomes in the BAM file and then to write only the variant information into a VCF file. In the following, the variant information is written into a BCF file and then that binary file is converted into a VCF file.

The VCF can be compressed using the Linux compression program “bgzip” and then the compressed file is indexed using tabix program, which is a tool for indexing large bioinformatics text files.

The final step in the reference-guided genome assembly is to create a consensus sequence by transferring the sequence reference genome, which was used to create the original BAM file, to the “bcftools consensus” command that utilizes the indexed variant call file to create a new genome sequence for the individual studied. (From The Book I mentioned)

5. Generating Consensus sequences – to form a reference genome for alignment of reads.

bcftools mpileup -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_alignments_sorted.bam > Chester_Genome_alignments_output.pileup
bcftools call -mv -Ob -o Chester_Genome_variants.bcf Chester_Genome_alignments_output.pileup
bcftools view Chester_Genome_variants.bcf -O v -o Chester_Genome_variants.vcf
gzip Chester_Genome_variants.vcf
bcftools index Chester_Genome_variants.vcf.gz
bcftools consensus -f GCF_000002285.3_CanFam3.1_genomic.fna Chester_Genome_variants.vcf.gz > Chester_genome_consensus_sequence.fa

I am following this script to perform reference guided genome assembly, from the book BIOINFORMATICS: A Practical Guide to NGS Data Analysis by Hamid D. Ismail.

The only modification I did was not using tabix for indexing of compressed vcf file instead I used the bcftools to index this file. My output consensus file is empty, I did not get any output.

Is there any other way to perform reference guided genome assembly? Please let me know.

ADD REPLY
0
Entering edit mode
  1. Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. I've moved your post to the right location this time, please be more careful in the future.
  2. Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or use one of (a) the option highlighted in the image below/ (b) fenced code blocks for multi-line code. Fenced code blocks are useful in syntax highlighting. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
    code_formatting

Respond to people instead of posting unrelated stuff.

ADD REPLY
0
Entering edit mode

I am very sorry. I am new to this page. I will make sure to not repeat this gain. Thanks.

ADD REPLY
0
Entering edit mode

Side note: Most (if not all) tools that write SAM can also write BAM. Combine bowtie and samtools sort into one step like so:

bowtie2 -x Dog_Referencegenome_Index -1 Chester-gDNA paired_S12_L004_R1_001.fastq.gz -2 Chester-gDNA-paired_S12_L004_R2_001.fastq.gz  | samtools sort -o Chester_Genome_alignments_sorted.bam -
ADD REPLY
0
Entering edit mode

In principle, this should work in that it should produce some output. You might want to check for errors or interrupted processes. The BAM, VCF files or the genome fasta file could be empty or truncated for example. I don't think it has anything to do with tabix vs. bcftools but if you want to test you can install it via BioConda I think.

This remains a questionable pipeline though: it is not an assembly, the file generated by bcftools albeit named _consensus.fasta is not a consensus, and the pipeline is not state-of-the-art either.

ADD REPLY
0
Entering edit mode

Thank you so much for your response Michael. The VCF file is not empty, I can see the data.

Do you have any other script to perform reference guided genome assembly.

ADD REPLY
1
Entering edit mode

That depends on your downstream application and what kind of input data you have. If you only need the variants, then use a state-of-the-art and highly accurate variant calling pipeline. For human data that would be DeepVariant (if you can install it) or the GATK best-practice workflows for anything else (dog).

On the other hand, if you want to do an assembly to get a new genome, using short reads-only assemblies is a waste of time these days. People may have different views, so if you just want to try it out you can use one of the short-read DeBruijn Graph assemblers (Velvet, IDBA, Abyss) to generate a de novo assembly. Then use tools like RagTag or Ragout to patch the genome given the reference. Use Quast and BUSCO to validate your assembly.

ADD REPLY
0
Entering edit mode

Thank you very much Michael. My work includes identifying cancer specific variants specifically linked with lymphoma in canines.

ADD REPLY
0
Entering edit mode

In this case I think you could use GATK Mutect2. Good luck!

ADD REPLY

Login before adding your answer.

Traffic: 1601 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6