Hi,
One of our collaborators lab recently isolated virus samples from patients. I received the sequenced reads from MiSeq for those virus samples. Then I investigated the SNVs present in the samples. Now I am interested in creating a phylogenetic tree for Neuraminidase gene with the other NA sequence in the GenBank database. Therefore, I would like to know
1) how to generate FASTA sequence of the NA gene from the MiSeq sequenced reads?
2) What are the factors to be considered while generating a phylogenetic tree for viruses? because viruses mutates at a much faster rate compared to prokaryotes.
3) Any tools recommended for performing the phylogenetic tree.
Thanks in advance.
Do you have the alignment files for these samples already? See this past thread for some inspiration and thoughts: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing
Edit: It was you who I was talking with 2 years ago :-)
Yeah :)! I forgot about it.
I am planning to generate the phylogenetic tree from the NGS reads. Would you recommend any interesting articles and tools to explore?
Issues mentioned in that last thread still remain valid. If you create a consensus sequence then you may lose interesting variation but making a phylogenetic tree from individual sequences does not make a lot of sense.
You could try to assemble the viral genomes using
tadpole.sh
from BBMap suite to see if are able to use the assemblies to fish out the NA gene and then do trees with those sequences.Hi Genomax,
I tried the following steps to get the FASTA sequence from NGS reads.
1) bin/bamtools filter -region gi\|758899355\|ref\|NC_026438.1\|:1..1700 -in Sample2.sorted.bam -out Sample2.NC_026438.sorted.bam
2) bin/bamtools convert -format fasta -in Sample2.NC_026438.sorted.bam -out Sample2.NC_026438.sorted.fasta
3) In the fasta file, I found 35,072 reads matching the above region.
4) bbmap/bbmap_37.02/tadpole.sh in=Sample2.NC_026438.sorted.fasta out=Sample2.NC_026438.sorted.contig
5) I got 5 contigs as the output,
If you have the fasta sequences you could try doing an MSA but be aware that the reads could have bad regions (unless you had scanned/trimmed you reads before alignments).
Those contigs from tadpole seem rather small. There should be no need to merge the contigs since all you need to figure out is which of those contigs contain the NA gene.
Is this targeted sequencing by any chance?
Yes, I have trimmed them. But I have 35K FASTA sequence for 35,072 reads (mapping to gi|758899355|ref|NC_026438.1|:1..1700). In the past, for bacteria, I used gene sequence from multiple strains and generated phylogenetic tree. In this case, I have 35K FASTA sequence. So I am confused.
NC_026438.1 this gene sequence length is 1700, but contigs are smaller in length. So do I have only partial gene sequence in those contigs?
Yes, this is targeted sequencing.
This was important information. So those 35K reads represent just one strain you are working with? You could make MSA from those reads to use for the phylogenetic analysis. But then you could have made that consensus from your BAM file by using samtools/bcftools directly (method in the old thread).
It is possible that you may need to tweak
tadpole.sh
analysis (looks like you used defaults) to get the assembly to complete. How many reads were left over after the assembly? But before you do anything else you should align the five contigs to NC_026438 to see how and where they align and if the assembly looks reasonable. Was there any indication in your BAM that the gene was not completely covered (or if that was by design then there is nothing you can do but use what you have).Hi Genomax,
As suggested, I used samtools
I have pasted the contents from the cns.fq. Some bases are in upper case and some in lower case and there are n's.
What do they mean?
That is not a fastq format file. What do you get with just the first part of mpileup command?
Sorry, I pasted just the sequence part alone in the previous thread. This is what I got in the cns.fq output file.
@gi|758899355|ref|NC_026433.1|
SEQUENCE IS BIGGER SO I DINT PASTE IT AGAIN
+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Z????!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!???????????????????????????? ???????????????????????????????EEEEEEEEEEEEEEEEEEEEEEEEEEH HHHHHHHHHHHHKKKKKKKKKKNNNNNNNNNNNNNNNNNNNNNNN@NNN NNNNNNNNNNNNNNNNKKKKKKKKKKKKKKKKKKKKKNNNNNNNNNNNNN NNNNNNNNNNNNNNNQNNNNNNNHHHHHHHHHHHHHHHHEHHBHHHHHH HHEEEEEEEEEEEEEEEEEEEEEBBBBBBBBBBBBBBBBBBBEEEEEEEEEKK KKKK&KKKKKKKKKKKKKKKKKKKKKKNNNNNNNTQQQQTTZZZ]]]]]]]]ccccccc cccc``
ccccccccfflllllllllllllffffffffffffffffffff
ZZZZ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~I took the sequence from this step and BLASTed it with the reference gene sequence (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=5F7TV6FT114).
free image host no sign up
Looks like you have a gap in your sequence of ~74 bp which is represented by those
n
.Hi,
I randomly checked it on different samples. In those samples, I am getting N's at the middle of the sequence.
Does it mean, we couldn't sequence that region or it is a deletion?
If there are absolutely no reads covering that sequence then you could have a deletion. Was the experiment supposed to recover the entire region? Won't it mean that you have a defective NA gene in all these samples?
Keep in mind that samtools by default uses a depth of 250 for mpileup. If you have a ton of reads in this region then you will have to increase that using
-d
option.