Question

How to generate FASTA sequence of genes and phylogenetic analysis from NGS sequenced reads?

0

Entering edit mode

7.0 years ago

bioinforesearchquestions ▴ 370

Hi,

One of our collaborators lab recently isolated virus samples from patients. I received the sequenced reads from MiSeq for those virus samples. Then I investigated the SNVs present in the samples. Now I am interested in creating a phylogenetic tree for Neuraminidase gene with the other NA sequence in the GenBank database. Therefore, I would like to know

1) how to generate FASTA sequence of the NA gene from the MiSeq sequenced reads?

2) What are the factors to be considered while generating a phylogenetic tree for viruses? because viruses mutates at a much faster rate compared to prokaryotes.

3) Any tools recommended for performing the phylogenetic tree.

Thanks in advance.

DNAseq FASTA Phylogenetic analysis NGS • 2.9k views

ADD COMMENT • link updated 6.9 years ago by Biostar 20 • written 7.0 years ago by bioinforesearchquestions ▴ 370

1

Entering edit mode

I investigated the SNVs present in the samples

Do you have the alignment files for these samples already? See this past thread for some inspiration and thoughts: How to extract fasta sequence of a gene from BAM files generated using Miseq sequencing

Edit: It was you who I was talking with 2 years ago :-)

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Yeah :)! I forgot about it.

I am planning to generate the phylogenetic tree from the NGS reads. Would you recommend any interesting articles and tools to explore?

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Issues mentioned in that last thread still remain valid. If you create a consensus sequence then you may lose interesting variation but making a phylogenetic tree from individual sequences does not make a lot of sense.

You could try to assemble the viral genomes using tadpole.sh from BBMap suite to see if are able to use the assemblies to fish out the NA gene and then do trees with those sequences.

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Hi Genomax,

I tried the following steps to get the FASTA sequence from NGS reads.

1) bin/bamtools filter -region gi\|758899355\|ref\|NC_026438.1\|:1..1700 -in Sample2.sorted.bam -out Sample2.NC_026438.sorted.bam

2) bin/bamtools convert -format fasta -in Sample2.NC_026438.sorted.bam -out Sample2.NC_026438.sorted.fasta

3) In the fasta file, I found 35,072 reads matching the above region.

4) bbmap/bbmap_37.02/tadpole.sh in=Sample2.NC_026438.sorted.fasta out=Sample2.NC_026438.sorted.contig

5) I got 5 contigs as the output,

contig_1,length=178,cov=8182.4,gc=0.416

contig_2,length=454,cov=4072.5,gc=0.374

contig_3,length=125,cov=191.4,gc=0.368

contig_4,length=443,cov=161.0,gc=0.433

contig_5,length=264,cov=4.8,gc=0.436

Should I merge all these contigs as one scaffold?

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

If you have the fasta sequences you could try doing an MSA but be aware that the reads could have bad regions (unless you had scanned/trimmed you reads before alignments).

Those contigs from tadpole seem rather small. There should be no need to merge the contigs since all you need to figure out is which of those contigs contain the NA gene.

Is this targeted sequencing by any chance?

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Yes, I have trimmed them. But I have 35K FASTA sequence for 35,072 reads (mapping to gi|758899355|ref|NC_026438.1|:1..1700). In the past, for bacteria, I used gene sequence from multiple strains and generated phylogenetic tree. In this case, I have 35K FASTA sequence. So I am confused.

NC_026438.1 this gene sequence length is 1700, but contigs are smaller in length. So do I have only partial gene sequence in those contigs?

Yes, this is targeted sequencing.

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Yes, this is targeted sequencing.

This was important information. So those 35K reads represent just one strain you are working with? You could make MSA from those reads to use for the phylogenetic analysis. But then you could have made that consensus from your BAM file by using samtools/bcftools directly (method in the old thread).

It is possible that you may need to tweak tadpole.sh analysis (looks like you used defaults) to get the assembly to complete. How many reads were left over after the assembly? But before you do anything else you should align the five contigs to NC_026438 to see how and where they align and if the assembly looks reasonable. Was there any indication in your BAM that the gene was not completely covered (or if that was by design then there is nothing you can do but use what you have).

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Hi Genomax,

As suggested, I used samtools

samtools mpileup -uf reference.fa Sample2.NC_026433.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

I have pasted the contents from the cns.fq. Some bases are in upper case and some in lower case and there are n's.

What do they mean?

@gi|758899355|ref|NC_026433.1|
ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAAATGCAGACACATTA
TGTATAGGTTATCATGCGAACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAAT
GTAACAGTAACACACTCTGTTAACCTTCTAGAAGACAAGCATAACGGGAAACTATGCAAA
CTAAGAGGGGTAGCCCCATTGCATTTGGGTAAATGCAACATTGCTGGCTGGATCCTGGGA
AATCCAGAGTGTGAATCACTCTCCACAGCAAGTTCATGGTCCTACATTGTGGAAACATCT
AGTTCAGACAATGGAACGTGTTACCCAGGAGATTTCATCAATTATGAGGAGCTAAGAGAG
CAATTGAGCTCAGTGTCATCATTTGAAAGGTTTGAGATATTCCCCAAGACAAGTTCATGG
CCCAATCATGACTCGAACAAAGGTGTAACGGCAGCATGTCCTCACGCTGGAGCAAAAAGC
TTCTACAAAAATTTAATATGGCTAGTTAAAAAAGGAAATTCATACCCAAGGctcagtcaa
tcCTACATTAATGATAAAGGGAAAGAAGTCCTCGTGCTGTGGGGCATTCACCATCcatcn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnagaagttcaagccggaaatagcaataagacccaaagtgagggatcaa
gaagggagaatgAACTATTACTGGACACTAGTAGAGCCGGGAGACAAAATAACATTCGAA
GCAACTGGAAATCTAGTGGTACCGRGATATGCATTCACAATGGAAAGAAATGCTGGATCT
GGTATTATCATTTCAGATACACCAGTCCACGATTGCAATACAACTTGTCAGACACCCGAG
GGTGCTATAAACACCAGCCTCCCATTTCAGaatatacatccgatcacaaTTGGAAAATGT
CCAAAGTATGTAAAAAGCACAAAATTGAGACTGGCCACAGGATTGAGGAATGTCCCGTCT
ATTCAATCTAGAGGCCTATTCGGGGCCATTGCCGGCTTCATTGAAGGGGGGTGGACAGGG
ATGGTAGATGGATGGTACGGTTATCACCATCAAAATGAGCAGGGGTCAGGATATGCAGCC
GACCTGAAGAGCACACAAAATGCCATTGACAAGATTACTAACAAAGTAAATTCTGTTATT
GAAAAGATGAATACACAGTTCACAGCAGTGGGTAAGGAGTTCAACCACCTGGAAAAAAGA
ATAGAGAATTTAAATAAAAAAGTTGATGATGGTTTCCTGGACATTTGGACTTACAATGCC
GAACTGTTGGTTCTATTGGAAAATGAAAGAACTTTGGACTACCACGATTCAAATGTGAAG
AACTTGTATGAAAAGGTAAGAAACCAGCTAAAAAACAATGCCAAGGAAATTGGAAACGGC
TGCTTTGAATTTTACCACAAATGCGATAACACGTGCATGGAAAGTGTCAAAAATGGGACT
TATGACTACCCAAAATACTCAGAGGAAGCAAAATTAAACAGAGAAAAAATAGATGGGGTA
AAGCTGGAATCAACAAGGATTTACCAGATTTTGGCGATCTATTCAACTGTCGCCAGTTCA
TTGGTACTGGTAGTCTCCCTGGGGGCAATCAGCTTCTGGATGTGCTCTAATGGGTCTCTA
CAGTGTAGAATATGTATTTAA

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

That is not a fastq format file. What do you get with just the first part of mpileup command?

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Sorry, I pasted just the sequence part alone in the previous thread. This is what I got in the cns.fq output file.

@gi|758899355|ref|NC_026433.1|

SEQUENCE IS BIGGER SO I DINT PASTE IT AGAIN

+

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Z????!!! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!???????????????????????????? ???????????????????????????????EEEEEEEEEEEEEEEEEEEEEEEEEEH HHHHHHHHHHHHKKKKKKKKKKNNNNNNNNNNNNNNNNNNNNNNN@NNN NNNNNNNNNNNNNNNNKKKKKKKKKKKKKKKKKKKKKNNNNNNNNNNNNN NNNNNNNNNNNNNNNQNNNNNNNHHHHHHHHHHHHHHHHEHHBHHHHHH HHEEEEEEEEEEEEEEEEEEEEEBBBBBBBBBBBBBBBBBBBEEEEEEEEEKK KKKK&KKKKKKKKKKKKKKKKKKKKKKNNNNNNNTQQQQTTZZZ]]]]]]]]ccccccc cccc``ccccccccfflllllllllllllffffffffffffffffffffZZZZ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~

I took the sequence from this step and BLASTed it with the reference gene sequence (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=5F7TV6FT114).

free image host no sign up

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Looks like you have a gap in your sequence of ~74 bp which is represented by those n.

ADD REPLY • link 7.0 years ago by GenoMax 148k

0

Entering edit mode

Hi,

I randomly checked it on different samples. In those samples, I am getting N's at the middle of the sequence.

Does it mean, we couldn't sequence that region or it is a deletion?

ADD REPLY • link 7.0 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

If there are absolutely no reads covering that sequence then you could have a deletion. Was the experiment supposed to recover the entire region? Won't it mean that you have a defective NA gene in all these samples?

Keep in mind that samtools by default uses a depth of 250 for mpileup. If you have a ton of reads in this region then you will have to increase that using -d option.

ADD REPLY • link 7.0 years ago by GenoMax 148k