Question

Retrieve many FASTA sequences of the same gene from NCBI

0

5.7 years ago

uwardgma • 0

Hello! I am quite new to bioinformatics, please help me!

How do I retrieve nucleotide sequences of the same gene from many species in FASTA format from NCBI? Say I want the nucleotide sequences of a gene called 'lef-5' from all baculovirus species in the database (i.e. I want to align sequences of homologous genes) without having to pick them out manually. What do I have to do?

When I tried to retrieve them from the 'Gene' database, I could not download them as FASTA. When I tried the 'Nucleotide' database, I could download them, but what I got were the complete genomes of the viruses that carry the gene.

Thanks in advance for your help.

gene sequence fasta • 1.9k views

ADD COMMENT • link updated 5.7 years ago by GenoMax 147k • written 5.7 years ago by uwardgma • 0

1

Typically, that'd be a task I'd do via HomoloGene, but that's limited to eukaryotes -- the main reason I'm mentioning this here is because you may want to add the info that you're looking for viral/bacterial genes to the header of your question.

ADD REPLY • link 5.7 years ago by Friederike 9.0k

0

First, paste your query sequence in FASTA format into NCBI BLASTN (for nucleotide) or BLASTP (for protein): https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome

FASTA format:

>Species/gene/anyname
tgtgggtggtgtcggtggtggtgcgtggtcgtgggtaaagtgatag
gaccattagatgatgatggaaaaaaaaaaaaaatgatgtagta

Then choose one of the "Optimize for" options: Highly similar sequences (megablast).

Otherwise, if you also want to retrieve more distantly related sequences, choose More dissimilar sequences (discontiguous megablast) or Somewhat similar sequences (blastn).

Click BLASTN to run your query against the database sequences. Once you have the BLASTN output, click "All" to download every hit, or tick individual sequences if you want to select them manually.

The important step for downloading homologous sequences is to choose "FASTA (aligned sequences)" from the download option (floppy icon) in the top row of the alignment results. If you do not, the default "FASTA (complete sequence)" downloads the whole record, which may be a contig or even a whole genome.

Download the FASTA output as a .txt file and paste the sequences into a Word file to check that the FASTA records are listed correctly. Then use ClustalX with the FASTA .txt file as input to align the sequences and convert them to aligned FASTA, Phylip format, etc. I hope this suggestion helps.

ADD REPLY • link 5.7 years ago by pltbiotech_tkarthi ▴ 180

3

Again, this does not answer the toplevel question like in most answers you gave previously. OP is asking how to get a nucleotide sequence, and your answer starts with "paste your query sequence". Why are you doing that?

Minor allele frequency calculation -- OP asked for the principle of AF calculation, you link tools to do that.

databases for any specific gene -- OP asks for a database to look up mutations of a gene, you provide a link to a tool that annotates variants a user inputs.

Converting p value to -log10(p value) to plot the graph -- OP wants to know how to calculate logged p-values, you link a paper on NCBI and a website that contains functions to calculate p-values, but not one for log(p).

How to generate a mutated DNA sequence? -- OP asks for a way to generate mutated gene sequences based on existing VCFs, you link a tool that annotates variants and share a link on how to visualize allele frequencies.

Allele frequency visualization -- OP wants to visualize allele frequencies, you link a variant annotator again.

It can happen that one misses the point of a question here and there, but you appear to provide answers and comments unrelated to the toplevel question on a consistent basis. I have to ask you to stop that. Right now, this can be interpreted (my opinion personally, not a consensus opinion of the moderators here, just to be clear) as spam. Note that I removed some of your answers/comments where they obviously had nothing to do with the toplevel question so the above links are not active anymore.

uwardgma , sorry to hijack this thread with my comment, it will be cleaned up once this issue is solved.

ADD REPLY • link 5.7 years ago by ATpoint 85k

0

Hi Pierre Lindenbaum (Administrator), I don't understand why questions should not be answered in a simple way that the person asking can easily follow. Just please go through my answer, is it not relevant to the above question? Kindly consult with moderator ATpoint to avoid contradictory points. Why does ATpoint say or consider that this is spam? Thanks

ADD REPLY • link 5.7 years ago by pltbiotech_tkarthi ▴ 180

0

, is it not relevant to the above question?

It's not; see ATpoint's comments.

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 164k

0

What ATpoint alludes to is that your answers tend to go off on tangents that are not relevant here: "Click BLASTN, enter query sequence, paste into Word", etc.

Instead, your answers should be more concept-oriented. Here, for example, what you are trying to suggest is that one could use BLAST to search for similar sequences, though you should point out that in general there is no guarantee that only sequences from genes with the same name will be returned as hits. In addition, what the OP wants are the complete gene sequences, whereas your answer would, at best, yield only the aligned regions, which is a big difference.

For your approach to work one would then need to post-process the results to select for unique taxids, and also match the desired gene name, and using those select the accession numbers for the individual genes. Finally one would need to use the accession numbers and download the full sequences for each gene. See how a more appropriate answer is a lot more complicated even at a conceptual level - talking about Word and pasting queries, selecting options etc are distracting and also not quite correct overall.

ADD REPLY • link 5.7 years ago by Istvan Albert 101k

score 2 · Answer 1 · 2019-03-18

Both solutions below will require you to do some additional work but will get you accession numbers and start/stop positions of the gene from multiple viruses. You may need to eliminate any that are not baculoviruses.

Method 1:

Download this file (gene2accession.gz) from the NCBI Gene FTP site. It maps genes to accession numbers.
You can then extract the entries for lef-5 genes from this file with zgrep lef-5 gene2accession.gz. This should produce something like the output below (truncated), which is a list of all genes named lef-5.

tax_id GeneID  status  RNA_nucleotide_accession.version        RNA_nucleotide_gi       protein_accession.version       protein_gi      genomic_nucleotide_accession.version    genomic_nucleotide_gi   start_position_on_the_genomic_accession end_position_on_the_genomic_accession   orientation     assembly        mature_peptide_accession.version        mature_peptide_gi       Symbol
28289   921476  -       -       -       AAK70747.1      14591842        U53466.2        14591762        -       -       ?       -       -       -       orf87 lef-5
28289   921476  PROVISIONAL     -       -       NP_148871.1     14602324        NC_002816.1     14602241        68490   69218   -       -       -       -       orf87 lef-5
56947   4155945 -       -       -       ABC61202.1      84683292        DQ333351.1      84683224        -       -       ?       -       -       -       lef-5
56947   4155945 PROVISIONAL     -       -       YP_654489.1     109255340       NC_008168.1     109255272       57735   58457   -       -       -       -       lef-5
58094   16479687        -       -       -       AGR57091.1      526120557       KC961304.1      526120503       -       -       ?       -       -       -       lef-5

Take a look at the header of the file to understand what the columns mean. You can use the start/stop positions from the file above to extract just the gene region from each genomic accession.
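One way the start/stop columns from Method 1 might be put to use is sketched below. It filters the tab-separated gene2accession rows and prints one efetch command per gene region, so the commands can be inspected before anything is downloaded. Several things here are assumptions rather than part of the original answer: the function names gene_regions and efetch_commands are hypothetical, the columns are taken to be in the 16-column order shown above, and the start/end positions are treated as 0-based (worth checking against the README on the FTP site) and shifted by one for efetch's 1-based -seq_start/-seq_stop. It does not filter by taxonomy, so non-baculovirus hits still need to be removed.

```shell
# Hypothetical helper: keep rows whose Symbol column mentions the gene and
# that carry genomic coordinates; print accession, start, stop, strand.
gene_regions() {
    awk -v sym="$1" '
        BEGIN { FS = OFS = "\t" }
        # col 16 = Symbol, col 8 = genomic accession,
        # cols 10/11 = start/end, col 12 = orientation
        index($16, sym) > 0 && $10 != "-" && $11 != "-" {
            strand = ($12 == "-") ? 2 : 1        # efetch: 1 = plus, 2 = minus
            print $8, $10 + 1, $11 + 1, strand   # assumed 0-based -> 1-based
        }'
}

# Hypothetical helper: turn those four columns into efetch command lines
# (printed, not executed, so they can be reviewed first).
efetch_commands() {
    while IFS="$(printf '\t')" read -r acc start stop strand; do
        printf 'efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta\n' \
            "$acc" "$start" "$stop" "$strand"
    done
}

# Possible usage (needs gene2accession.gz and EDirect; not run here):
#   zgrep lef-5 gene2accession.gz | gene_regions lef-5 | efetch_commands > fetch_lef5.sh
#   sh fetch_lef5.sh > lef5.fasta
```

Rows without coordinates (the ones with "-" start/stop and "?" orientation) are skipped, since there is nothing to slice for them.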

Method 2:

$ esearch -db nuccore -query "lef-5 [GENE] AND baculovirus" | elink -target gene | efetch -format native | grep -A 5 -e "lef-5"

    93. lef-5
    LEF-5 [Bombyx mori nucleopolyhedrovirus]
    Other Aliases: Bmnpvgp087
    Other Designations: LEF-5
    Annotation:  NC_001962.1 (79558..80355)
    ID: 1488714
    --
    76. lef-5
    LEF-5 [Mamestra configurata nucleopolyhedrovirus A]
    Other Aliases: McnAVgp087
    Other Designations: LEF-5
    Annotation:  NC_003529.1 (76462..77283, complement)
    ID: 935879
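
The Annotation lines in the Method 2 output already hold an accession and a coordinate range, so they could be converted into region-limited downloads in a similar way. The sketch below is only an illustration: the function name annotation_to_efetch is hypothetical, it assumes the Annotation lines look exactly like the two examples above (accession, then a parenthesised start..stop, optionally followed by "complement"), and it treats those coordinates as 1-based, as is usual for GenBank-style ranges. The commands are printed rather than run.

```shell
# Hypothetical helper: read the text produced by Method 2 on stdin and
# print one efetch command per "Annotation:" line.
annotation_to_efetch() {
    awk '
        /^[[:space:]]*Annotation:/ && $3 ~ /^\([0-9]+\.\.[0-9]+/ {
            acc = $2                         # e.g. NC_001962.1
            coords = $3                      # e.g. (79558..80355) or (76462..77283,
            gsub(/[(),]/, "", coords)        # strip brackets and trailing comma
            split(coords, pos, /\.\./)       # pos[1] = start, pos[2] = stop
            strand = ($0 ~ /complement/) ? 2 : 1   # efetch: 1 = plus, 2 = minus
            printf "efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta\n", \
                acc, pos[1], pos[2], strand
        }'
}

# Possible usage (needs EDirect and network access; not run here):
#   esearch -db nuccore -query "lef-5 [GENE] AND baculovirus" | elink -target gene \
#     | efetch -format native | grep -A 5 -e "lef-5" | annotation_to_efetch > fetch_lef5.sh
```

Lines whose Annotation field has a different layout are silently skipped, so it is worth comparing the number of commands with the number of genes reported.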