How to find start and stop codon for sequences in a fasta file?
9.6 years ago
grayapply2009 ▴ 300

I ran blastn and blastx on my sequences (~400,000 sequences). How do I find and label the start and stop codons for each sequence in a FASTA file?

next-gen • 7.7k views
ADD COMMENT
9.6 years ago
Kamil ★ 2.3k

I suggest that you read about the genetic code to find the codons relevant to your organism.

You'll want to search for codons, perhaps with a tool like fasgrep. You might write your own script if you have a particular output format in mind.

On second glance, it seems that fasgrep is only useful for searching for sequence identifiers, not the sequences themselves.

ADD COMMENT

Thank you for the information, Kamil. So fasgrep works like ExPASy? Does it pick the longest possible translated sequence?

ADD REPLY

fasgrep is like grep. It searches for a string in a body of text. In your question, you ask about finding codons. I'd recommend using a search tool like grep to find codons.
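As a rough illustration of what a grep-style search gives you, here is a minimal Python sketch. The function name `find_codons` and the example sequence are made up for illustration, and the stop codons assume the standard genetic code. It reports every occurrence of a codon together with the frame (position modulo 3) it falls in, which shows why a plain string search alone cannot tell you which ATG or TAG belongs to the coding sequence.

```python
import re

# Assumption: standard genetic code. Other organisms or organelles may differ.
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_codons(seq, codons):
    """Return (0-based position, frame) for every occurrence of the given codons."""
    seq = seq.upper()
    # The lookahead makes the regex report overlapping matches too.
    pattern = re.compile("(?=(" + "|".join(codons) + "))")
    return [(m.start(), m.start() % 3) for m in pattern.finditer(seq)]

seq = "CCATGAAATAGATGTTTTGA"  # made-up example sequence
starts = find_codons(seq, ["ATG"])
stops = find_codons(seq, STOP_CODONS)
print("ATG hits: ", starts)
print("stop hits:", stops)
```

Hits in different frames cannot belong to the same open reading frame, so the frame column is the part a grep-like tool does not give you for free.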

If you have a different goal, you should edit your question. For example, if you wish to find possible coding sequences within a nucleotide sequence, you might consider other tools designed for this purpose:

As you mentioned, ExPASy is a nice portal to find other tools that might meet your needs.

ADD REPLY

Yeah, I want to identify the start and stop codons for each sequence, but how do I know the codons grepped by fasgrep are the correct ones for the coding sequence? I mean, there are multiple "ATG"s or "TAG"s. Does this program take the reading frame into consideration?

Besides, how do I label those codons when I grep them in a fasta file?

ADD REPLY

If existing programs do not meet your needs, then you should write your own scripts to achieve your goals. If you're familiar with Python, this looks like a good starting point: Identifying open reading frames
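To give an idea of what such a script might look like, here is a minimal sketch (not the code from the linked page). It assumes the standard genetic code, scans only the three forward frames, and treats the first in-frame ATG after a stop as the start. The name `find_orfs` and the `min_codons` parameter are hypothetical. For the reverse strand you would run the same scan on the reverse complement.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """List (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts the codons from ATG up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

seq = "CCATGAAATAGATGTTTTGA"  # made-up example sequence
orfs = find_orfs(seq)
longest = max(orfs, key=lambda o: o[1] - o[0]) if orfs else None
print(orfs, longest)
```

Picking the longest ORF is only a heuristic. If you have a blastx hit for the sequence, the frame of that hit is a better guide to which ORF is real.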

Consider providing an example of your input and an example of your desired output. That might increase the clarity of your question.
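For example, "labeling" could mean either of the two output styles sketched here. Both are purely hypothetical formats: `label_codons` brackets the codons inside the sequence (which is no longer standard FASTA), and `annotate_header` leaves the sequence alone and writes 1-based codon positions into the header line. The coordinates are assumed to be a 0-based start and an exclusive end that includes the stop codon.

```python
def label_codons(seq, start, end):
    """Wrap the first and last codon of seq[start:end] in square brackets."""
    return (seq[:start] + "[" + seq[start:start + 3] + "]"
            + seq[start + 3:end - 3]
            + "[" + seq[end - 3:end] + "]" + seq[end:])

def annotate_header(name, start, end):
    """Build a FASTA header carrying 1-based positions of both codons."""
    return f">{name} start_codon={start + 1} stop_codon={end - 2}"

seq = "CCATGAAATAGATGTTTTGA"  # made-up example sequence
print(annotate_header("seq1", 2, 11))
print(label_codons(seq, 2, 11))
```

The header style is usually safer, because downstream tools can still read the file as ordinary FASTA.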

ADD REPLY

Great! I'll take a look at the code. Thank you, Kamil!

ADD REPLY
9.6 years ago

Depending on how you obtained these sequences, it is likely that the start and/or the stop codon is missing.

BlastX can find a homologous protein sequence from the translation of an internal part of your sequence, even though that part lacks the start and stop codons.
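One way to check this for a given sequence is to extend the BlastX hit in its own frame, as in the sketch below. It assumes you have already converted the hit to 0-based forward-strand nucleotide coordinates (`hit_start` inclusive, `hit_end` exclusive, both on codon boundaries) and that the standard genetic code applies. The function name `extend_hit` is hypothetical. A result of `None` means no in-frame start or stop codon was found before the sequence ran out, which suggests a partial sequence.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extend_hit(seq, hit_start, hit_end):
    """Return (start_pos, stop_pos) found by extending a hit in frame.

    start_pos: most upstream in-frame ATG not separated from the hit by a stop.
    stop_pos:  first in-frame stop codon at or after hit_end.
    Either value is None if the sequence ends first.
    """
    seq = seq.upper()
    start_pos = None
    for i in range(hit_start, -1, -3):  # walk upstream codon by codon
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            start_pos = i
    stop_pos = None
    for i in range(hit_end, len(seq) - 2, 3):  # walk downstream
        if seq[i:i + 3] in STOP_CODONS:
            stop_pos = i
            break
    return start_pos, stop_pos

print(extend_hit("CCATGAAATAGATGTTTTGA", 5, 8))
print(extend_hit("TAAATGAAACCC", 6, 9))
```

Note that an in-frame ATG is only a candidate: internal methionine codons are common, so for a truncated sequence the ATG found this way may not be the true start.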

ADD COMMENT

How do I find the start and stop codons in the FASTA file for sequences that have at least one of them?

ADD REPLY

For an individual sequence, you can try services like:

  • NCBI ORFfinder
  • EMBOSS. It includes several programs, available in graphical and text mode, and it accepts a FASTA file with many sequences at once.
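
If you would rather script the batch step yourself, the main requirement for ~400,000 sequences is to stream the file one record at a time instead of loading it all into memory. Here is a minimal sketch of that. The reader `read_fasta` is a hypothetical helper, the in-memory example stands in for a real file handle, and the per-record check is deliberately trivial.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle, one at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# Made-up two-record example; with a real file use open("your_file.fasta").
example = io.StringIO(">seq1 test\nCCATGAAA\nTAGATG\n>seq2\nAAACCC\n")
records = list(read_fasta(example))
for name, seq in records:
    # str.find returns -1 when the sequence contains no ATG at all.
    print(name, len(seq), seq.upper().find("ATG"))
```

Any per-sequence analysis can be dropped into the loop body in place of the simple ATG lookup.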
ADD REPLY
