How to extract a specific gene sequence from a batch of genome files that I have?
2
0
Entering edit mode
4.7 years ago

Hi I have been trying to extract a specific gene sequence from a batch of multiple genome files. Could anyone help with this pls.?

genome gene sequence alignment • 3.4k views
ADD COMMENT
0
Entering edit mode

Please add example data and desired output. What did you try and where are the problems?

ADD REPLY
0
Entering edit mode

Hi, For exapmle, i have a gene X in a fasta format and I wanted to blast this gene against a bunch of genome fastas and retrieve the matches from each genome. Most of the examples given are for searching genes using IDs. In my case i do not have a ID but have the gene sequence in a file instead.

ADD REPLY
0
Entering edit mode

Hi all, I tried a way of using makeblastdb

Accordingly ran

makeblastdb -in AI1198.fasta -parse_seqids -blastdb_version 5 -title “ecoli” -dbtype prot -out ecoli

Then did blastdbcmd using

blastdbcmd -db ecoli -entry_batch gyrA.fasta -outfmt %f -out ref.fasta

However, getting the error as follows

Error: [blastdbcmd] Skipped >GYRA

Error: [blastdbcmd] Skipped CATGTTAAATAAGGGGCTGGAAAGCCTTGCAGAAGGAGAAAAACCGCTGCTCCACAGTGATCAAGGGTGG
Error: [blastdbcmd] Skipped CATTACCGGATAAAAAGTTATCAGTCTGACCTGGCAGACAAAGGGTTAGTGCAGAGCATGTCGCGCAAGG
Error: [blastdbcmd] Skipped GCAACTGTCTTGATAATGCGGTAATGGAGAATTTCTTTGGTCACCTGAAAGAAGAAATTTACTATCGCCG
Error: [blastdbcmd] Skipped GGACTACAGAAGTGTAGAAGAGCTGGAAAATGCCGTTAACGAATACATAACTTACTGGAATCAAAAAAGA
Error: [blastdbcmd] Skipped ATAAAACTCAGTCTTGGGGGACTCAGCCCGGTAGAATACCGGACTGAGTATCAAAAAGCCGGTTAACTGA
Error: [blastdbcmd] Skipped AACTGTCCAGGTTTTGGGGTTCAGTTCATGGCGCGTCTTATCCAGCTTACGCAGGCACGATAGGGGGCAG
Error: [blastdbcmd] Skipped CTTATTCCTCCACATACGCCAGCGAATACGGCTTCCCCAACTTGCCCACCTCCATACGTGTCCTCCTTAC
Error: [blastdbcmd] Skipped CAGAAATTTATCCTTAAGCTCCTCAATAACCATTTTCCTGCTAACTAAATTCGTGGTTAAGGTTGCATGA
Error: [blastdbcmd] Skipped TGATATGCAACAAATGTATAACATTTTCTTTACCAAAAGAAATAAACAAAAGCGACCGACAAAAGCATCG
Error: [blastdbcmd] Skipped GATTACGGCAGGAGACATAATGGCATGGCTGAAAGTACTGTAACGGCAGACAGCAAACTGACAAGTAGTG
Error: [blastdbcmd] Skipped ATACTCGTCGCCGCATTTGGGCAATTGTGGGGGCCTCTTCAGGTAATCTGGTCGAGTGGTTCGATTTCTA
Error: [blastdbcmd] Skipped TGTCTACTCGTTCTGTTCACTCTACTTCGCTCACATCTTCTTCCCTTCCGGGAACACGACGACTCAACTA
Error: [blastdbcmd] Skipped CTGCAAACAGCAGGTGTTTTTGCTGCGGGATTTCTGATGCGCCCAATAGGCGGTTGGCTATTTGGCCGC
ADD REPLY
0
Entering edit mode

It is still unclear what you are trying to achive? Based on this latest post your title is not quite describing the desired outcome: How to extract a fasta sequence from a batch of multiple fasta genome files. It appears that you are interested in a specific gene sequence , either from databases or from your own genome sequences. Is that correct?

What are the contents of AI1198.fasta? E. coli genomes?

blastdbcmd retrieves full sequence present in your blast database. If there are genomes in your database then it is going to retrieve entire genomes. Not just the gene sequence you are interested in.

If you need to get gyrA genes from E. coli genomes then:

  1. Search for that gene at NCBI Gene database
  2. Click on the Result by taxon and select E. coli.
  3. Download the gene sequences
  4. If you want to get gyrA gene from all entries you could just download directly after the search.

If you have genomes and need the gene sequence from those, then you will need to take gyrA representative gene sequence. Blastn or blat against your genomes of interest and then parse the blast results to retrieve sequneces you need.

ADD REPLY
0
Entering edit mode

Hi, Yes you are right. " It appears that you are interested in a specific gene sequence , either from databases or from your own genome sequences. Is that correct?" --- this is what I am trying to achieve.

"If you have genomes are need the gene sequence from, then you will need to take gyrA representative gene sequence. Blastn or blat against your genomes of interest and then parse the blast results to retrieve sequneces you need." -- I will try this to retrieve gene of interest from my genomes, and will post the results.

Thanks.

ADD REPLY
0
Entering edit mode

Please edit the original question and add this information there. Also update the title of this post to reflect that you are looking for sequences of a specific gene from genome data you have.

ADD REPLY
0
Entering edit mode
4.7 years ago

If you know the gene id you can use seqkit grep.

ADD COMMENT
0
Entering edit mode

This may not work for entire gene sequences (do you know)? Also keep this caveat in mind.

-s, --by-seq                search subseq on seq, only positive strand is searched
ADD REPLY
0
Entering edit mode

Hi, I do not have a gene id in my case. I will try your suggestion.

thanks, Nav

ADD REPLY
0
Entering edit mode
4.7 years ago
ale_abd ▴ 50

Supposing you have all your fasta files in the same directory and that they are not linearized (sequence in more than one line):

cat *.fasta | awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {printf "\n"}' | grep -A1 ACC > ACC.fasta

With cat you concatente all the files, then with awk you liniarize the sequences and then with grep you just retain your target accession and the following line, that means its sequence. Hope it helps.

ADD COMMENT
0
Entering edit mode

Hi, Thanks for the suggestion, I hope this will help me. I will try this. Also if I need to use the gene.fasta as input for grepping from the *.fasta how do i use it..?

ADD REPLY

Login before adding your answer.

Traffic: 1966 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6