Obtaining nucleotide sequences from a set of protein identifiers
2
1
Entering edit mode
5.0 years ago
hopf ▴ 10

I've done a bit of searching but haven't seen this specific problem raised (or answered) before. Apologies if it has.

I'm interested in a particular (bacterial) protein family. I have reason to believe the genomic context of these proteins will be interesting as well. So what I'd like to do is obtain every unique genomic context it's found in. That is, whether or not the protein is identical, if the genomic region surrounding the gene isn't the same, consider it unique.

So far, I've obtained a set of uniprot identifiers that match my HMM, and from those extracted protein IDs. Many of these are WP sequences, which means the identical protein is found in multiple genomes. I think I've found a way to use entrez to link from these redundant sequences to genome accessions, but it's sort of slow.

Is there a better way to do this? I feel like there must be, but I haven't been able to put it together.

sequence cds entrez elink etools • 1.5k views
ADD COMMENT
0
Entering edit mode

I think I've found a way to use entrez to link from these redundant sequences to genome accessions, but it's sort of slow.

Roughtly, is this how you are doing it? (note, I am downloading the CDS sequences for just the RefSeq genomic accessions here)

epost -db protein -format acc -input <protein_accs.list> \
    | efetch -db protein -format ipg \
    | grep 'RefSeq' \
    | cut -f3,4,5,6 \
    | while read -r acc start stop strand ; 
        do 
            efetch -db nuccore -id $acc -seq_start $start -seq_stop $stop -strand $strand -format fasta ;
        done

If that's the case, then I am afraid you cannot get it any faster using Entrez Direct. How many protein accessions are you starting with?

Here's an alternate way: 1. Make an IPG report of all of the protein accessions you have. This report has the GCF assembly accession. 2. You can download the entire genomes in fasta format corresponding to those GCF assembly accessions from NCBI FTP site. 3. From the IPG report, make a BED file with seq-id, start, stop and strand information 4. Use bedtools getfasta with the BED file from #3 and genome sequences from #2.

ADD REPLY
1
Entering edit mode
5.0 years ago
Mensur Dlakic ★ 28k

I suggest you try STRING. After selecting "Search" you can enter either protein name or sequence. Once in results you should be able to switch the viewer from "Network" (default) to "Neighborhood." There are other useful tools that might interest you.

The only drawback I see with regard to your needs is that not all sequenced genomes will be available unlike with regular database search.

ADD COMMENT
0
Entering edit mode

STRING is nice, but I don't see a way to get this programmatically for a large collection of sequences. I'd like to be able to find contexts for a fairly large number of sequences.

ADD REPLY
0
Entering edit mode

The more information you provide, the easier it is to offer advice. I don't know what exactly "a fairly large number of sequences" means, but you can download STRING data or use their API.

ADD REPLY
1
Entering edit mode
5.0 years ago
Chirag Parsania ★ 2.0k

Check this out. Step by step guidance to obtain CDS from protein ID. It may help.

ADD COMMENT
0
Entering edit mode

It looks like your etools command downloads the entire genome or scaffold and then filters it for the CDS. I'd prefer to just grab the CDS coordinates, maybe using identical Protein Groups somehow, and then fetch those specifically. I'm looking at several thousand unique protein sequences, so the overhead will be lower thi

ADD REPLY
0
Entering edit mode

Try my script, ProteinFASTASearchByFASTATitle.py, from here: https://github.com/kevinblighe/PythonScripts

python3 ProteinFASTASearchByFASTATitle.py -e kevin@clinicalbioinformatics.co.uk -t "Bacillus anthracis"
>WP_154574506.1 IS3 family transposase, partial [Bacillus anthracis]
KKDEYSIKEICILIGIPRSTYYRWKNKEKDVKEAKLEQAILTICMTNHFRYGHRKVTALLKRKYNYHPNR
KTVQKIMQKKNLQCRVKRKRRTWINGESRIVVENLLNRNFQANKPNEKWVTDITYLPFGTEMLYLLSIMD
LYNNEIIAYEISNRQDVTLVLRTVEKAIKLQQKTQIILHSDQGAVYTSYAFQTLSKKMALPQVCPVKEIV
MIMP


>WP_154556816.1 DUF4180 domain-containing protein, partial [Bacillus anthracis]
FAIVGDFSMYTSKSLKDFIYECNKGKDIFYLATEQQAIEKLSTLK


>WP_154556815.1 helix-turn-helix transcriptional regulator, partial [Bacillus anthracis]
MEFYDLGITIKELRIKKNISQSELCHGICSQSQISKIEKGVIYPSSILLYQLSERLGIDPNNIFALTKNK
KFKYIENVKCIMKDCIRQHQ


>WP_154556814.1 DUF4180 domain-containing protein, partial [Bacillus anthracis]
MEIKKVVIDGINIAVIRNNKVLISDVQSALDTMATVQYEVNAKHIIIHKSLISEDFFDLKTRLAGDIL
ADD REPLY

Login before adding your answer.

Traffic: 1677 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6