Which tool can remove SignalP-predicted signal peptides from a FASTA file?
6.0 years ago
elabb@fau • 0

L.S.,

I have a list of proteins from either the UniProtKB or PlasmoDB databases that have a SignalP annotation. These proteins are thus predicted to have a signal peptide, of varying length, for secretion. I can manually remove the sequence corresponding to the predicted signal peptide, but that takes a lot of time :(

I was wondering if it's possible to do these kinds of operations automatically, perhaps using some kind of online tool. Or do I need to write a script of some sort to perform the operation?

Kind regards, Arman

sequence • 3.0k views

If you can get the ranges for each protein (without the signal peptide) in the form of a BED file then you can use bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) to do this.

For the initial table of ranges, you can download the UniProt data in GFF format and parse that table. Can you provide some examples?
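As a rough illustration of parsing that table, the sketch below builds a tiny made-up GFF file and turns it into BED-style ranges. The accession P00001, its length, and the toy.gff/toy.bed file names are invented for the example. It assumes the UniProt GFF layout with `##sequence-region` header lines and a tab-separated feature type of "Signal peptide". Because BED starts are 0-based, the GFF end position of the signal peptide can be used directly as the BED start of the mature protein.

```shell
# Toy GFF with one invented protein (P00001, 30 residues, signal peptide 1-5).
printf '##gff-version 3\n##sequence-region P00001 1 30\nP00001\tUniProtKB\tSignal peptide\t1\t5\t.\t.\t.\tNote=toy\nP00001\tUniProtKB\tChain\t6\t30\t.\t.\t.\tNote=toy\n' > toy.gff

# Remember each protein length from its ##sequence-region line, then print
# accession, 0-based start of the mature chain, and protein end for every
# "Signal peptide" feature.
awk 'BEGIN { OFS = "\t" }
     /^##sequence-region/ { split($0, f, " "); len[f[2]] = f[4]; next }
     /^#/ { next }
     { split($0, c, "\t"); if (c[3] == "Signal peptide") print c[1], c[5], len[c[1]] }' toy.gff > toy.bed

cat toy.bed
```

The resulting three-column file is the kind of input that bedtools getfasta expects through its -bed option.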


If you're able to put together a script, that will be the most convenient option, I assume.



6.0 years ago
vkkodali_ncbi ★ 3.8k

I followed your UniProt link and clicked on the 'Download' button to download two files:

  1. Download all 359 proteins in GFF format (uniprot.gff file)
  2. Download all 359 proteins in FASTA format (uniprot.fasta file)

Then, I processed the two files as follows:

  1. Process the uniprot.gff file to create a BED-like file that has three columns: UniProt accession, start position of the mature protein, end position of the protein.
  2. Process the uniprot.fasta file so that each header contains only the UniProt accession.
  3. Use bedtools getfasta to fetch the mature protein sequences.

You can use the following code:

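## step 1 - processing uniprot GFF file
## BED starts are 0-based, so the GFF end of the signal peptide ($5) is already
## the BED start of the mature protein; the end comes from ##sequence-region
cat uniprot.gff \
    | grep -E '^##sequence-region|Signal peptide' \
    | perl -pe 's/##sequence-region ([^ ]*) (\d+) (\d+)/$1\t$2\t$3/g' \
    | awk 'BEGIN{FS="\t";OFS="\t"}{if (NF==3) {p=$1; e=$3} else {s=$5; print p,s,e}}' \
    > uniprot.bed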

## step 2 - processing the uniprot.fasta file. Note, this overwrites the existing file
sed -ri 's/>[a-z]*\|([^\|]*).*$/>\1/g' uniprot.fasta

## step 3 - generate new fasta file with just the mature peptide sequences
bedtools getfasta -fi uniprot.fasta -bed uniprot.bed | fold -w 60 > uniprot.mat_pep.fasta
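
To sanity-check the coordinates on a small example without bedtools, something like the awk sketch below may help. It is only a rough stand-in for the extraction step, not a replacement for bedtools getfasta. The accession P00001, the sequence, and the check.* file names are invented, and it assumes one range per accession and headers that already contain only the accession. The start column is treated as 0-based and the end as exclusive, following the BED convention.

```shell
# Toy FASTA (sequence wrapped over two lines) and a matching BED range.
printf '>P00001\nMKKLLAAGGT\nTPPQQ\n' > check.fasta
printf 'P00001\t5\t15\n' > check.bed

awk 'BEGIN { FS = "\t" }
     NR == FNR { start[$1] = $2; stop[$1] = $3; next }   # first file: BED ranges
     /^>/ { id = substr($0, 2); next }                   # FASTA header line
     { seq[id] = seq[id] $0 }                            # join wrapped sequence lines
     END { for (id in start)
             print ">" id ":" start[id] "-" stop[id] "\n" substr(seq[id], start[id] + 1, stop[id] - start[id]) }' \
    check.bed check.fasta > check.mat_pep.fasta

cat check.mat_pep.fasta
```

With a signal peptide covering residues 1-5, the output should begin at residue 6 of the original sequence.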

Out of the 359 proteins, one (Q7KQM4) does not have a signal peptide annotation, so it is not included in the final output file uniprot.mat_pep.fasta.
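
One possible way to list which accessions were dropped like this is to compare the FASTA headers with the first column of the BED file. The sketch below uses invented accessions and file names (drop.fasta, drop.bed, and so on) and assumes the FASTA headers already contain only the accession.

```shell
# Toy inputs: two proteins in the FASTA, only one with a range in the BED.
printf '>P00001\nMKKLLAAGGTTPPQQ\n>P00002\nMSTNPKPQRK\n' > drop.fasta
printf 'P00001\t5\t15\n' > drop.bed

# Collect sorted, unique accessions from each file.
grep '^>' drop.fasta | sed 's/^>//' | sort -u > drop.fasta_ids.txt
cut -f 1 drop.bed | sort -u > drop.bed_ids.txt

# comm -23 keeps lines that appear only in the first file, i.e. proteins
# with no signal peptide range.
comm -23 drop.fasta_ids.txt drop.bed_ids.txt > drop.missing_ids.txt

cat drop.missing_ids.txt
```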


Alright! I have to process this information in order for me to fully understand what you've done ;) Can you send me the file with the mature protein sequences?

Thank you for your time!


If you run the commands shown above as-is, you should end up with the uniprot.mat_pep.fasta file. Are you having trouble running them? Here's the file: https://drive.google.com/open?id=1coo2uipv-zTK1F98xi09zfh6-Ahmmykt
