Question

Software or Perl code to remove non-orthologous proteins from a FASTA file based on an OrthoFinder output file?

9.1 years ago

martindaniel_150988 • 0

I am using OrthoFinder to distinguish orthologous from non-orthologous proteins between two strains. I am looking for software or Perl code that does the following:

1. Read the protein code (sequence ID) from the OrthoFinder output file

e.g., OG0008170: 0023|379.49.peg.7038

2. Search for that code in an amino-acid FASTA file and delete it together with its sequence.

e.g.,

>0023|379.49.peg.7038
MSGNSEVRVENLVVERGGKRVIHDISFSVAAGKVTTLLGANGAGKSSTVMAMAGVLPRGG
AVRLGDVALEGFVPDRIRRAGLALVPEGHRVLGQLSVEDNILVAALDPSATARRQGLERA
YEIFPELAERRRQSASDLSGGQKQMVAMAQAFVAKPRFMIVDELSLGLAPAVVKRLAEAL
KIAAAGGIGVLLIEQFANLALDLADKALVLERGRLVFDGPAATLKGQPDILHGAYLAS

Thank you guys

orthologous PERL OrthoFinder • 3.5k views

Written 9.1 years ago by martindaniel_150988; updated 2.2 years ago by Ram

See "How To Remove Certain Sequences From A Fasta File".

Written 9.1 years ago by Pierre Lindenbaum

I am kind of a rookie at this. How am I supposed to run the SeqFilter program?

Thank you so much!

Written 9.1 years ago by martindaniel_150988

Assuming you are using a Linux command line, call the program by its full path:

/path/to/SeqFilter/bin/SeqFilter

Written 9.1 years ago by thackl

Hey, I've been trying SeqFilter these days. I have noticed that the output file still contains some of the undesired sequences. Have a look at my command line:

cut -d" " -f2 AccGen0014BIS.fa | bin/SeqFilter 0014BIS.fasta --ids 0014 --ids-exclude --out FASTA-filtered4.fa

AccGen0014BIS.fa = File containing the sequence codes to be deleted from 0014BIS.fasta.

After --ids I put 0014, as you can see, because this is the identifier I considered most appropriate; I also tried 00, which gave the same result.

Maybe I am missing something needed to remove every listed sequence from my original 0014BIS.fasta file. Any guidance you could provide would be extremely helpful. Thank you again.

Written 9.1 years ago by martindaniel_150988; updated 5.0 years ago by Ram

Okay, so I think you misunderstood the command somewhat.

--ids needs a list of IDs, provided either directly (--ids "seq1,seq2,seq3") or as the name of a file with one ID per line. In my original example, --ids - means to read the list of IDs from STDIN. That list is generated by the cut command from your OrthoFinder file.
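
To see what list "--ids -" actually receives, here is a minimal sketch of the cut step on its own. It assumes one ID per line, as in the sample lines quoted in this thread; the file name orthofile_sample.txt is made up for illustration.

```shell
# Hypothetical sample file in the "GROUP: ID" layout quoted in this thread.
printf '%s\n' \
  'OG0005194: 0014|379.23.peg.2003' \
  'OG0005195: 0014|379.23.peg.2004' \
  'OG0005196: 0014|379.23.peg.2005' > orthofile_sample.txt

# Keep only the second space-separated field, i.e. the sequence ID.
# This prints one complete ID per line; that list is what the pipe
# hands to "--ids -".
cut -d" " -f2 orthofile_sample.txt
```

Redirecting the same output to a file instead of piping it should give the newline-separated ID file that --ids also accepts.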

What does AccGen0014BIS.fa look like? Is it a FASTA file or OrthoFinder output?

Written 9.1 years ago by thackl; updated 5.0 years ago by Ram

AccGen0014BIS.fa looks like this:

OG0005194: 0014|379.23.peg.2003
OG0005195: 0014|379.23.peg.2004
OG0005196: 0014|379.23.peg.2005

And the FASTA file, like this:

>0014|379.23.peg.1984
MIIGVDINPDRKEWGEKFGMTHFVNPKEVGDDIVPYLVNLTKRNGDLIGGADYTFDCTGN
TKVMRQALEASHRGWGKSIIIGVAGAGQEISTRPFQLVTGRNWMGTAFGGARGRTDVPDI
VDWYMQGKIQIDPMITHTMPLDDINKGFDMMHKGESIRGVVVY
>0014|379.23.peg.1985
MASATYTADLKSVDELLARELYDLLKMRVDVFVVEQNCAYPELDGKDIDALHLRLLENGE
LLASARILKPHGPHEPSKIGRVVVSPAHRGKRLGDALMSESISACERLYPANPIALSAQA
HLRRFYEAFGFSVASEEYLEDGIPHIDMVRELAIRPAGISS

Which ID can I use?

Thank you.

Written 9.1 years ago by martindaniel_150988; updated 5.0 years ago by Ram

Man, your software worked great! Two things:

1) Is it possible to use SeqFilter this way?

Given these orthologous groups, for example:

OG0004044: 0023|379.49.peg.3509 0023|379.49.peg.6833 0014|379.23.peg.6046
OG0005393: 0014|379.23.peg.1985 0023|379.49.peg.6315

I want to find each of the five proteins (three on the first line and two on the second) in another FASTA file and erase them (that is, erase the protein code, e.g. 0023|379.49.peg.3509, and the amino-acid sequence beneath it), so that every protein not listed in the first file remains in an output file.

2) How can I cite RefSeq?

Once again, I am deeply thankful for your guidance.

Greetings from Mexico.

Written 9.0 years ago by martindaniel_150988; updated 5.0 years ago by Ram

I'm not 100% sure I understand what you want to do. My guess is that you want to extract the five sequences from file A and move them to file B. You can do that in two steps:

# extract the five sequences from A.fa into B.fa
cut -d" " -f2- ORTHOFILE | tr ' ' '\n' | SeqFilter --ids - --out B.fa A.fa
# write a copy of A.fa without the five sequences to A2.fa
cut -d" " -f2- ORTHOFILE | tr ' ' '\n' | SeqFilter --ids - --ids-exclude --out A2.fa A.fa
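
If the filtered file ever seems to contain sequences that should be gone, one quick check is to compare the ID list against the headers that remain. The following is a minimal sketch with standard Unix tools and is not part of SeqFilter; the .sample file names, the leftover_ids helper, and the shortened sequences are made up for illustration.

```shell
# Hypothetical sample files built from IDs quoted in this thread.
printf '%s\n' \
  'OG0004044: 0023|379.49.peg.3509 0023|379.49.peg.6833 0014|379.23.peg.6046' \
  'OG0005393: 0014|379.23.peg.1985 0023|379.49.peg.6315' > ORTHOFILE.sample
# A "filtered" FASTA that deliberately still contains one listed ID (…1985).
printf '%s\n' \
  '>0014|379.23.peg.1984' 'MIIGV' \
  '>0014|379.23.peg.1985' 'MASAT' > A2.sample.fa

# Flatten the groups to one ID per line, as in the commands above.
cut -d" " -f2- ORTHOFILE.sample | tr ' ' '\n' > ids.sample.txt

# leftover_ids IDS FASTA: print every ID from IDS that still appears as a
# FASTA header (">" stripped, text after the first space ignored).
# -F treats IDs as fixed strings (so "|" is literal); -x demands a full match.
leftover_ids() {
  grep '^>' "$2" | sed 's/^>//' | cut -d" " -f1 | grep -Fx -f "$1"
}

leftover_ids ids.sample.txt A2.sample.fa   # prints 0014|379.23.peg.1985
```

An empty result means none of the listed IDs is left in the FASTA file.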

I did not publish SeqFilter as a stand-alone tool. You can either cite just the GitHub page or use the citation at https://github.com/BioInf-Wuerzburg/proovread#citing-proovread; proovread is a program I developed that includes SeqFilter as an important module.

Written 9.0 years ago by thackl; updated 5.0 years ago by Ram

Wait, I've just tried keeping the - sign, just as you originally posted, and it seems to work perfectly.

Thank you again. If anything comes up, I will seek your guidance, if that is OK.

Written 9.1 years ago by martindaniel_150988; updated 5.0 years ago by Ram

Ram · Answer 1 · 2015-10-09

9.1 years ago

thackl ★ 3.0k

This should work, assuming the OrthoFinder output is always as simple as your example.

git clone https://github.com/BioInf-Wuerzburg/SeqFilter.git
cd SeqFilter
make  # just fetches some libraries, no root or anything required
cut -d" " -f2 ORTHOFILE | SeqFilter FASTA.fa --ids - --ids-exclude --out FASTA-filtered.fa
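
For anyone who cannot install SeqFilter, the same exclusion step can be approximated with plain awk. This is only a sketch of an alternative, not how SeqFilter works internally; the function name filter_fasta and the file names in the usage comment are made up, and it assumes the ID is the header text up to the first whitespace.

```shell
# filter_fasta IDS FASTA
# Print every FASTA record whose ID is NOT listed in IDS (one ID per line).
filter_fasta() {
  awk '
    # First file (the ID list): remember each non-empty ID, print nothing.
    FILENAME == ARGV[1] { if (NF) drop[$1] = 1; next }
    # Header line: strip the leading ">" and decide the fate of this record.
    /^>/ { id = substr($1, 2); keep = !(id in drop) }
    # Print header and sequence lines only while the current record is kept.
    keep
  ' "$1" "$2"
}

# Usage (placeholder file names):
#   cut -d" " -f2 ORTHOFILE > ids.txt
#   filter_fasta ids.txt FASTA.fa > FASTA-filtered.fa
```

Because the IDs are stored as array keys, they are compared as exact strings, so a partial ID such as 0014 would not remove anything.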

Written 9.1 years ago by thackl; updated 5.0 years ago by Ram