9.1 years ago
I am using OrthoFinder to distinguish orthologous from non-orthologous proteins between two strains. I am looking for a tool or a Perl script that does the following:
1. Reading the protein code from the OrthoFinder output file
e.g., OG0008170: 0023|379.49.peg.7038
2. Searching for that code in an amino-acid FASTA file and deleting it together with its sequence.
e.g.,
>0023|379.49.peg.7038
MSGNSEVRVENLVVERGGKRVIHDISFSVAAGKVTTLLGANGAGKSSTVMAMAGVLPRGG
AVRLGDVALEGFVPDRIRRAGLALVPEGHRVLGQLSVEDNILVAALDPSATARRQGLERA
YEIFPELAERRRQSASDLSGGQKQMVAMAQAFVAKPRFMIVDELSLGLAPAVVKRLAEAL
KIAAAGGIGVLLIEQFANLALDLADKALVLERGRLVFDGPAATLKGQPDILHGAYLAS
Thank you guys
see How To Remove Certain Sequences From A Fasta File
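For readers who prefer a self-contained script, here is a minimal Python sketch of the two steps in the question. It is not SeqFilter. It assumes the OrthoFinder lines look like the example above (a label, a colon, then whitespace-separated protein codes) and that the FASTA ID is the header text up to the first whitespace. The function names and the second FASTA record are made up for illustration.

```python
import io

def read_orthogroup_ids(handle):
    """Collect protein IDs from lines such as 'OG0008170: 0023|379.49.peg.7038'."""
    ids = set()
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Everything after the first ':' is treated as whitespace-separated IDs.
        # Lines without a ':' contribute nothing.
        _, _, members = line.partition(":")
        ids.update(members.split())
    return ids

def filter_fasta(handle, exclude_ids):
    """Yield the lines of every FASTA record whose ID is not in exclude_ids."""
    keep = True
    for line in handle:
        if line.startswith(">"):
            # Assumption: the record ID is the header text up to the first whitespace.
            fields = line[1:].split()
            keep = not fields or fields[0] not in exclude_ids
        if keep:
            yield line

# In-memory stand-ins for the two files; the second record is hypothetical.
ortho = io.StringIO("OG0008170: 0023|379.49.peg.7038\n")
fasta = io.StringIO(
    ">0023|379.49.peg.7038\n"
    "MSGNSEVRVENLVVERGGKRVIHDISFSVAAGKVTTLLGANGAGKSSTVMAMAGVLPRGG\n"
    ">0023|379.49.peg.0001\n"
    "MKTAYIAKQR\n"
)

exclude = read_orthogroup_ids(ortho)
filtered = "".join(filter_fasta(fasta, exclude))
print(filtered, end="")
```

With real data you would pass open file handles instead of the io.StringIO objects and write the filtered text to a new file.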
I am kind of a rookie at this. How am I supposed to run the SeqFilter program?
Thank you so much!
Assuming you are using a Linux command line, use the full path to the file:
Hey, I've been trying SeqFilter over the past few days. I have noticed that the output file still contains some of the undesired sequences. Have a look at my command line:
AccGen0014BIS.fa = file containing the sequence codes to be deleted from 0014BIS.fasta.
After --ids I put, as you can see, 0014, since this is the identifier I considered most appropriate; I also tried 00, which gave the same result. Maybe I am missing something needed to sweep my original 0014BIS.fasta file completely. Any guidance you could provide would be extremely helpful. Thank you again.
Okay, so I think you misunderstood the command somewhat.
--ids needs a list of IDs, either provided directly, e.g. --ids "seq1,seq2,seq3", or by providing the name of a file with IDs (newline-separated). In the example above, --ids - means to read the list of IDs from STDIN. The list is generated by the cut command from your OrthoFinder file.
What does AccGen0014BIS.fa look like - is it a FASTA file or OrthoFinder output?
AccGen0014BIS.fa looks like this:
And the FASTA file, like this:
Which ID can I use?
Thank you.
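One way to find out which ID to use is to check which candidate strings occur as complete FASTA record IDs. The Python sketch below does that, assuming the ID is the header text up to the first whitespace and that matching is on the whole ID rather than a prefix. It does not describe how SeqFilter matches IDs internally, and the function names are only illustrative.

```python
import io

def fasta_ids(handle):
    """Return the set of record IDs (header text up to the first whitespace)."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids

def check_ids(wanted, handle):
    """Split wanted IDs into (found, missing) by exact match against the headers."""
    present = fasta_ids(handle)
    wanted = set(wanted)
    return wanted & present, wanted - present

# In-memory stand-in for a FASTA file, using the header from the question.
fasta = io.StringIO(">0023|379.49.peg.7038\nMSGNSEVRVENLVV\n")

# A full protein code versus a bare prefix of it.
found, missing = check_ids(["0023|379.49.peg.7038", "0023"], fasta)
print("found:", sorted(found))
print("missing:", sorted(missing))
```

Under these assumptions a short prefix is reported as missing, because it is not the complete ID of any record.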
Man, your software worked great! Two things:
1) Is it possible to use SeqFilter this way?
Given these groups of orthologs:
i.e.:
I want to find each of the five proteins (three on the first line and the other two on the second) in another FASTA file and delete them, that is, remove both the protein code (e.g., 0023|379.49.peg.3509) and the amino-acid sequence below it, so that every protein not listed in the first file remains in an output file.
2) How can I cite RefSeq?
Once again, I am deeply thankful for your guidance.
Greetings from Mexico.
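Point 1 can also be sketched in plain Python, independent of SeqFilter. The sketch assumes each orthogroup line is a label, a colon, and whitespace-separated protein codes, and it splits the FASTA records into those listed in any orthogroup and the rest. Apart from 0023|379.49.peg.3509, all IDs, sequences, and function names here are hypothetical.

```python
import io

def parse_orthogroups(handle):
    """Map each orthogroup label to its member IDs ('OGxxxx: id1 id2 ...')."""
    groups = {}
    for line in handle:
        label, sep, members = line.strip().partition(":")
        if sep:  # skip lines that have no ':' separator
            groups[label.strip()] = members.split()
    return groups

def split_fasta(handle, ids):
    """Return (matched, rest): record text for IDs in ids, and for all others."""
    matched, rest = [], []
    target = rest
    for line in handle:
        if line.startswith(">"):
            # Assumption: the record ID is the header text up to the first whitespace.
            fields = line[1:].split()
            target = matched if fields and fields[0] in ids else rest
        target.append(line)
    return "".join(matched), "".join(rest)

# Two orthogroups with three and two members (mostly hypothetical IDs).
ortho = io.StringIO(
    "OG0000001: 0023|379.49.peg.3509 0023|379.49.peg.0002 0023|379.49.peg.0003\n"
    "OG0000002: 0023|379.49.peg.0004 0023|379.49.peg.0005\n"
)
fasta = io.StringIO(
    ">0023|379.49.peg.3509\nMSGNSE\n"
    ">0023|379.49.peg.0004\nMKTAYI\n"
    ">0023|379.49.peg.9999\nMLLPQR\n"
)

groups = parse_orthogroups(ortho)
ids = {member for members in groups.values() for member in members}
removed, kept = split_fasta(fasta, ids)
print(kept, end="")
```

Here kept holds the records to write to the output file, and removed holds the records that were taken out, in case they are needed elsewhere.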
I'm not 100% sure I fully understand what you want to do. My guess is that you want to extract the 5 sequences from file A and move them to file B. You can more or less do that in two steps:
I did not publish SeqFilter as a stand-alone tool. You can either cite just the GitHub page, or cite proovread (https://github.com/BioInf-Wuerzburg/proovread#citing-proovread), a program I developed that includes SeqFilter as an important module.
Wait, I've just tried leaving the - sign in, just as you originally posted, and it seems to work perfectly.
Thank you again. If anything comes up, I will seek your guidance, if that is OK.