Hi all,
I am trying to fill in annotations from a BLASTP search against the same database wherever the BLASTX column is 0. The columns look like this:
BLASTX BLASTP
annot1 annot1
annot 2 0
0 annot3
In this example I want "annot3" to be copied into the BLASTX column, so that the column contains the original BLASTX annotations and any empty cell gets filled with the BLASTP annotation, like so:
BLASTX BLASTP
annot1 annot1
annot 2 0
annot3 annot3
The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.
I have been trying to do this manually in LibreOffice, but with 160,000 sequences I realized the only practical way would be to use awk or Python scripting.
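For reference, the fill described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the two columns are TAB delimited, an empty cell is the literal string "0", and the function name fill_blastx and the inline sample data are made up for illustration.

```python
def fill_blastx(lines, empty="0"):
    """Return (blastx, blastp) pairs with empty BLASTX cells filled from BLASTP.

    Assumes two TAB-delimited columns and "0" as the empty marker.
    """
    filled = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        blastx, blastp = line.split("\t")[:2]
        if blastx == empty:
            blastx = blastp  # fall back to the BLASTP annotation
        filled.append((blastx, blastp))
    return filled


# Hypothetical sample mirroring the example table (header row left out).
sample = ["annot1\tannot1\n", "annot2\t0\n", "0\tannot3\n"]
result = fill_blastx(sample)
for blastx, blastp in result:
    print(blastx, blastp, sep="\t")
```

With a real file, the list of lines could be replaced by an open file handle, since the function only iterates over lines. A row where both columns are "0" stays "0".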
Can you show an example of the annotations, to see delimiters and possible ids?
Sure. LibreOffice is frozen at the moment, but the two columns are TAB delimited and the annotations within them are delimited with '^'. The identifier I need is the one before the first '^'. My plan was to take the new list of annotations, open it in LibreOffice as '^' delimited, copy the first column (the protein ID) into a new list, and then use this script:
To parse out the gene name.
Then I was going to use this script to get gene family counts:
As soon as LibreOffice stops freezing I'll post an example of the two columns. I was going to at first, but it didn't look pretty because the annotations were wider than the page.
best, -j
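The step of pulling out the identifier before the first '^' does not need a spreadsheet. Below is a rough sketch, assuming each annotation is a single '^'-delimited string and "0" marks an empty cell. The function name protein_ids and the sample annotations are hypothetical and not taken from the real data.

```python
def protein_ids(annotations, sep="^", empty="0"):
    """Return the text before the first separator for each non-empty annotation."""
    ids = []
    for annotation in annotations:
        if annotation == empty:
            continue  # nothing to extract from an empty cell
        # maxsplit=1 stops at the first '^'; the rest of the string is ignored
        ids.append(annotation.split(sep, 1)[0])
    return ids


# Made-up annotations, only to show the shape of the input.
example = ["PROT_A^field2^field3", "0", "PROT_B^field2"]
ids = protein_ids(example)
print("\n".join(ids))
```

An annotation without any '^' is returned whole, which may or may not be what is wanted for the downstream list.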
This is messy, and you're giving your list the same name as the handle you open your file with.
Try this:
Thank you, that is much cleaner.
Finally LibreOffice unfroze (wish I could force it to use all 4 CPUs)!
Here is an example of the columns: