Combining two columns in a tab-delimited file where there is a zero in one
1
0
Entering edit mode
7.3 years ago
jvire1 ▴ 10

Hi all,

I am currently trying to fill in annotations from a BLASTP of the same database into rows that are 0 in the BLASTX. The columns look like this:

BLASTX     BLASTP
annot1     annot1
annot 2    0
0          annot3

In this example I want "annot3" to be moved under the BLASTX column, so that the column contains origional BLASTX annotations and where empty gets filled by the BLASTP annotation like so:

BLASTX     BLASTP
annot1     annot1
annot 2    0
annot3     annot3

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

I have been trying to do this manually LibreOffice, but with 160,000 sequences I realized the only practical way would be to use awk or python scripting.

RNA-Seq • 2.0k views
ADD COMMENT
0
Entering edit mode

The ultimate reason for this is so that I can make a list of identifiers for downstream purposes.

Can you show an example of the annotations, to see delimiters and possible ids?

ADD REPLY
0
Entering edit mode

Sure, LibreOffice is frozen at the moment the two columns are TAB delimited and the annotations within are delimited with '^'. The identifier I need is the one before the first ^, however my plan was to take the new list of annotations and open in LibreOffice, specifying it as '^' delimited and then copying the first column (the protein ID) into a new list, and then use this script:

   with open('acc_list.txt', 'r') as f:
        acc = [line.strip() for line in f]

    with open('PlantTFDB_ALL_TF_pep.fas', 'r') as f:
        for line in f:
            if line.startswith(">"):
                if line.strip().split('>')[1].split(' ')[0] in acc:
                    print line.strip().split('|')[1]

To parse out the gene name.

Then I was going to use this script to get gene family counts:

    #!/usr/bin/env python
sourcefile = "gene_name.txt"

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]
for item in sorted(set(germ)):
    print item.title(), germ.count(item)

As soon LibreOffice stop freezing I'll post an example of the two columns. I was going to at first but it didn't look pretty because the annotations were wider than the page.

best, -j

ADD REPLY
1
Entering edit mode

This is messy, and you're naming your list as the same as what you're opening your file as.

with open(sourcefile, "r") as germ:
    germ = [item.lower().replace("\n", "") for item in germ.readlines()]

Try this:

with open('gene_name.txt', 'r') as source:
       germ = [line.strip().lower() for line in source]
for i in set(germ):
        print i.title(), germ.count(i)
ADD REPLY
0
Entering edit mode

Thank you, that is much cleaner.

ADD REPLY
0
Entering edit mode

Finally LibreOffice unfroze (wish I could force it to use all 4 cpu)!

Here is an example of the columns:

PlantTFDB_ALL_TF_pep_BLASTX PlantTFDB_ALL_TF_pep_BLASTP
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:6e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:9-95,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:157-414,H:5-92^44.32%ID^E:9e-17^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:53-138,H:5-92^44.32%ID^E:8e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:46-306,H:5-93^44.94%ID^E:1e-16^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:16-102,H:5-93^44.94%ID^E:2e-17^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:567-307,H:5-93^44.94%ID^E:2e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:488-228,H:5-93^44.94%ID^E:1e-15^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:125-211,H:5-93^44.94%ID^E:2e-16^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:82-360,H:1-93^84.95%ID^E:5e-51^.^.  Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:681-403,H:1-93^84.95%ID^E:1e-50^.^. Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^84.95%ID^E:2e-53^.^.
Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1603-1881,H:1-93^83.87%ID^E:2e-46^.^.   Zpz_sc04830.1.g00020.1.sm.mk^Zpz_sc04830.1.g00020.1.sm.mk^Q:1-93,H:1-93^83.87%ID^E:2e-52^.^.
KZV22762.1^KZV22762.1^Q:824-1537,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:811-1524,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
KZV22762.1^KZV22762.1^Q:721-1434,H:1151-1396^40.96%ID^E:3e-37^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:69-199,H:1-128^54.14%ID^E:3e-43^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:7-387,H:8-116^37.8%ID^E:3e-12^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-129,H:8-116^37.01%ID^E:1e-14^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:853-551,H:27-118^37.62%ID^E:5e-14^.^.   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:3-103,H:27-118^37.62%ID^E:3e-15^.^.
0   Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:27-128,H:32-118^32.35%ID^E:7e-06^.^.
Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:703-311,H:1-126^83.97%ID^E:7e-64^.^.    Zpz_sc04699.1.g00010.1.am.mk^Zpz_sc04699.1.g00010.1.am.mk^Q:174-304,H:1-126^83.97%ID^E:6e-68^.^.
ADD REPLY
3
Entering edit mode
7.3 years ago
kashiff007 ★ 1.9k

If both column are in different files, first convert both your file in txt. Then:

paste blastx.txt blastp.txt | awk '{if($1=="0")print$2"\t"$2; else print}' > ur_output.txt

If in same file:

awk '{if($1=="0")print$2"\t"$2; else print}' input.txt > ur_output.txt
ADD COMMENT
0
Entering edit mode

Thank you so much for this one liner and to all those who helped. Combined these scripts have allowed me to achieve my goal:

Bhlh    533
Myb_Related 397
C2H2    355
Nac 352
C3H 337
Erf 297
Myb 265
Wrky    263
B3  249
Bzip    217
Far1    196
G2-Like 167
Gras    157
M-Type_Mads 134
Trihelix    134
Arf 128
Mikc_Mads   123
Hd-Zip  120
Lbd 115
Gata    91
ADD REPLY

Login before adding your answer.

Traffic: 1959 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6