Remove rows that do not contain a specific word
2.2 years ago

Hi,

I have a table like this:

>Feature gnl|XXX|IFEJKLFI_1 gnl|XXX|IFEJKLFI_1
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            protein_id  gnl|XXX|IFEJKLFI_00001 gnl|XXX|IFEJKLFI_00001
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            protein_id  gnl|XXX|IFEJKLFI_00002 gnl|XXX|IFEJKLFI_00002
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
            protein_id  gnl|XXX|IFEJKLFI_00419 gnl|XXX|IFEJKLFI_00419
>Feature gnl|XXX|IFEJKLFI_2 gnl|XXX|IFEJKLFI_2
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423
            protein_id  gnl|XXX|IFEJKLFI_00423 gnl|XXX|IFEJKLFI_00423
>Feature gnl|XXX|IFEJKLFI_3 gnl|XXX|IFEJKLFI_3
>Feature gnl|XXX|IFEJKLFI_4 gnl|XXX|IFEJKLFI_4
>Feature gnl|XXX|IFEJKLFI_5 gnl|XXX|IFEJKLFI_5

I want to remove the rows that do not contain locus_tag.

I also want to remove the >Feature rows that have nothing beneath them, so that the table looks like this:

>Feature gnl|XXX|IFEJKLFI_1 gnl|XXX|IFEJKLFI_1
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
>Feature gnl|XXX|IFEJKLFI_2 gnl|XXX|IFEJKLFI_2
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423

Could you please help me with that using awk or any other method?

Thanks!!

2.2 years ago
Matt ▴ 10

I think grep would do what you need.

On a Unix command line, try cat file.txt | grep -v protein_id > output_file.txt to drop every line that contains "protein_id".

Alternatively, if you want to get rid of more than protein_id, you could try cat file.txt | grep "^>Feature\|locus_tag" > output_file.txt to keep only lines that start with ">Feature" or contain "locus_tag".


Hi, Thanks for the quick reply!

This command removed protein_id, but the other rows that do not contain locus_tag were kept as they are.

To elaborate, the output looked like this:

>Feature gnl|XXX|IFEJKLFI_1 gnl|XXX|IFEJKLFI_1
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            locus_tag   IFEJKLFI_00002 IFEJKLFI_00002
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
            locus_tag   IFEJKLFI_00419 IFEJKLFI_00419
>Feature gnl|XXX|IFEJKLFI_2 gnl|XXX|IFEJKLFI_2
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423
            locus_tag   IFEJKLFI_00423 IFEJKLFI_00423
>Feature gnl|XXX|IFEJKLFI_3 gnl|XXX|IFEJKLFI_3
>Feature gnl|XXX|IFEJKLFI_4 gnl|XXX|IFEJKLFI_4
>Feature gnl|XXX|IFEJKLFI_5 gnl|XXX|IFEJKLFI_5

I need to remove the other >Feature lines that do not contain locus_tag.


Ah, OK, that's a bit trickier. Without a file to test with, it's hard to write reliable code. I'll assume that there is a tab before "locus_tag" for this bit of code, so you may need to tweak it if there is other whitespace instead:

cat file.txt \
    | grep "^>Feature\|locus_tag" \
    | perl -pe 's/\n\tlocus/\tlocus/g' \
    | grep locus \
    | perl -pe 's/\tlocus/\n\tlocus/g' \
    > output.txt

The first perl line is meant to move the "locus_tag" lines onto the same line as the ">Feature" line they belong to. Then we keep only the lines that still contain "locus", so that ">Feature" lines without anything else get dropped. Finally, I reuse a perl one-liner to put the locus_tag lines back after their ">Feature" lines.

Like I said, without a test file I can't be 100% sure this will work but hopefully that will give you some code to tweak to get what you want.

Note that there must be no spaces after the "\" at the end of each line, or else the backslash doesn't work properly.

Good luck!

Matt


Hi, Thanks for the reply.

The output looked like this (see below), without the >Feature lines, and I need to know which locus_tag belongs to which Feature.

Also, I can send you part of the file, if that is possible.

locus_tag       IFEJKLFI_00001  IFEJKLFI_00001
locus_tag       IFEJKLFI_00001  IFEJKLFI_00001
locus_tag       IFEJKLFI_00002  IFEJKLFI_00002
locus_tag       IFEJKLFI_00002  IFEJKLFI_00002
locus_tag       IFEJKLFI_00419  IFEJKLFI_00419
locus_tag       IFEJKLFI_00419  IFEJKLFI_00419
locus_tag       IFEJKLFI_00423  IFEJKLFI_00423
locus_tag       IFEJKLFI_00423  IFEJKLFI_00423

ok, try #3:

cat file.txt \
    | grep "^>Feature\|locus_tag" \
    | perl -pe 's/\n/\t/g' \
    | perl -pe 's/\t>Feature/\n>Feature/g' \
    | grep locus \
    | perl -pe 's/\t\s+locus/\n            locus/g' \
    > outfile.txt

Try that. Double check that I have the number of spaces right in the last perl -pe call.
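Since the question mentioned awk, here is a minimal awk sketch of the same idea, in case the pipeline above needs too much tweaking. It assumes every header line starts with ">Feature" and every row you want to keep contains the text "locus_tag". The function name filter_features is just a hypothetical helper for illustration, and the sample rows are copied from the question. I have not run it on the real file.

```shell
# Hypothetical helper: print a ">Feature" header only if a locus_tag line follows it.
filter_features() {
    awk '
        /^>Feature/ { header = $0; next }   # hold the header back for now
        /locus_tag/ {
            if (header != "") {             # first locus_tag under this header
                print header
                header = ""
            }
            print                           # keep the locus_tag row unchanged
        }
    ' "$@"
}

# Small sample in the same layout as the question, piped in on stdin.
printf '%s\n' \
    '>Feature gnl|XXX|IFEJKLFI_1 gnl|XXX|IFEJKLFI_1' \
    '            locus_tag   IFEJKLFI_00001 IFEJKLFI_00001' \
    '            protein_id  gnl|XXX|IFEJKLFI_00001 gnl|XXX|IFEJKLFI_00001' \
    '>Feature gnl|XXX|IFEJKLFI_3 gnl|XXX|IFEJKLFI_3' \
    | filter_features
```

On a real file it would be called as filter_features file.txt > outfile.txt. Because the original lines are printed untouched, the indentation does not need to be rebuilt afterwards, so it should not matter whether the file uses tabs or spaces.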

Matt


It worked.

Thanks a bundle!!!!!!!
