File processing, arranging columns
7.0 years ago
AP ▴ 80

Hello everyone,

I have a tab-delimited file, file1, in the following format:

g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
pfam    PF17    Zinc-binding dehydrogenase  
pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

I want to rearrange this file so that each line beginning with pfam also carries the gene ID (g3, g4, etc.) from the preceding gene line. The output file should also be tab-delimited and look like this:

g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
g3  pfam    PF15    FAD binding domain
g4  pfam    PF16    RTA1 like protein
g4  pfam    PF17    Zinc-binding dehydrogenase
g4  pfam    PF18    major facilitator superfamily
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

Many thanks in advance.

Ambika

awk linux bash • 2.0k views
7.0 years ago
5heikki 11k

Assuming the first field is missing entirely on the pfam lines (they have only three fields):

awk 'BEGIN{OFS=FS="\t";first=""}{if(NF==4){first=$1;print $0}else{print first,$0}}' input > output

If the first field is present but empty (the pfam lines start with a tab):

awk 'BEGIN{OFS=FS="\t";first=""}{if($1!=""){first=$1; print $0}else{print first,$2,$3,$4}}' input > output
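For comparison, here is a minimal Python sketch of the same fill-down idea that handles both layouts in one pass. It is not part of the awk answer; the function name fill_gene_ids is made up for illustration, and it assumes that no gene ID is literally "pfam". It also strips trailing whitespace before splitting, on the assumption that some pfam lines may end in a stray tab (the sample in the question seems to show trailing whitespace, though the rendering makes that hard to confirm), in which case a field-count test could misclassify them.

```python
def fill_gene_ids(lines):
    """Return rows where pfam-only lines inherit the gene ID from the row above."""
    previous = ""  # last gene ID seen so far
    filled = []
    for line in lines:
        # Drop the newline and any trailing tabs/spaces, then split on tabs.
        fields = line.rstrip().split("\t")
        if fields == [""]:
            continue  # blank line
        if fields[0] == "":
            # Layout 2: the first field exists but is empty.
            fields[0] = previous
        elif fields[0] == "pfam":
            # Layout 1: the gene ID column is missing altogether
            # (assumes no gene is actually named "pfam").
            fields.insert(0, previous)
        else:
            previous = fields[0]
        filled.append("\t".join(fields))
    return filled


sample = [
    "g3\tpfam\tPF14\tglycosyl hydrolase\n",
    "pfam\tPF15\tFAD binding domain\t\n",
    "g4\tpfam\tPF16\tRTA1 like protein\n",
    "\tpfam\tPF17\tZinc-binding dehydrogenase\n",
]
for row in fill_gene_ids(sample):
    print(row)
```

To run it on a real file, something like fill_gene_ids(open("input")) should work, since the function only needs an iterable of lines.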

Nice one-liner. One question: do you really need first="" in the BEGIN block?


Thank you. I will try that.

7.0 years ago

Output using sed and awk:

$ sed 's/^pfam/\tpfam/g' test.txt |awk -F "\t" -v OFS="\t" '{if($1=="") {$1=previous} previous=$1}1'  
g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
g3  pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
g4  pfam    PF17    Zinc-binding dehydrogenase  
g4  pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter

input (tab separated):

$ cat test.txt 
g1  pfam    PF12    ABC transporter
g2  pfam    PF13    transcription factor
g3  pfam    PF14    glycosyl hydrolase
pfam    PF15    FAD binding domain  
g4  pfam    PF16    RTA1 like protein
pfam    PF17    Zinc-binding dehydrogenase  
pfam    PF18    major facilitator superfamily   
g5  pfam    PF19    short chain dehydrogenase
g6  pfam    PF20    ABC transporter
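Whichever command is used, it may be worth confirming that every output row ended up with four non-empty tab-separated fields. The following is only a sketch of such a check: find_bad_rows is a hypothetical helper name, and the expected field count of four is an assumption taken from the sample data in the question.

```python
def find_bad_rows(lines, expected_fields=4):
    """Return 1-based numbers of rows with the wrong field count or an empty field."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != expected_fields or any(f.strip() == "" for f in fields):
            bad.append(number)
    return bad


rows = [
    "g3\tpfam\tPF14\tglycosyl hydrolase\n",
    "g3\tpfam\tPF15\tFAD binding domain\t\n",  # trailing tab: five fields
    "\tpfam\tPF16\tRTA1 like protein\n",        # gene ID still empty
]
print(find_bad_rows(rows))  # prints [2, 3]
```

An empty result means every row has the expected shape; any listed row number points at a line that still needs attention, for example one with a trailing tab.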

This looks really neat. I'm going to try it and keep you posted. Thanks!


cpad0112, thank you so much, it worked!

7.0 years ago
Hussain Ather ▴ 990

This should work in Python, as long as you replace "input.txt" and "output.txt" with the paths to your input file and desired output file.

prev_g = ""  # last gene ID seen; stays empty if the file starts with a pfam line

# "with" closes both files automatically, even if an error occurs
with open("input.txt") as f, open("output.txt", "w") as o:
    for line in f:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "pfam":
            # gene ID is missing: reuse the one from the previous gene line
            fields.insert(0, prev_g)
        else:
            prev_g = fields[0]
        # first three columns as-is; the description words are rejoined with spaces
        o.write("\t".join(fields[:3] + [" ".join(fields[3:])]) + "\n")

I have never used Python, but I can give it a shot. Thank you.
