awk field with different separator
8.9 years ago
sacha ★ 2.4k

Hi,

Could you suggest a fast way to filter this data:

chr1    43  1000    gene_name=boby  gene_type=trucA
chr2    44  1000    gene_name=natt  gene_type=trucB  
chr3    45  1000    gene_name=alurika   gene_type=trucC

To :

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

CORRECTION: The original data actually looks like this:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34
awk oneliner • 2.8k views

What did you try so far? Is this really a bioinformatics question? What is the logic between input and output change? It looks rather random to me, e.g. "..trucB chr3 45 1000.." becomes "..trucB chr1 45 1000.."?


I just answered it, but I agree with you. Regarding chr3 -> chr1, I think it's just a typo!

8.9 years ago
sed -e 's/gene_name=//g' -e 's/gene_type=//g' file > file2
8.9 years ago
sacha ★ 2.4k

Thanks for your reply!

But I just discovered that awk supports regular expressions for the field separator (FS). So this works too:

cat test | awk 'BEGIN{FS="\t|="} {print $1,$2,$3,$5,$7}'
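
For anyone who wants to try the regex field separator without a file, here is a minimal, self-contained sketch. It assumes the columns are tab-separated (the post renders them as spaces) and adds OFS="\t" so the output is tab-separated as well; the variable names `sample` and `result` are only for illustration.

```shell
# Two rows from the first example in the question, assumed tab-separated.
sample=$(printf 'chr1\t43\t1000\tgene_name=boby\tgene_type=trucA\nchr2\t44\t1000\tgene_name=natt\tgene_type=trucB\n')

# FS is a regular expression: split on a tab OR an equals sign.
# Fields become: $1 chr, $2 start, $3 end, $4 "gene_name", $5 name, $6 "gene_type", $7 type.
result=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "\t|="; OFS = "\t" } { print $1, $2, $3, $5, $7 }')

printf '%s\n' "$result"
```

Note that this only fits the first version of the data, where each key=value pair sits in its own tab-separated column.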

Double quotes just left hanging. That makes me sad.


Let's make John happy :D


I think you still do not get the desired output with the awk command you are showing, but with sed you do get the desired output as stated in your original question. And yes, the double quotes are not closed.


Sure, this would take care of any pattern after a tab and before "=". My answer is valid if you only want to replace these two strings.

8.9 years ago
GenoMax 148k

Try

$ sed -e 's/;/\ /g' your_file | sed -e 's/=/\ /g' | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'

Removing redundancy

$ sed -e 's/;/\ /g' -e 's/=/\ /g' your_file | awk -F " " '{print $1"\t"$2"\t"$3"\t"$6"\t"$8}'
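
The sed step can probably be dropped altogether by letting awk split on the extra delimiters itself. The following is only a sketch under the assumption that the corrected data has exactly the layout shown in the question (five whitespace-separated columns, with gene_name first and gene_type second in the last one); `sample` and `result` are illustrative names.

```shell
# Two rows of the corrected data from the question, assumed tab-separated.
sample=$(printf 'chr1\t43\t1000\tTEST\tgene_name=boby;gene_type=trucA;foo=34\nchr3\t45\t1000\tPASS\tgene_name=alurika;gene_type=trucC;foo=34\n')

# Split on any run of spaces, tabs, ";" or "=" so keys and values become fields:
# $4 TEST, $5 gene_name, $6 boby, $7 gene_type, $8 trucA, $9 foo, $10 34
result=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "[ \t;=]+"; OFS = "\t" } { print $1, $2, $3, $6, $8 }')

printf '%s\n' "$result"
```

Like the sed version, this relies on field positions, so it would pick the wrong values if the order of the key=value pairs changed.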

Perfect, I was about to write this. In any case, the OP should think about why such formatting is required. I believe this is a VCF file listing variant names with positions; in that case, pre-filtering should be done to keep only the rows whose filter column contains the string PASS. The command would then be:

grep "PASS" file.txt | sed -e 's/;/\ /g' -e 's/=/\ /g' | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$6,$8}' > file_flt.txt

Otherwise @genomax2 is correct about what you need.
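
One caveat with the pre-filtering idea: grep "PASS" matches the string anywhere on the line, not only in the fourth column. A hedged alternative is to test the column inside awk. In the sketch below the first row is made up (a hypothetical gene name containing "PASS") purely to show the difference; `sample`, `grep_hits` and `result` are illustrative names.

```shell
# Row 1 is hypothetical: column 4 is TEST, but the gene name contains "PASS".
# Row 2 is taken from the question.
sample=$(printf 'chr1\t43\t1000\tTEST\tgene_name=PASSenger;gene_type=trucA;foo=34\nchr3\t45\t1000\tPASS\tgene_name=alurika;gene_type=trucC;foo=34\n')

# grep counts both rows, because it looks at the whole line.
grep_hits=$(printf '%s\n' "$sample" | grep -c "PASS")

# awk keeps a row only when column 4 is exactly "PASS".
result=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = "[ \t;=]+"; OFS = "\t" } $4 == "PASS" { print $1, $2, $3, $4, $6, $8 }')

printf 'grep matched %s rows\n' "$grep_hits"
printf '%s\n' "$result"
```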


Can I just say, I love the new syntax highlighting going on here :D

(Particularly how Istvan has coloured the popular bioinformatics program names red. Very nice touch.)

8.9 years ago
sacha ★ 2.4k

OK, but my example was actually incomplete. There is more data:

chr1    43  1000    TEST   gene_name=boby;gene_type=trucA;foo=34
chr2    44  1000    TRUC  gene_name=natt;gene_type=trucB;foo=34  
chr3    45  1000    PASS  gene_name=alurika;gene_type=trucC;foo=34

How do I get:

chr1  43  1000 boby  trucA
chr1  44  1000 natt  trucB
chr1  45  1000  alurika  trucC

You should really take this to heart: when you don't fully know your data formatting, never use a regex. Regexes are great for grabbing things, but they are a really bad idea for data manipulation (not to mention that they're also slow).
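
One way to act on this advice is to look the values up by key instead of by position, using plain string splitting rather than a regex. This is only a sketch: it assumes tab-separated columns with the key=value list in column 5, and the first sample row has its attributes deliberately reordered (a hypothetical variation, not from the question) to show that the order no longer matters.

```shell
# Row 1: attributes reordered on purpose (hypothetical). Row 2: as in the question.
sample=$(printf 'chr1\t43\t1000\tTEST\tfoo=34;gene_type=trucA;gene_name=boby\nchr2\t44\t1000\tTRUC\tgene_name=natt;gene_type=trucB;foo=34\n')

result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = "\t"; OFS = "\t" }
{
    # Clear the lookup table, then split column 5 into key=value pairs.
    split("", attr)
    n = split($5, pairs, ";")
    for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")          # position of the first "="
        if (eq > 0) {
            attr[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
        }
    }
    # Pick the values by name, not by field number.
    print $1, $2, $3, attr["gene_name"], attr["gene_type"]
}')

printf '%s\n' "$result"
```

If a key is missing on some line, the corresponding output column is simply empty, which is easier to spot than a silently shifted field.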


Yes, I understand the feeling, so I updated my comment with my understanding of the data, asked the OP what result they are actually looking for, and modified the command line accordingly.


With this?

sed 's/=/\t/g' file.txt | sed 's/;/\t/g' | awk '{print $1,$2,$3,$6,$8}'

One clarification: your input file has 3 chromosomes, but the output file has only one (chr1). Is that what you really need?


I believe it is a typo, since it does not make sense to change everything to chr1.
