How to filter GTF file and only print lines where the tag is a certain value?
5.2 years ago
Tails ▴ 80

I'm trying to print only the lines in a GTF file whose "tag" attribute is "appris_principal". If no line for a given gene carries that tag, the lines tagged "appris_candidate_longest" should be selected for that gene instead.

I think I can code it up in Python, but there must be a way to do it in awk?

GTF awk • 3.7k views

Why not grep? That might be the easiest and the quickest.


Oh yeah, let's not forget grep. But I'm not sure how to express the condition: if appris_principal doesn't exist in this line, check whether appris_candidate_longest exists. If neither, don't print.


Extract matching lines:

grep -e "appris_principal" ...

Extract non-matching lines:

grep -e "appris_principal" -v ...

... ...

Check for lines that have appris_candidate_longest AND NOT appris_principal:

grep -e "appris_principal" -v ... | grep -e "appris_candidate_longest"
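For comparison, the two grep passes can be folded into a single line-level check. This is only a sketch: the function names `keep_line` and `filter_lines` are made up for illustration, and it uses plain substring matching just like grep does, so it would also match suffixed tags such as "appris_principal_1". It works per line, not per gene.

```python
def keep_line(line):
    """Return True if the line should be printed (line-level rule only)."""
    # Skip header/comment lines, which start with '#'.
    if line.startswith("#"):
        return False
    # First choice: the line mentions appris_principal.
    if "appris_principal" in line:
        return True
    # Fallback: the line mentions appris_candidate_longest.
    return "appris_candidate_longest" in line


def filter_lines(lines):
    """Keep only the lines accepted by keep_line, preserving order."""
    return [line for line in lines if keep_line(line)]
```

To run it on a file, something like `sys.stdout.writelines(filter_lines(open("your.gtf")))` should do, with `your.gtf` standing in for the real path.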

Honestly, I would just do it in Python. Use csv.DictReader. It shouldn't take more than a dozen lines.
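A rough sketch of what the csv.DictReader route could look like. A GTF file has no header row, so the column names have to be passed in explicitly; the names in `GTF_COLUMNS` and the helpers `read_gtf` and `tags` are my own choices for illustration, not anything standard. Quoting is switched off so the double quotes inside the attribute column are left alone.

```python
import csv
import io
import re

# Column names are an assumption; GTF files carry no header line.
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def read_gtf(handle):
    """Yield one dict per non-comment GTF line."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, fieldnames=GTF_COLUMNS,
                            delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def tags(attribute):
    """Return every value given as tag "..." in the attribute column."""
    return re.findall(r'tag "([^"]+)"', attribute)


# Tiny in-memory example instead of a real file.
example = io.StringIO(
    "#comment line\n"
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\t'
    'gene_name "G1"; tag "basic"; tag "appris_principal_1";\n'
)
for row in read_gtf(example):
    print(row["feature"], tags(row["attribute"]))
```

From there, the filtering is a matter of checking the list returned by `tags` for each row.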

5.2 years ago
ssb.pranav ▴ 10

Hi, since you want to try awk, I came up with this one-liner:

cat your_gtf_file | grep -v "#" | cut -f9 | cut -d";" -f1-18 | awk '{ if ($0~"appris_principal")print $0; else if ($0~"appris_candidate_longest")print $13}'

Let me know if this works!

Thanks


Thanks! That's almost exactly what I'm looking for, except that even when I change the "print $13" at the end to "print $0", I still don't get the entire line, only the 9th column. Doing the cut -f9 so early in the pipeline means the rest of the line has already been discarded, right? How do I get it to print the entire line?


Oh sorry, it's also missing the "for each gene" condition: the if and else if should apply per gene (for each unique string after the gene_name field), rather than per line.
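To make the per-gene rule concrete, here is a minimal Python sketch of what I mean. It assumes every relevant line has a gene_name "..." attribute and uses substring matching for the two tags; the function name `select_per_gene` is made up. Lines are grouped by gene in order of first appearance, so the output order may differ from the input if genes are interleaved.

```python
import re

GENE_RE = re.compile(r'gene_name "([^"]+)"')


def select_per_gene(lines):
    """For each gene keep the appris_principal lines; if it has none,
    keep its appris_candidate_longest lines instead."""
    buckets = {}  # gene name -> (principal lines, longest lines)
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment line
        match = GENE_RE.search(line)
        if match is None:
            continue  # no gene_name attribute, nothing to group by
        principal, longest = buckets.setdefault(match.group(1), ([], []))
        if "appris_principal" in line:
            principal.append(line)
        elif "appris_candidate_longest" in line:
            longest.append(line)
    selected = []
    # Dicts keep insertion order, so genes come out in first-seen order.
    for principal, longest in buckets.values():
        selected.extend(principal or longest)
    return selected
```

So a gene with both kinds of tagged lines only yields its appris_principal lines, and a gene with neither yields nothing.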


I am glad you tried. So you mean you want to print the whole line that satisfies the condition, including all the columns from the start?

If that's the case, this will work:

awk '{ if ($0~"appris_principal")print $0; else if ($0~"appris_candidate_longest")print $0}' your.gtf

I am not sure why you want to apply the condition per gene_name rather than per line.

Could you please post a sample input together with the expected output, so that I can get it the way you want?

Thanks
