Question

Parsing gff files using python and awk

0

Entering edit mode

7.7 years ago

Fatima ▴ 1000

I have two text file that have lines like this: NC_013520.1 RefSeq gene 2229 3341 . + . ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

I want to extract gene ids of all genes, and then compare it with another file that I have. I wrote this script but it doesn't consider the white spaces!

awk FS= "{\t ;:}" '($3 == "gene") && ($11=="Dbxref=GeneID") {print $1,$4,$5,$7,$12} ($3=="gene") && ($12=="Dbxref=GeneID") {print $1,$4,$5,$7,$13}' gff.gff

Output:

NC_013520.1 2131416 2131550 - 8637363

I can still use the output, but I want to learn how to also consider spaces as delimiter!

Also when I have this output ready and a similar output I want to compare their fields for example to find similar GeneIDs in both files.

awk • 4.6k views

ADD COMMENT • link updated 7.7 years ago by WouterDeCoster 47k • written 7.7 years ago by Fatima ▴ 1000

1

Entering edit mode

I think FS= "{\t ;:}" will not compile, it should be -F"[\t ;:]" or -v FS="[\t ;:]" (I don't know if your script will do what you expect after this).

Also, if you use awk you assume that each line has the same number of fields in the same order. This is fine for the first 8 columns of the gtf, but for the "attribute" column there is no guarantee that lines have the same format (maybe in your case it's ok). I would prefer to explicitly look for the field you are interested, I think this is what shenwei356's perl option does.

ADD REPLY • link 7.7 years ago by dariober 15k

score 3 · Answer 1 · 2017-03-21

3

Entering edit mode

7.7 years ago

shenwei356 8.7k

It's not easy using awk, try Perl:

$ cat 1.gff | perl -ane '$F[8]=~/GeneID:(\d+)/; print "$F[0]\t$F[3]\t$F[4]\t$F[6]\t$1\n";'
NC_013520.1     2229    3341    +       8635442

ADD COMMENT • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

Thanks, what if I want to extract the gene name? I tried $F[8]=~/Name:(\c+)/ didn't work.

ADD REPLY • link 7.7 years ago by Fatima ▴ 1000

1

Entering edit mode

shoud be \w not \c for letters or numbers. learn more about the regular expression

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

score 2 · Answer 2 · 2017-03-21

How about retrieving the GeneID as a new column and discard it after downstream analysis.

I'm not familiar with awk, but you can try csvtk.

Sample data:

$ cat 1.gff 
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

Retrieve the GeneID using regular expression as a new column (10th column).

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)'
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002    8635442

Select columns you want:

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' | csvtk cut -H -t -f 1,4,5,7,10
NC_013520.1     2229    3341    +       8635442

Do somethings....

Remove the last (10th) column:

$ cat 10-columns | csvtk -H -t cut -f -10

You may need use csvtk inter to find common GeneID of two or more files.

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' > 1.m.gff
$ csvtk inter -H -t -f 10 1.m.gff 2.m.gff
8635442