Parsing gff files using python and awk
2
0
Entering edit mode
7.7 years ago
Fatima ▴ 1000

I have two text file that have lines like this: NC_013520.1 RefSeq gene 2229 3341 . + . ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

I want to extract gene ids of all genes, and then compare it with another file that I have. I wrote this script but it doesn't consider the white spaces!

awk FS= "{\t ;:}" '($3 == "gene") && ($11=="Dbxref=GeneID") {print $1,$4,$5,$7,$12} ($3=="gene") && ($12=="Dbxref=GeneID") {print $1,$4,$5,$7,$13}' gff.gff

Output:

NC_013520.1 2131416 2131550 - 8637363

I can still use the output, but I want to learn how to also consider spaces as delimiter!

Also when I have this output ready and a similar output I want to compare their fields for example to find similar GeneIDs in both files.

awk • 4.6k views
ADD COMMENT
1
Entering edit mode

I think FS= "{\t ;:}" will not compile, it should be -F"[\t ;:]" or -v FS="[\t ;:]" (I don't know if your script will do what you expect after this).

Also, if you use awk you assume that each line has the same number of fields in the same order. This is fine for the first 8 columns of the gtf, but for the "attribute" column there is no guarantee that lines have the same format (maybe in your case it's ok). I would prefer to explicitly look for the field you are interested, I think this is what shenwei356's perl option does.

ADD REPLY
3
Entering edit mode
7.7 years ago

It's not easy using awk, try Perl:

$ cat 1.gff | perl -ane '$F[8]=~/GeneID:(\d+)/; print "$F[0]\t$F[3]\t$F[4]\t$F[6]\t$1\n";'
NC_013520.1     2229    3341    +       8635442
ADD COMMENT
0
Entering edit mode

Thanks, what if I want to extract the gene name? I tried $F[8]=~/Name:(\c+)/ didn't work.

ADD REPLY
1
Entering edit mode

shoud be \w not \c for letters or numbers. learn more about the regular expression

ADD REPLY
2
Entering edit mode
7.7 years ago

How about retrieving the GeneID as a new column and discard it after downstream analysis.

I'm not familiar with awk, but you can try csvtk.

Sample data:

$ cat 1.gff 
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002

Retrieve the GeneID using regular expression as a new column (10th column).

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)'
NC_013520.1     RefSeq  gene    2229    3341    .       +       .       ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002    8635442

Select columns you want:

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' | csvtk cut -H -t -f 1,4,5,7,10
NC_013520.1     2229    3341    +       8635442

Do somethings....

Remove the last (10th) column:

$ cat 10-columns | csvtk -H -t cut -f -10

You may need use csvtk inter to find common GeneID of two or more files.

$ cat 1.gff |  csvtk mutate -H -t -f 9 -p 'GeneID:(\d+)' > 1.m.gff
$ csvtk inter -H -t -f 10 1.m.gff 2.m.gff
8635442
ADD COMMENT

Login before adding your answer.

Traffic: 2337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6