I have two text files that have lines like this: NC_013520.1 RefSeq gene 2229 3341 . + . ID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002
I want to extract the gene IDs of all genes and then compare them with another file that I have. I wrote this script, but it doesn't treat whitespace as a delimiter!
awk FS= "{\t ;:}" '($3 == "gene") && ($11=="Dbxref=GeneID") {print $1,$4,$5,$7,$12} ($3=="gene") && ($12=="Dbxref=GeneID") {print $1,$4,$5,$7,$13}' gff.gff
Output:
NC_013520.1 2131416 2131550 - 8637363
I can still use the output, but I want to learn how to also treat spaces as delimiters!
Also, once I have this output and a similar output from the other file, I want to compare their fields, for example to find the GeneIDs that occur in both files.
I think
FS= "{\t ;:}"
will not compile; it should be -F"[\t ;:]"
or -v FS="[\t ;:]"
(I don't know whether your script will do what you expect after this change.) Also, if you split everything with awk, you assume that each line has the same number of fields in the same order. That is fine for the first 8 columns of the GFF, but for the "attribute" column (column 9) there is no guarantee that all lines have the same format (maybe in your case it's OK). I would prefer to look explicitly for the field you are interested in; I think this is what shenwei356's perl option does.
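To illustrate what "look explicitly for the field" could mean in awk, here is a minimal sketch, not a tested pipeline. It assumes the real file is tab-separated (as GFF normally is) and that the ID always appears as `GeneID:` followed by digits somewhere in column 9. The file names, the `extract_gene_ids` helper, and the contents of `b.gff` (including the ID `1111111`) are invented for the example.

```shell
# Hypothetical sample data; a.gff holds the line from the question, b.gff is made up.
printf 'NC_013520.1\tRefSeq\tgene\t2229\t3341\t.\t+\t.\tID=gene1;Name=Vpar_0002;Dbxref=GeneID:8635442;gbkey=Gene;locus_tag=Vpar_0002\n' > a.gff
printf 'NC_013520.1\tRefSeq\tgene\t2229\t3341\t.\t+\t.\tID=gene1;Dbxref=GeneID:8635442;gbkey=Gene\n' > b.gff
printf 'NC_013520.1\tRefSeq\tgene\t4000\t5000\t.\t-\t.\tID=gene2;Dbxref=GeneID:1111111;gbkey=Gene\n' >> b.gff

# Split on tabs only, then search column 9 for the GeneID wherever it sits.
extract_gene_ids() {
    awk -F'\t' -v OFS='\t' '
        $3 == "gene" && match($9, /GeneID:[0-9]+/) {
            # RSTART and RLENGTH describe the match; skip the 7-character "GeneID:" prefix.
            print $1, $4, $5, $7, substr($9, RSTART + 7, RLENGTH - 7)
        }' "$1"
}

extract_gene_ids a.gff > a.tsv
extract_gene_ids b.gff > b.tsv

# Print the rows of b.tsv whose GeneID (column 5) also occurs in a.tsv.
awk -F'\t' 'NR == FNR { seen[$5]; next } $5 in seen' a.tsv b.tsv
```

Because the attribute column is searched with `match()` rather than split into numbered fields, it does not matter whether `Dbxref` is the second or third attribute, so the two near-identical rules in the original command collapse into one. The last command uses the common `NR == FNR` idiom to load the first file into an array before filtering the second; it assumes the first file is not empty.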