Hi! I have a gene ID and a GFF file downloaded from IMG JGI. I want to make a subset of the GFF file containing all the genes on the same scaffold as the gene I have. The problem is that the values of the seqid field in this GFF file are not unique: several different scaffolds, each with different genes, share the same seqid, so I get extra records. For example, the GFF file looks like this (I added an extra delimiter (----) to emphasize that these are different scaffolds; the original file does not have it):
SRS014683_WUGC_scaffold_10924 CDS 206 349 . 1 0 ID=SRS014683_WUGC_scaffold_10924__gene_15568;locus_tag=SRS014683_WUGC_scaffold_10924__gene_15568;
-------------
SRS014683_WUGC_scaffold_10932 CDS 3 134 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15570;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15570;
SRS014683_WUGC_scaffold_10932 CDS 1318 1674 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15572;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15572;
SRS014683_WUGC_scaffold_10932 CDS 1934 2185 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15574;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15574;
SRS014683_WUGC_scaffold_10932 CDS 3753 4013 . -1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15576;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15576;
SRS014683_WUGC_scaffold_10932 CDS 4517 4741 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15578;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15578;
---------------
SRS014683_WUGC_scaffold_10926 CDS 2 385 . -1 0 ID=SRS014683_WUGC_scaffold_10926__gene_15569;locus_tag=SRS014683_WUGC_scaffold_10926__gene_15569;
---------------
SRS014683_WUGC_scaffold_10932 CDS 679 876 . -1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15571;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15571;
SRS014683_WUGC_scaffold_10932 CDS 1773 1937 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15573;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15573;
SRS014683_WUGC_scaffold_10932 CDS 2266 3480 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15575;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15575;
And I want a subset of this GFF for gene ID = SRS014683_WUGC_scaffold_10932__gene_15572, which is located on SRS014683_WUGC_scaffold_10932, but this seqid is not unique. The current implementation searches for matches on the ID field, then subsets all records that match on the seqid field, which gives extra occurrences. The expected output is only:
SRS014683_WUGC_scaffold_10932 CDS 3 134 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15570;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15570;
SRS014683_WUGC_scaffold_10932 CDS 1318 1674 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15572;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15572;
SRS014683_WUGC_scaffold_10932 CDS 1934 2185 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15574;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15574;
SRS014683_WUGC_scaffold_10932 CDS 3753 4013 . -1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15576;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15576;
SRS014683_WUGC_scaffold_10932 CDS 4517 4741 . 1 0 ID=SRS014683_WUGC_scaffold_10932__gene_15578;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15578;
So the question is, is there a way to avoid this problem? Maybe there are ready-made tools that do this?
Yes, you are right, it is ID. The problem is that I have 3000 files downloaded from IMG JGI, almost all of which have GFF files in this format. For example, IMG Genome ID = 7000000715. You can download the archive from their website; one of the files will be a .gff (link to the .gff on Google Drive). Open this file in a text editor and search for SRS014683_WUGC_scaffold_10932, and you will see that this seqid appears in two places in the file and contains information about different genes. Almost all of the files have this problem.
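One possible way around this, sketched in Python under a loud assumption: records belonging to one scaffold are stored as a single contiguous run of lines, and a seqid that reappears later in the file starts a different scaffold (this is what the example above suggests, but it has not been checked against all 3000 files). The function name `block_containing_gene` and the helper `seqid_of` are hypothetical, not part of any existing tool. The idea is to group consecutive lines by seqid and return only the run that contains the target ID, instead of every line with a matching seqid.

```python
import re
from itertools import groupby

# Matches the ID attribute, but not attributes that merely end in "ID".
ID_PATTERN = re.compile(r"(?:^|[\s;])ID=([^;\s]+)")


def seqid_of(line):
    """Return the first whitespace-delimited column (the seqid)."""
    return line.split(None, 1)[0]


def block_containing_gene(lines, gene_id):
    """Return the run of consecutive same-seqid records containing gene_id.

    Assumption: each scaffold occupies one contiguous run of lines, so a
    repeated seqid elsewhere in the file is treated as a separate block.
    Returns an empty list if the ID is not found.
    """
    # Skip blank lines and comment/directive lines such as "##gff-version 3".
    records = (
        ln.rstrip("\n")
        for ln in lines
        if ln.strip() and not ln.startswith("#")
    )
    # groupby only merges *adjacent* items with the same key, which is
    # exactly the "contiguous run" behaviour needed here.
    for _seqid, run in groupby(records, key=seqid_of):
        block = list(run)
        for rec in block:
            match = ID_PATTERN.search(rec)
            if match and match.group(1) == gene_id:
                return block
    return []
```

Usage would be along the lines of `block_containing_gene(open("genome.gff"), "SRS014683_WUGC_scaffold_10932__gene_15572")`, where `genome.gff` is a placeholder file name. If the contiguity assumption does not hold for some files, this approach will split a real scaffold into several blocks, so it is worth spot-checking a few files first.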