Question

Subset gff with non-unique seqid

4.1 years ago

hazirliver ▴ 10

Hi! I have a gene ID and a GFF file downloaded from IMG JGI. I want to make a subset of the GFF file containing all the genes on the same scaffold as the gene I have. The problem is that the values of the seqid field in this GFF file are not unique, that is, several scaffolds with different genes share the same seqid, so I get extra rows. For example, the GFF file looks like this (I added an extra delimiter (----) to emphasize that these are different scaffolds; the original file does not have it):

SRS014683_WUGC_scaffold_10924       CDS 206 349 .   1   0   ID=SRS014683_WUGC_scaffold_10924__gene_15568;locus_tag=SRS014683_WUGC_scaffold_10924__gene_15568;
-------------
SRS014683_WUGC_scaffold_10932       CDS 3   134 .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15570;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15570;
SRS014683_WUGC_scaffold_10932       CDS 1318    1674    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15572;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15572;
SRS014683_WUGC_scaffold_10932       CDS 1934    2185    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15574;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15574;
SRS014683_WUGC_scaffold_10932       CDS 3753    4013    .   -1  0   ID=SRS014683_WUGC_scaffold_10932__gene_15576;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15576;
SRS014683_WUGC_scaffold_10932       CDS 4517    4741    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15578;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15578;
---------------
SRS014683_WUGC_scaffold_10926       CDS 2   385 .   -1  0   ID=SRS014683_WUGC_scaffold_10926__gene_15569;locus_tag=SRS014683_WUGC_scaffold_10926__gene_15569;
---------------
SRS014683_WUGC_scaffold_10932       CDS 679 876 .   -1  0   ID=SRS014683_WUGC_scaffold_10932__gene_15571;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15571;
SRS014683_WUGC_scaffold_10932       CDS 1773    1937    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15573;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15573;
SRS014683_WUGC_scaffold_10932       CDS 2266    3480    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15575;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15575;

I want a subset of this GFF for gene ID = SRS014683_WUGC_scaffold_10932__gene_15572, which is located on SRS014683_WUGC_scaffold_10932, but this seqid is not unique. My current implementation searches for a match on the ID field, then keeps every row with the same seqid, which gives extra occurrences. The expected output is only

SRS014683_WUGC_scaffold_10932       CDS 3   134 .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15570;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15570;
SRS014683_WUGC_scaffold_10932       CDS 1318    1674    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15572;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15572;
SRS014683_WUGC_scaffold_10932       CDS 1934    2185    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15574;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15574;
SRS014683_WUGC_scaffold_10932       CDS 3753    4013    .   -1  0   ID=SRS014683_WUGC_scaffold_10932__gene_15576;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15576;
SRS014683_WUGC_scaffold_10932       CDS 4517    4741    .   1   0   ID=SRS014683_WUGC_scaffold_10932__gene_15578;locus_tag=SRS014683_WUGC_scaffold_10932__gene_15578;

So the question is, is there a way to avoid this problem? Maybe there are ready-made tools that do this?
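
For reference, the selection described above (keep only the run of consecutive rows that surrounds the matching ID and shares its seqid) could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the function name `contiguous_block` is hypothetical, the seqid is taken to be the first whitespace-delimited token of each row, the ID value is assumed to contain no whitespace or semicolons, and the `----` delimiters are not present in the real file.

```python
import re

# Matches the ID attribute only (not locus_tag or any *_ID attribute).
ID_RE = re.compile(r"(?:^|[\s;])ID=([^;\s]+)")


def contiguous_block(gff_lines, gene_id):
    """Return the consecutive rows around `gene_id` that share its seqid.

    Comment lines (starting with '#') and blank lines are ignored.
    Returns an empty list if no row has ID == gene_id.
    """
    rows = [ln.rstrip("\n") for ln in gff_lines
            if ln.strip() and not ln.startswith("#")]
    seqids = [row.split(None, 1)[0] for row in rows]

    # Locate the first row whose ID attribute equals gene_id exactly.
    hit = None
    for i, row in enumerate(rows):
        match = ID_RE.search(row)
        if match and match.group(1) == gene_id:
            hit = i
            break
    if hit is None:
        return []

    # Extend upwards and downwards while the seqid stays the same.
    start = hit
    while start > 0 and seqids[start - 1] == seqids[hit]:
        start -= 1
    end = hit
    while end + 1 < len(rows) and seqids[end + 1] == seqids[hit]:
        end += 1
    return rows[start:end + 1]
```

With an open file handle `fh`, `contiguous_block(fh, "SRS014683_WUGC_scaffold_10932__gene_15572")` would return only the rows of the block that contains that gene, not the later rows that reuse the same seqid.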

gff • 1.2k views

ADD COMMENT • link 4.1 years ago by hazirliver ▴ 10

score 0 · Answer 1 · 2021-06-04

4.1 years ago

Juke34 9.3k

I’m not sure I understand your problem. Seq_ids are unique... the annotations on a seq_id are just not consecutive in your file. Even if that sounds weird, it is not a problem. What are the other lines with the same gene_ID? Please provide an example. By the way, I guess you mean ID and not gene_ID, because I don’t see any gene_ID attribute.

Maybe you should run AGAT to check/clean your file (gxf2gxf).

ADD COMMENT • link 4.1 years ago by Juke34 9.3k

Yes, you are right, it is ID. The problem is that I have 3000 files downloaded from IMG JGI, and almost all of them have GFF files in this format. For example, IMG Genome ID = 7000000715. You can download the archive from their website; one of the files will be the .gff (link to the .gff on Google Drive). If you open this file in a text editor and search for SRS014683_WUGC_scaffold_10932, you will see that this seqid appears in two places in the file, each with information about different genes. Almost all files have this problem.
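
Instead of searching in a text editor, the same check can be scripted. The following is a rough sketch, not a tested tool: the helper names `seqid_block_counts` and `split_seqids` are made up for illustration, and the seqid is assumed to be the first whitespace-delimited token of each non-comment line. It counts how many separate runs of consecutive lines each seqid occupies, so any seqid with a count above 1 appears in more than one place.

```python
from collections import Counter
from itertools import groupby


def seqid_block_counts(gff_lines):
    """Count, for each seqid, the number of separate consecutive runs of lines."""
    seqids = (ln.split(None, 1)[0] for ln in gff_lines
              if ln.strip() and not ln.startswith("#"))
    # groupby yields one key per run of identical consecutive seqids,
    # so counting keys counts runs rather than lines.
    return Counter(key for key, _ in groupby(seqids))


def split_seqids(gff_lines):
    """Return the sorted seqids that occur in more than one run."""
    counts = seqid_block_counts(gff_lines)
    return sorted(seqid for seqid, n in counts.items() if n > 1)
```

Running `split_seqids` on an open file handle for each of the downloaded .gff files would list the seqids that are split across the file.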

ADD REPLY • link 4.1 years ago by hazirliver ▴ 10