How to trim a GFF3 file based on specific coordinates?
2
0
Entering edit mode
3.3 years ago
VenGeno ▴ 100

Hi,

I would like to create a GFF3 file containing information only for specific coordinates from the chromosome level GFF3 file. I know how to extract gene and CDS info separately but don't know how to do trimming based on coordinates. Can someone please assist me with this?

My GFF3 file looks like this; and I want to retrieve the region 10700000:16500000

##gff-version 3
##sequence-region   1 1 43270923
#!genome-build IRGSP IRGSP-1.0
#!genome-version IRGSP-1.0
#!genome-date 2015-10
#!genome-build-accession GCA_001433935.1
#!genebuild-last-updated 2019-06
1   IRGSP-1.0   chromosome  1   43270923    .   .   .   ID=chromosome:1;Alias=Chr1,AP014957.1,NC_029256.1
###
1   RAP2018-11-26   gene    2983    10815   .   +   .   ID=gene:Os01g0100100;biotype=protein_coding;description=RabGAP/TBC domain containing protein;gene_id=Os01g0100100;logic_name=rapdb_genes
1   RAP2018-11-26   mRNA    2983    10815   .   +   .   ID=transcript:Os01t0100100-01;Parent=gene:Os01g0100100;biotype=protein_coding;transcript_id=Os01t0100100-01
1   RAP2018-11-26   exon    2983    3268    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E1;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=Os01t0100100-01-E1;rank=1
1   RAP2018-11-26   five_prime_UTR  2983    3268    .   +   .   Parent=transcript:Os01t0100100-01
1   RAP2018-11-26   five_prime_UTR  3354    3448    .   +   .   Parent=transcript:Os01t0100100-01
1   RAP2018-11-26   exon    3354    3616    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E2;constitutive=1;ensembl_end_phase=0;ensembl_phase=-1;exon_id=Os01t0100100-01-E2;rank=2
1   RAP2018-11-26   CDS 3449    3616    .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    4357    4455    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E3;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Os01t0100100-01-E3;rank=3
1   RAP2018-11-26   CDS 4357    4455    .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    5457    5560    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E4;constitutive=1;ensembl_end_phase=2;ensembl_phase=0;exon_id=Os01t0100100-01-E4;rank=4
1   RAP2018-11-26   CDS 5457    5560    .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    7136    7944    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E5;constitutive=1;ensembl_end_phase=1;ensembl_phase=2;exon_id=Os01t0100100-01-E5;rank=5
1   RAP2018-11-26   CDS 7136    7944    .   +   1   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    8028    8150    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E6;constitutive=1;ensembl_end_phase=1;ensembl_phase=1;exon_id=Os01t0100100-01-E6;rank=6
1   RAP2018-11-26   CDS 8028    8150    .   +   2   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    8232    8320    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E7;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;exon_id=Os01t0100100-01-E7;rank=7
1   RAP2018-11-26   CDS 8232    8320    .   +   2   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    8408    8608    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E8;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Os01t0100100-01-E8;rank=8
1   RAP2018-11-26   CDS 8408    8608    .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    9210    9615    .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E9;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=Os01t0100100-01-E9;rank=9
1   RAP2018-11-26   CDS 9210    9615    .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    10102   10187   .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E10;constitutive=1;ensembl_end_phase=0;ensembl_phase=1;exon_id=Os01t0100100-01-E10;rank=10
1   RAP2018-11-26   CDS 10102   10187   .   +   2   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   CDS 10274   10297   .   +   0   ID=CDS:Os01t0100100-01;Parent=transcript:Os01t0100100-01;protein_id=Os01t0100100-01
1   RAP2018-11-26   exon    10274   10430   .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E11;constitutive=1;ensembl_end_phase=-1;ensembl_phase=0;exon_id=Os01t0100100-01-E11;rank=11
1   RAP2018-11-26   three_prime_UTR 10298   10430   .   +   .   Parent=transcript:Os01t0100100-01
1   RAP2018-11-26   exon    10504   10815   .   +   .   Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E12;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=Os01t0100100-01-E12;rank=12
1   RAP2018-11-26   three_prime_UTR 10504   10815   .   +   .   Parent=transcript:Os01t0100100-01
###

Thank you!

Annotations GFF3 • 2.9k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Cool! Thank you !

ADD REPLY
2
Entering edit mode
3.3 years ago
Juke34 8.9k

You can use awk or gffread but it is not perfect. You will probably prefer AGAT: https://github.com/NBISweden/AGAT/issues/73

ADD COMMENT
1
Entering edit mode

@ VenGeno It's a > 500 mb installation.

ADD REPLY
0
Entering edit mode

Right, but it's rarely a problem nowadays, the installation procedure is more important (how easy it is to install or to use it. As most of modern tool one command is enough (for conda and container). Nevertheless I lighted it a bit in version 0.8.0 by removing R from the installation.

ADD REPLY
0
Entering edit mode

May not be true when one tries to install on ubuntu (focal 20.04) using wsl2 on windows 10. Recent conda uses python 3.9 and AGAT fails to install using bioconda repos as repos/libraries in repos are not compatible with Python 3.9. Tested today.

Conda warning:

agat_deps_conda

readline update conda:

read_line_update

Mamba warning:

mamba_warning

System: Windows 10, wsl2, ubuntu 20.04, conda 4.10.3, mamba 0.15.2, readline 8.1

ADD REPLY
0
Entering edit mode

Yes conda is not perfect it is why container can be preferable ^^

ADD REPLY
0
Entering edit mode
As most of modern tool one command is enough (for conda and container)

?

ps: Thank you for sharing the tool.

ADD REPLY
0
Entering edit mode

Ok maybe not 1 command but close to....

To get the tool with conda, 2 commands:

# get the tool
conda create -c bioconda -n agat agat 
conda activate agat
# use the tool
agat_sp_filter_record_by_coordinates.pl

With Container 1 command:

# get the tool
docker pull quay.io/biocontainers/agat:0.6.2--pl5262r35hdfd78af_0
# use the tool
docker run quay.io/biocontainers/agat:0.6.2--pl5262r35hdfd78af_0 agat_sp_filter_record_by_coordinates.pl
ADD REPLY
0
Entering edit mode

v0.0.1? Crazy that conda offers you that version by default... did you ask specifcally for this one?

ADD REPLY
0
Entering edit mode

conda install agat -c bioconda and for mamba, mamba install agat -c bioconda.

For conda:

agat_wsl

Packages libffi, libstdcxxx-ng critical packages for conda. They cannot be upgraded (as they are latest versions) and cannot be removed.

ADD REPLY
1
Entering edit mode

Bioconda recommends having conda-forge as the first priority channel, so conda/mamba install -c conda-forge -c bioconda. I can confirm that agat installs properly on ubuntu when the correct channel priority is set.

ADD REPLY
0
Entering edit mode

Very interesting... I guess in v0.8.0 I will not have this problem anymore. I will update the recipe.

ADD REPLY
0
Entering edit mode

@Juke34 Thank you for sharing the tool. I will give a try using all three approaches suggested by You and @cpad0112. Have a Great day!

ADD REPLY
1
Entering edit mode
3.3 years ago
Divon ▴ 230

You can to this easily using my Genozip tool:

$ genozip myfile.gff3
$ genocat --regions 1:3354-4455 --grep exon --no-header myfile.gff3.genozip 

1       RAP2018-11-26   exon    3354    3616    .       +       .       Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E2;constitutive=1;ensembl_end_phase=0;ensembl_phase=-1;exon_id=Os01t0100100-01-E2;rank=2
1       RAP2018-11-26   exon    4357    4455    .       +       .       Parent=transcript:Os01t0100100-01;Name=Os01t0100100-01-E3;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=Os01t0100100-01-E3;rank=3

See here: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor

ADD COMMENT
0
Entering edit mode

How does it deal with overlaping records? i.e. if a CDS chunck is ouside the span and another inside (from the same record)? And if a feature is partly outside,partly inside, what happen to this feature and what happen to the full record?

P.S: by record I mean a set of related features e.g: gene + mRNA + exon + CDS related to one gene.

ADD REPLY
0
Entering edit mode

If the requirement is that all lines related to a gene (record) are either all included or all excluded, then it becomes a bit more messy. One way to do it is:

genozip myfile.gff3
grep -F "`genocat --regions 1:10700000:16500000 --no-header  myfile.gff3.genozip | awk '$3=="gene"' | cut -f9 | cut -d\; -f1 | cut -c14-`" myfile.gff3
ADD REPLY

Login before adding your answer.

Traffic: 1199 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6