Hi everyone,
Can anyone point me in the direction of how I can extract all features within a given genomic range from a .gff3 file? For example, if I have a list of genomic regions I am interested in, with scaffold ID and start to end position - how can I extract the gene IDs from the gff?
I am extremely new to bioinformatics so please excuse me if this is super straightforward or has been answered elsewhere...
I have found posts where you can use Bioconductor packages to extract features based on ID (i.e. searching for a specific gene) but can't see how to extract all genes in a given range.
The number of tools out there is overwhelming and I am pretty sure this is the kind of straightforward thing someone could easily answer!
Thanks in advance.
You can filter your gff to keep only the region of interest as indicated here.
Then there are 2 possibilities using AGAT:
1. Convert the gff to tsv with agat_convert_sp_gff2tsv.pl, and then extract what you need from the tsv, where $10 is the 10th column (assuming the ID is in the 10th column; check the first line to see which column the ID is in).
2. Use agat_sp_extract_attributes.pl.

Thank you both for your replies - great to be made aware of bedtools and AGAT which both look very helpful!
You should be able to use bedtools - it works on GFF files. The "list of genomic regions" should be a BED file. Note that BED coordinates are 0-based and half-open, whereas GFF3 coordinates are 1-based and inclusive.
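
If it helps to see what the overlap test boils down to, here is a minimal plain-Python sketch. It is not how bedtools or AGAT work internally, just an illustration of the logic. The function names, the example scaffold names and the gene IDs are all made up for the example, and it assumes a tab-separated GFF3 with an ID= attribute in column 9.

```python
def parse_attributes(field):
    """Turn a GFF3 column-9 string such as 'ID=geneA;Name=abc' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def bed_to_gff_interval(chrom, bed_start, bed_end):
    """Convert a 0-based, half-open BED interval to 1-based, inclusive coordinates."""
    return (chrom, bed_start + 1, bed_end)


def genes_in_regions(gff_lines, regions, feature_type="gene"):
    """Return the IDs of features of `feature_type` that overlap any region.

    `regions` is a list of (scaffold, start, end) tuples in 1-based,
    inclusive coordinates, i.e. the same convention as GFF3.
    """
    hits = []
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        for r_seqid, r_start, r_end in regions:
            # Two closed intervals overlap when each starts before the other ends.
            if seqid == r_seqid and start <= r_end and end >= r_start:
                hits.append(parse_attributes(cols[8]).get("ID"))
                break  # report each feature once
    return hits


# Hypothetical example data, invented for this sketch.
EXAMPLE_GFF = [
    "##gff-version 3",
    "scaffold_1\t.\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "scaffold_1\t.\tmRNA\t100\t500\t.\t+\t.\tID=geneA.t1;Parent=geneA",
    "scaffold_1\t.\tgene\t900\t1500\t.\t-\t.\tID=geneB",
    "scaffold_2\t.\tgene\t200\t300\t.\t+\t.\tID=geneC",
]

# Regions given BED-style (0-based start), converted before comparing.
EXAMPLE_REGIONS = [
    bed_to_gff_interval("scaffold_1", 450, 950),
    bed_to_gff_interval("scaffold_2", 300, 400),
]

print(genes_in_regions(EXAMPLE_GFF, EXAMPLE_REGIONS))  # prints ['geneA', 'geneB']
```

geneC is not reported because the BED region 300-400 on scaffold_2 starts at GFF position 301, one base after the gene ends. That off-by-one is the reason the 0-based remark matters when you build your regions file.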