2.4 years ago
elcortegano
Hi all, I'm trying to filter a large GFF file with many gene/exon entries so that it retains only those gene/exon entries that contain a CDS in their hierarchy, e.g. excluding genes or exons related to ncRNA that do not have a CDS.
I am not aware of any tool that does this kind of filtering, but I assume there must be one. Does anyone know how to do this? Thanks!
EDIT
To provide an example, I'd like to keep features like the ones in contig ptg000013l below (where an exon has a CDS annotation contained within it), and exclude other exon annotations that have no CDS within them.
ptg000013l ensembl exon 49126502 49128513 . - . Parent=transcript:ENSMUST00000238969;Name=ENSMUSE00000644098;constitutive=0;ensembl_end_phase=-1;ensembl_phase=0;exon_id=ENSMUSE00000644098;rank=8;v
ptg000013l havana CDS 49127986 49128513 . - 0 ID=CDS:ENSMUSP00000158947;Parent=transcript:ENSMUST00000238953;protein_id=ENSMUSP00000158947
ptg000048l havana exon 8219576 8219759 . - . Parent=transcript:ENSMUST00000211519;Name=ENSMUSE00001383012;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSMUSE00001383012;rank=1;v
It would help if you could post a few lines. A GFF file is a text file, and you can filter a text file by pattern. Based on the three lines, and assuming that the fields are tab-separated:
This will only return the CDS entry. The trick is to also get the exons (and gene entries, if present), but only those with a CDS annotation within them. In the example case, I'd expect to get the CDS annotation as well as the exon in ptg000013l, since it does contain a CDS, but not the other exon in ptg000048l.
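One way to approach this is by coordinates rather than by text pattern. The following is a minimal Python sketch, not a tested tool. It assumes tab-separated GFF with nine columns and treats "contains a CDS" as: same contig, same strand, and the CDS interval lies fully inside the gene/exon interval. It does not follow `Parent=` links, since in the example the exon and the CDS on ptg000013l list different parent transcripts. The function name `filter_gff` and the set `KEEP_IF_CONTAINS_CDS` are made up for illustration. Comment lines and all other feature types (mRNA, UTRs, etc.) are dropped by this sketch.

```python
from collections import defaultdict

# Assumption: only these feature types are kept conditionally; adjust as needed.
KEEP_IF_CONTAINS_CDS = {"gene", "exon"}


def filter_gff(lines):
    """Return every CDS line, plus each gene/exon line that fully spans a CDS."""
    records = []
    cds_by_locus = defaultdict(list)  # (seqid, strand) -> [(start, end), ...]

    # First pass: parse the nine GFF columns and collect CDS intervals.
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, headers and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed feature line
        seqid, ftype, strand = fields[0], fields[2], fields[6]
        start, end = int(fields[3]), int(fields[4])
        records.append((line, seqid, ftype, start, end, strand))
        if ftype == "CDS":
            cds_by_locus[(seqid, strand)].append((start, end))

    # Second pass: keep CDS lines and the gene/exon lines that contain one.
    kept = []
    for line, seqid, ftype, start, end, strand in records:
        if ftype == "CDS":
            kept.append(line)
        elif ftype in KEEP_IF_CONTAINS_CDS:
            intervals = cds_by_locus[(seqid, strand)]
            if any(start <= s and e <= end for s, e in intervals):
                kept.append(line)
    return kept
```

To use it, pass an open file handle (or any iterable of lines) to `filter_gff` and print the returned lines. On the three example lines it would keep the exon and CDS on ptg000013l and drop the exon on ptg000048l. Note that the containment check scans all CDS intervals on a contig for each gene/exon, so for a very large file a sorted list or an interval index would probably be needed.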