2.7 years ago
RunCoderRun
I have a list of genes (Ensembl IDs), and I want to filter a GFF file so that only the genes I give as input are kept. My first idea was to use the 9th column ("attributes"), because the gene_id=ENSG0...... information is there, but I couldn't quite figure out how to filter on column 9.
Here are the first few lines of the GFF file:
##gff-version 3
1 havana pseudogene 11869 14409 . + . ID=gene:ENSG00000223972;Name=DDX11L1;biotype=transcribed_unprocessed_pseudogene;description=DEAD/H-box helicase 11 like 1 (pseudogene) [Source:HGNC Symbol%3BAcc:HGNC:37102];gene_id=ENSG00000223972;logic_name=havana_homo_sapiens;version=5
1 havana lnc_RNA 11869 14409 . + . ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-202;biotype=processed_transcript;tag=basic;transcript_id=ENST00000456328;transcript_support_level=1;version=2
1 havana exon 11869 12227 . + . Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
1 havana exon 12613 12721 . + . Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
1 havana exon 13221 14409 . + . Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
1 havana pseudogenic_transcript 12010 13670 . + . ID=transcript:ENST00000450305;Parent=gene:ENSG00000223972;Name=DDX11L1-201;biotype=transcribed_unprocessed_pseudogene;tag=basic;transcript_id=ENST00000450305;transcript_support_level=NA;version=2
1 ensembl_havana gene 215567304 215621807 . + . ID=gene:ENSG00000136636;Name=KCTD3;biotype=protein_coding;description=potassium channel tetramerization domain containing 3 [Source:HGNC Symbol%3BAcc:HGNC:21305];gene_id=ENSG00000136636;logic_name=ensembl_havana_gene_homo_sapiens;version=13
1 ensembl_havana mRNA 215567304 215621807 . + . ID=transcript:ENST00000259154;Parent=gene:ENSG00000136636;Name=KCTD3-201;biotype=protein_coding;ccdsid=CCDS1515.1;tag=basic;transcript_id=ENST00000259154;transcript_support_level=1 (assigned to previous version 8);version=9
I got the GFF file from the Ensembl FTP site (for Homo sapiens).
My files are large, and my code runs other operations at the same time as this one. Filtering on the 9th column felt a bit rudimentary, so I was wondering how I could make this process more practical and use RAM and CPU more efficiently. Thank you for your kind response.
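
For reference, here is a minimal Python sketch of the column-9 filtering described above. It is not a definitive solution, only an illustration under a few assumptions: the real file is tab-separated (the sample above shows the columns flattened to spaces), every gene line carries a gene_id attribute, and parent features appear before their children, as they do in the lines shown. The names parse_attributes and filter_gff are hypothetical helpers, not part of any library. The file is read one line at a time, so only the set of wanted IDs and the IDs of kept features stay in memory.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string 'key=value;key=value' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        key, sep, value = field.partition("=")
        if sep:  # skip empty or malformed fields without '='
            attrs[key] = value
    return attrs


def filter_gff(lines, wanted_genes):
    """Yield header lines plus the lines of wanted genes and their children.

    Assumes parents come before children in the file, so one pass is enough.
    """
    kept_ids = set()  # IDs of features already kept (genes, transcripts)
    for line in lines:
        if line.startswith("#"):
            yield line  # keep comments and directives such as ##gff-version
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        attrs = parse_attributes(columns[8])
        is_wanted_gene = attrs.get("gene_id") in wanted_genes
        # Parent may hold several comma-separated IDs in GFF3
        parents = attrs.get("Parent", "").split(",")
        has_kept_parent = any(parent in kept_ids for parent in parents)
        if is_wanted_gene or has_kept_parent:
            if "ID" in attrs:
                kept_ids.add(attrs["ID"])  # lets children match on Parent
            yield line


# Possible usage (file names are placeholders, not real paths):
#   wanted = {line.strip() for line in open("gene_ids.txt") if line.strip()}
#   with open("input.gff3") as src, open("filtered.gff3", "w") as dst:
#       dst.writelines(filter_gff(src, wanted))
# For a compressed file, gzip.open("input.gff3.gz", "rt") can replace open().
```

Because filter_gff is a generator, it can be chained with other per-line steps without loading the whole file. If parents are not guaranteed to come before their children, a first pass collecting the transcript IDs of the wanted genes would be needed.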