Question

How to write Python code to filter a GFF file by a gene list given as input?

2.7 years ago

RunCoderRun • 0

I have a list of genes (Ensembl IDs). I want to filter the GFF file based on the genes I give as input. My first idea was to use the 9th column, "attributes", because the gene_id=ENSG0...... information is there, but I couldn't quite figure out how to filter on column 9.

Here are the first few lines of the GFF3 file.

##gff-version 3
1   havana  pseudogene  11869   14409   .   +   .   ID=gene:ENSG00000223972;Name=DDX11L1;biotype=transcribed_unprocessed_pseudogene;description=DEAD/H-box helicase 11 like 1 (pseudogene) [Source:HGNC Symbol%3BAcc:HGNC:37102];gene_id=ENSG00000223972;logic_name=havana_homo_sapiens;version=5
1   havana  lnc_RNA 11869   14409   .   +   .   ID=transcript:ENST00000456328;Parent=gene:ENSG00000223972;Name=DDX11L1-202;biotype=processed_transcript;tag=basic;transcript_id=ENST00000456328;transcript_support_level=1;version=2
1   havana  exon    11869   12227   .   +   .   Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1
1   havana  exon    12613   12721   .   +   .   Parent=transcript:ENST00000456328;Name=ENSE00003582793;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003582793;rank=2;version=1
1   havana  exon    13221   14409   .   +   .   Parent=transcript:ENST00000456328;Name=ENSE00002312635;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002312635;rank=3;version=1
1   havana  pseudogenic_transcript  12010   13670   .   +   .   ID=transcript:ENST00000450305;Parent=gene:ENSG00000223972;Name=DDX11L1-201;biotype=transcribed_unprocessed_pseudogene;tag=basic;transcript_id=ENST00000450305;transcript_support_level=NA;version=2 
1   ensembl_havana  gene    215567304   215621807   .   +   .   ID=gene:ENSG00000136636;Name=KCTD3;biotype=protein_coding;description=potassium channel tetramerization domain containing 3 [Source:HGNC Symbol%3BAcc:HGNC:21305];gene_id=ENSG00000136636;logic_name=ensembl_havana_gene_homo_sapiens;version=13
1   ensembl_havana  mRNA    215567304   215621807   .   +   .   ID=transcript:ENST00000259154;Parent=gene:ENSG00000136636;Name=KCTD3-201;biotype=protein_coding;ccdsid=CCDS1515.1;tag=basic;transcript_id=ENST00000259154;transcript_support_level=1 (assigned to previous version 8);version=9

I got the GFF file from the Ensembl FTP site (for Homo sapiens).

parser bcbio-gff bioparser GFF3 • 1.2k views

score 0 · Answer 1 · 2022-03-29

2.7 years ago

Juke34 8.9k

It sounds like homework. You can use gffutils or BCBio to manipulate GFF in Python. Otherwise, there are ready-to-use tools like agat_sp_filter_feature_from_keep_list.pl from AGAT (in Perl).
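For the column 9 handling asked about in the question, here is a minimal plain-Python sketch that does not rely on either library. It assumes the file is tab-separated (as the GFF3 format specifies) and that attributes are `key=value` pairs joined by `;`. The helper name `parse_attributes` is made up for illustration, and the example line is an abridged copy of the first feature in the question.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attributes string (column 9) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters such as ';' (%3B)
            attributes[key] = unquote(value)
    return attributes


# Abridged version of the first feature line from the question (tab-separated)
line = (
    "1\thavana\tpseudogene\t11869\t14409\t.\t+\t.\t"
    "ID=gene:ENSG00000223972;Name=DDX11L1;gene_id=ENSG00000223972;version=5"
)
columns = line.rstrip("\n").split("\t")
attrs = parse_attributes(columns[8])
print(attrs["gene_id"])
```

Note that only the gene-level lines carry `gene_id`. Transcripts point to their gene through `Parent=gene:...`, and exons point to their transcript through `Parent=transcript:...`, so filtering on `gene_id` alone would drop the child features.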

My files are large, and my code performs other operations at the same time as this filtering. My approach of filtering on the 9th column felt a bit rudimentary, so I was wondering how I could make this process more practical and use RAM and CPU more efficiently. Thank you for your kind response.

2.7 years ago by RunCoderRun • 0
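On the memory concern, one possible approach is a single streaming pass that never loads the whole file. The sketch below keeps a set of the feature IDs already accepted and keeps any later feature whose `Parent` is in that set. It assumes that parents appear before their children (gene, then transcript, then exon), as in the excerpt in the question, and that gene features carry `ID=gene:<Ensembl ID>`. The function name `filter_gff3` and the tiny in-memory example are illustrative only.

```python
def filter_gff3(lines, wanted_gene_ids):
    """Yield comment lines plus every feature that belongs to a wanted gene.

    Assumes each parent feature appears before its children in the input.
    """
    wanted = {"gene:" + gene_id for gene_id in wanted_gene_ids}
    kept_ids = set()  # IDs of features already kept (genes, transcripts)
    for line in lines:
        if line.startswith("#"):
            yield line  # keep headers and directives unchanged
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # skip blank or malformed lines
        feature_id = None
        parents = []
        for field in columns[8].split(";"):
            key, _, value = field.partition("=")
            if key == "ID":
                feature_id = value
            elif key == "Parent":
                parents = value.split(",")
        if feature_id in wanted or any(p in kept_ids for p in parents):
            if feature_id is not None:
                kept_ids.add(feature_id)
            yield line


# Small in-memory example built from abridged lines of the question's excerpt
example = [
    "##gff-version 3\n",
    "1\thavana\tpseudogene\t11869\t14409\t.\t+\t.\tID=gene:ENSG00000223972;gene_id=ENSG00000223972\n",
    "1\thavana\tlnc_RNA\t11869\t14409\t.\t+\t.\tID=transcript:ENST00000456328;Parent=gene:ENSG00000223972\n",
    "1\thavana\texon\t11869\t12227\t.\t+\t.\tParent=transcript:ENST00000456328;rank=1\n",
    "1\tensembl_havana\tgene\t215567304\t215621807\t.\t+\t.\tID=gene:ENSG00000136636;gene_id=ENSG00000136636\n",
    "1\tensembl_havana\tmRNA\t215567304\t215621807\t.\t+\t.\tID=transcript:ENST00000259154;Parent=gene:ENSG00000136636\n",
]
kept = list(filter_gff3(example, {"ENSG00000136636"}))
for kept_line in kept:
    print(kept_line, end="")
```

Because `filter_gff3` is a generator, it can be fed an open file handle directly and its output written with `writelines`, so memory use grows only with the number of kept IDs rather than with the file size. Set lookups make each membership check constant time on average.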