I have a batch of GFF files for bacterial genomes, edited from the original GFF files on the NCBI FTP site. In the edited GFF files I have inserted sRNA features. Here is a dummy example to illustrate an edited GFF:
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_016077.1 1 2487765
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=568816
NC_016077.1 RefSeq region 1 2487765 . + . ID=id0;Dbxref=taxon:568816;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Acidaminococcus intestinalis RYC-MR95;strain=RyC-MR95
NC_016077.1 RefSeq CDS 175 1551 . + 0 ID=cds0;Name=YP_004895395.1;Parent=gene0;Note=COG0593;Dbxref=Genbank:YP_004895395.1,GeneID:11266123;gbkey=CDS;product=chromosomal replication initiator protein;protein_id=YP_004895395.1;transl_table=11
NC_016077.1 RefSeq CDS 1626 1778 . + 0 ID=cds1;Name=YP_004895396.1;Parent=gene1;Dbxref=Genbank:YP_004895396.1,GeneID:11264393;gbkey=CDS;product=hypothetical protein;protein_id=YP_004895396.1;transl_table=11
NC_016077.1 RefSeq sRNA 1799 1825 . + 0 ID=sRNA1;Name=sRNA1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=sRNA;product=sRNA 1;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq CDS 1829 2947 . + 0 ID=cds2;Name=YP_004895397.1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=CDS;product=DNA polymerase III;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq CDS 2953 4101 . + 0 ID=cds3;Name=YP_004895398.1;Parent=gene3;Dbxref=Genbank:YP_004895398.1,GeneID:11264395;gbkey=CDS;product=recombination protein F;protein_id=YP_004895398.1;transl_table=11
The sRNA line (ID=sRNA1) is the feature of interest. I want to extract the gene immediately before and the gene immediately after each sRNA; in this case they are cds1 and cds2. The output should be tab-delimited so that I can then extract the FASTA sequences. Each GFF contains multiple sRNAs, and I have many GFF files to process.
Would anyone mind helping?
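One possible way to do this without extra tools is a short Python sketch like the one below. It is only an illustration under several assumptions: the real files are tab-delimited 9-column GFF3 (the example above shows the columns flattened to spaces), "before" and "after" are decided by coordinate only and ignore strand, and the gap is counted as the number of bases strictly between the two features. The function names `parse_gff`, `flanking_genes` and `flanks_for_files` are hypothetical, not part of any library.

```python
def parse_gff(lines):
    """Yield (seqid, type, start, end, ID) for each tab-delimited feature line."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column GFF3 record
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield cols[0], cols[2], int(cols[3]), int(cols[4]), attrs.get("ID", ".")


def flanking_genes(lines, target="sRNA", gene_type="CDS"):
    """Return one row per (sRNA, flanking gene) pair, by coordinate only."""
    feats = list(parse_gff(lines))
    rows = []
    for seqid, ftype, start, end, fid in feats:
        if ftype != target:
            continue
        genes = [f for f in feats if f[0] == seqid and f[1] == gene_type]
        before = [f for f in genes if f[3] < start]  # gene ends left of the sRNA
        after = [f for f in genes if f[2] > end]     # gene starts right of the sRNA
        if before:
            up = max(before, key=lambda f: f[3])     # nearest gene on the left
            rows.append((seqid, fid, "before", up[4], up[2], up[3], start - up[3] - 1))
        if after:
            down = min(after, key=lambda f: f[2])    # nearest gene on the right
            rows.append((seqid, fid, "after", down[4], down[2], down[3], down[2] - end - 1))
    return rows


def flanks_for_files(paths):
    """Yield tab-delimited output lines for a batch of GFF files."""
    for path in paths:
        with open(path) as handle:
            for row in flanking_genes(handle):
                yield "\t".join(str(x) for x in (path, *row))
```

Each output row holds the sequence ID, the sRNA ID, the side, the flanking gene ID, the gene start and end, and the gap in bases. For many files, `flanks_for_files` can be given a list of paths (for example from `glob.glob("*.gff")`) and its lines written to a single TSV file.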
Thanks for the help. I will try this one; it looks like it might work well. That said, I have a lot of GFF files to process. I believe I can just run the script from a bash file, am I right?
Just curious: if I am not mistaken, it will take the feature closest to the sRNA, right? In my case, I want one gene upstream and one downstream of the sRNA, meaning that each sRNA should have two flanking genes.
Another thing: what if I want to restrict the extraction to a certain distance from my sRNA? Let's say, 'extract the genes immediately before and after the sRNA, provided their distance from the sRNA (the gap between gene and sRNA) is less than 200 bp'. Any ideas?
to report the distance in bases for each queried CDS from its sRNA. Then use awk to filter the CDS results by the distance threshold of your choice. Edit: See edited answer for code.
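For readers who prefer Python to awk, the distance filter can be sketched roughly as follows. This is an illustration, not the code from the edited answer: the helper names `gap` and `within` are hypothetical, features are plain (start, end) tuples with 1-based inclusive coordinates, and the gap is counted as the number of bases strictly between two features, which may differ by one from the distance a given tool reports.

```python
def gap(a, b):
    """Bases strictly between two (start, end) intervals; 0 if they touch or overlap."""
    (a_start, a_end), (b_start, b_end) = sorted([a, b])  # leftmost interval first
    return max(0, b_start - a_end - 1)


def within(srna, genes, max_gap=200):
    """Keep (gene_id, gap) pairs whose gap to the sRNA is below max_gap."""
    kept = []
    for gene_id, start, end in genes:
        d = gap(srna, (start, end))
        if d < max_gap:
            kept.append((gene_id, d))
    return kept


# Coordinates taken from the dummy GFF in the question.
srna = (1799, 1825)
genes = [("cds0", 175, 1551), ("cds1", 1626, 1778), ("cds2", 1829, 2947)]
print(within(srna, genes))  # cds0 is dropped: it lies 247 bases away
```

With the 200 bp threshold from the question, cds1 (gap 20) and cds2 (gap 3) are kept. Changing `max_gap` changes the threshold, much as editing the number in an awk condition would.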