Hi,
I have a batch of GFF files for bacterial genomes, which have been edited from the original GFF files on the NCBI FTP site. In the edited GFFs, I have inserted sRNA features. Here is a dummy example to illustrate an edited GFF:
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_016077.1 1 2487765
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=568816
NC_016077.1 RefSeq region 1 2487765 . + . ID=id0;Dbxref=taxon:568816;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Acidaminococcus intestinalis RYC-MR95;strain=RyC-MR95
NC_016077.1 RefSeq CDS 175 1551 . + 0 ID=cds0;Name=YP_004895395.1;Parent=gene0;Note=COG0593;Dbxref=Genbank:YP_004895395.1,GeneID:11266123;gbkey=CDS;product=chromosomal replication initiator protein;protein_id=YP_004895395.1;transl_table=11
NC_016077.1 RefSeq CDS 1626 1778 . + 0 ID=cds1;Name=YP_004895396.1;Parent=gene1;Dbxref=Genbank:YP_004895396.1,GeneID:11264393;gbkey=CDS;product=hypothetical protein;protein_id=YP_004895396.1;transl_table=11
NC_016077.1 RefSeq sRNA 1799 1825 . + 0 ID=sRNA1;Name=sRNA1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=sRNA;product=sRNA 1;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq CDS 1829 2947 . + 0 ID=cds2;Name=YP_004895397.1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=CDS;product=DNA polymerase III;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq CDS 2953 4101 . + 0 ID=cds3;Name=YP_004895398.1;Parent=gene3;Dbxref=Genbank:YP_004895398.1,GeneID:11264395;gbkey=CDS;product=recombination protein F;protein_id=YP_004895398.1;transl_table=11
The line with ID=sRNA1 is the sRNA. I want to extract the gene immediately before and the gene immediately after each sRNA; in this case they are cds1 and cds2. The output should be in tab-delimited format so that I can then extract the FASTA sequences. Each GFF contains multiple sRNAs to extract, and I have many GFF files to process.
Would anyone mind helping?
Regards
Thanks for the help. I will try this one; it looks like it might work well. That said, I have a lot of GFF files to process. I believe I just need to run the script from a bash file, am I right?
Just curious: if I am not mistaken, it will take the feature closest to the sRNA of interest, right? In my case, I want one gene upstream and one downstream of the sRNA, meaning that each sRNA should have two flanking genes.
Another thing: what if I want to restrict the extraction to a certain distance from my sRNA? For example, 'extract the genes immediately before and after the sRNA, provided their distance from the sRNA (the gap between gene and sRNA) is less than 200 bp'. Any ideas?
Add --dist to report the distance in bases for each queried CDS from its sRNA. Then use awk to filter the CDS results by the distance threshold of your choice. Edit: See edited answer for code.
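For anyone who wants to see the logic spelled out, here is a minimal standalone Python sketch of the task described in this thread. It is not the tool or code from the answer, just an illustration under some assumptions: the GFF is a proper tab-delimited GFF3 file, "before" and "after" are taken by coordinate on the same sequence (not strand-aware), features overlapping the sRNA are ignored, and the gap is counted as the number of bases strictly between the two features. The function names and the max_gap parameter are made up for this example.

```python
import csv
import sys


def parse_gff(lines):
    """Yield (seqid, type, start, end, strand, attrs) from tab-delimited GFF3 lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, pragmas and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], attrs


def flanking_genes(lines, target="sRNA", flank="CDS", max_gap=None):
    """Return one row per flanking feature of each target feature.

    Row: (seqid, target ID, side, flank ID, flank start, flank end, strand, gap).
    If max_gap is given, only flanks with gap < max_gap are kept (assumption).
    """
    feats = list(parse_gff(lines))
    rows = []
    for seqid, ftype, start, end, _strand, attrs in feats:
        if ftype != target:
            continue
        candidates = [f for f in feats if f[0] == seqid and f[1] == flank]
        # Nearest flank ending before the target, and nearest one starting after it.
        before = max((f for f in candidates if f[3] < start),
                     key=lambda f: f[3], default=None)
        after = min((f for f in candidates if f[2] > end),
                    key=lambda f: f[2], default=None)
        for side, f in (("before", before), ("after", after)):
            if f is None:
                continue
            gap = start - f[3] - 1 if side == "before" else f[2] - end - 1
            if max_gap is not None and gap >= max_gap:
                continue
            rows.append((seqid, attrs.get("ID", "."), side,
                         f[5].get("ID", "."), f[2], f[3], f[4], gap))
    return rows


if __name__ == "__main__":
    # Usage (hypothetical script name): python flank.py file1.gff file2.gff > out.tsv
    writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
    for path in sys.argv[1:]:
        with open(path) as handle:
            for row in flanking_genes(handle, max_gap=200):
                writer.writerow((path,) + row)
```

With the dummy GFF from the question (tab-delimited), this should report cds1 (gap 20) and cds2 (gap 3) for sRNA1. Since the script accepts several file names at once, a whole batch can be processed in one call or from a shell loop.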