Extract flanking genes around sRNA from gff files
1
0
Entering edit mode
9.5 years ago
CikLa ▴ 90

Hi,

I have a batch of GFF files for bacteria genomes, which have been edited from the original GFF file from NCBI ftp site. In the edited GFF, I have inserted sRNA features. This is an example (dummy example) to illustrate the edited GFF:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_016077.1 1 2487765
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=568816
NC_016077.1 RefSeq  region  1   2487765 .   +   .   ID=id0;Dbxref=taxon:568816;Is_circular=true;gbkey=Src;genome=chromosome;mol_type=genomic DNA;old-name=Acidaminococcus intestinalis RYC-MR95;strain=RyC-MR95
NC_016077.1 RefSeq  CDS 175 1551    .   +   0   ID=cds0;Name=YP_004895395.1;Parent=gene0;Note=COG0593;Dbxref=Genbank:YP_004895395.1,GeneID:11266123;gbkey=CDS;product=chromosomal replication initiator protein;protein_id=YP_004895395.1;transl_table=11
NC_016077.1 RefSeq  CDS 1626    1778    .   +   0   ID=cds1;Name=YP_004895396.1;Parent=gene1;Dbxref=Genbank:YP_004895396.1,GeneID:11264393;gbkey=CDS;product=hypothetical protein;protein_id=YP_004895396.1;transl_table=11
NC_016077.1 RefSeq  sRNA    1799    1825    .   +   0   ID=sRNA1;Name=sRNA1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=sRNA;product=sRNA 1;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 1829    2947    .   +   0   ID=cds2;Name=YP_004895397.1;Parent=gene2;Dbxref=Genbank:YP_004895397.1,GeneID:11264394;gbkey=CDS;product=DNA polymerase III;protein_id=YP_004895397.1;transl_table=11
NC_016077.1 RefSeq  CDS 2953    4101    .   +   0   ID=cds3;Name=YP_004895398.1;Parent=gene3;Dbxref=Genbank:YP_004895398.1,GeneID:11264395;gbkey=CDS;product=recombination protein F;protein_id=YP_004895398.1;transl_table=11

So the bold one is sRNA. I want to extract a gene before and a gene after the sRNA sequence, in this case they are cds1 and cds2. The output should be in tab delimited format so that I could extract the fasta sequence then. In a GFF, there are multiple sRNAs to extract and I have many GFF files to perform this task.

Anyone mind to to help?

Regards

sRNA • 3.1k views
ADD COMMENT
3
Entering edit mode
9.5 years ago

Use BEDOPS gff2bed to convert GFF to BED, awk to create CDS and sRNA subsets, and BEDOPS closest-features to extract the nearest CDS features to sRNA elements:

$ gff2bed < features.gff > features.bed
$ awk '$8=="CDS"' features.bed > CDS.bed
$ awk '$8=="sRNA"' features.bed > sRNA.bed
$ closest-features --no-ref --delim '\n' sRNA.bed CDS.bed | sort-bed - > nearest_CDS_per_sRNA.bed

Then you can take the BED file nearest_CDS_per_sRNA.bed and convert it to FASTA with bed2fasta or similar approaches.

If you need to filter by distance, add the --dist option, remove the --delim option, and use awk to place a distance threshold on results:

$ closest-features --no-ref --dist sRNA.bed CDS.bed |
      awk -v threshold=200 '{
          n = split($0, a, "|");
          for (i = 0; I < n; i++) {
              if ((i % 2) && (a[i+1] <= threshold)) {
                  print a[i]"\t"a[i+1];
              }
          }
      }' - |
      sort-bed - > thresholded_nearest_CDS_per_sRNA.bed

This can be run directly on the command line (e.g., within a bash shell) or if you plan on doing this frequently, you could write this into a shell script or makefile.

ADD COMMENT
0
Entering edit mode

Thanks for the help. I will try this one, looks like it might work well.Having said that I have a lot of GFF to be done, I believe it's just to run the script in bash file, am I right?

ADD REPLY
0
Entering edit mode

Just curious here, if I am not mistaken, it will take the closest feature to the sRNA that we want right? In my case, I want the one upstream and one downstream of the sRNA, meaning that, for an sRNA, it should have 2 flanking genes.

Another thing is, what if I want to restrict the extraction to certain distance from my sRNA. Let say, 'extract the gene sits next & after the sRNA, provided their distance from sRNA (gap between gene & sRNA) is less than 200bp'? Any idea?

ADD REPLY
0
Entering edit mode

Add --dist to report the distance in bases for each queried CDS from its sRNA. Then use awk to filter the CDS results by the distance threshold of your choice. Edit: See edited answer for code.

ADD REPLY

Login before adding your answer.

Traffic: 1820 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6