Dear Biostars community,
I am working on a codon usage bias analysis on bacterial genomes, and I would like to create a reference set of genes that are predicted to be highly expressed. I am therefore wondering if there is a way to programmatically filter annotated genes retrieved from GenBank? Specifically, I have several large txt files for each of my bacteria with annotated genes in the following format:
lcl|NC_002947.4_cds_WP_003250662.1_2506 [gene=infC] [locus_tag=PP_RS12860] [protein=translation initiation factor IF-3] [protein_id=WP_003250662.1] [location=2811534..2812067] [gbkey=CDS] ATGAGAAACGATAAGCGTGCTGTACCGAAGGCCCCGATCAACGAGAATATCTCGGCCCGCGAGGTTCGGT TAATTGGCGCAGACGGCGAGCAGGTTGGCATCGTCTCGATTGATGAAGCGCTGCGTATCGCTGATGAAGC GAAGCTGGACCTGGTAGAAATCTCTGCAGACGCGGTACCACCCGTCTGCAAGGTCATGGACTACGGTAAG CACCTCTTCGAGAAGAAGAAGCAGGCTAACGAAGCCAAGAAAAACCAGAAACAGATCCAGATCAAAGAAA TCAAGTTTCGTCCAGGGACGGAGGATGGGGATTACCAGGTAAAACTACGCAACCTGGTACGTTTCCTTAC CGATGGGGACAAGGCCAAGATCTCTCTGAGATTCCGTGGTCGTGAGATGGCCCACCAGGAGCTGGGCATG GAGCTGTTGAAGCGGGTCGAAGCCGACCTCGCCGAATACGGCACCGTTGAGCAGCATCCGAAGATGGAAG GACGCCAGCTTATGATGGTCATCGCCCCCAAGAAAAAGAAGTAA lcl|NC_002947.4_cds_WP_003250667.1_2507 [gene=rpmI] [locus_tag=PP_RS12865] [protein=50S ribosomal protein L35] [protein_id=WP_003250667.1] [location=2812128..2812322] [gbkey=CDS] ATGCCAAAAATGAAAACCAAGAGCGGTGCTGCGAAGCGCTTCCTGAAGACAGCTTCGGGCTTCAAGCACA AGCACGCTTTCAAGAGCCACATCCTGACCAAAATGTCGACCAAGCGTAAGCGTCAACTGCGCGGTGCCAG CTTGCTGCACCCGTCCGACGTCGCAAAAGTCGAGCGCATGCTGCGCGTTCGTTAA lcl|NC_002947.4_cds_WP_003250671.1_2508 [gene=rplT] [locus_tag=PP_RS12870] [protein=50S ribosomal protein L20] [protein_id=WP_003250671.1] [location=2812350..2812706] [gbkey=CDS] ATGGCTCGTGTTAAGCGCGGCGTTATCGCTCGTAAGCGTCACAAAAAAATTCTGAAACTGGCTAAAGGTT ACTACGGTGCACGTTCGCGCGTATTCCGTGTAGCCAAGCAAGCGGTCATCAAGGCAGGCCAATACGCCTA CCGCGACCGTCGCCAGAAGAAGCGTCAGTTCCGCGCACTGTGGATCGCTCGTATCAACGCCGGTGCCCGC ACCAACGGTCTGTCCTACAGCCGTCTGATTGCTGGCCTGAAAAAGGCTTCGATCGAAATCGACCGTAAGG TTCTGGCTGATCTGGCAGTGAACGAAAAAGCGGCGTTTGCTGCGATTGTCGAGAAAGCTAAAGCCGTTCT GGCTTAA
Is there any way I can filter/select/pull only the genes that are annotated as "ribosomal"/"chaperone"/etc.? I believe KEGG database has this "filtering by annotation" feature, but not all the genomes of my bacteria can be found in KEGG. The file format of my genes can also be of any other kind of format downloadable from GenBank (the format of the example above is "cds_from_genome.fa", I am using it for now because it is readable using coRdon package in R).
Also, after I have my reference sets of genes from each genome of my bacteria, I would like to compare them and find genes common across all my genomes, any tips for that as well?
Much thanks!
-D
If you're trying to extract these keywords from "cds_from_genome.fa", you can try:
You only need to search the headers, so if you have other files that do not have the sequences, or have ">" before the headers so that you can only get the headers (using grep ">" or awk ), and then search for the keywords, that might be more efficient.