Entering edit mode
5.5 years ago
benys
▴
20
Hi there
I would like to retrieve the DNA sequences plus ~200 bp flanking regions of hundreds of bacterial genes of a single species . How could this task be automated? Any help would be appreciated!
Thank you thackl. I will definitely give it a try.
Hi again, I've played with seqkit a bit and managed to get the sequences as you suggested. Suppose one would like to get only ORFs that have a TAA stop codon, but still carry the 200 bp flank. Could this be done with seqkit?
Yes, something like the snippet below could work. Needs one little extra piece of command line magic, though. I'm assuming you are working on a bash-like command line:
Many thanks! Yes, I use bash. I will play with these commands and let you know.
Hey, thackl All commands except the the last one worked perfectly well. After playing a bit with the sequences it seems to me that there is a subtle matching problem between the sequence ids on the ffn file and the ones in the list orfs-with-TAA-stop-ids.txt I manually copied/pasted a couple of sequences from the all-orfs-with-flank.ffn file into a new file and the corresponding ids into another file and then the 'seqkit grep -f' command worked out. The funny thing is that the ids in the original all-orfs-with-flank.ffn and orfs-with-TAA-stop-ids.txt files look identical. Any idea how to fix this?
Hmm, you're right. seqkit uses coordinates for ids instead of the gene_id from the gff... Hadn't thought about that. I don't have an easy fix for that atm. Need to think about it
OK, this worked for me on a dummy example. Minor fixes to old code & strip the flank annotations from IDs. Only problem, the final file doesn't contain "gene names" but id as chr:start-end...
It worked! If you would manage to find out how to get the gene names please let me know. Thanks a lot!