retrieve multiple bacterial genes and flanking regions
1
2
Entering edit mode
5.5 years ago
benys ▴ 20

Hi there

I would like to retrieve the DNA sequences plus ~200 bp flanking regions of hundreds of bacterial genes of a single species . How could this task be automated? Any help would be appreciated!

gene • 1.8k views
ADD COMMENT
1
Entering edit mode
5.5 years ago
thackl ★ 3.0k

Have a look at https://github.com/shenwei356/seqkit and its subseq command: "get subsequences by region/gtf/bed, including flanking sequences". All you need is fasta file of you genome of interest and a gff file (gene annotations). Command could look like this:

seqkit subseq --up-stream 200 --downstream 200 --gtf genome.gff genome.fa
ADD COMMENT
0
Entering edit mode

Thank you thackl. I will definitely give it a try.

ADD REPLY
0
Entering edit mode

Hi again, I've played with seqkit a bit and managed to get the sequences as you suggested. Suppose one would like to get only ORFs that have a TAA stop codon, but still carry the 200 bp flank. Could this be done with seqkit?

ADD REPLY
1
Entering edit mode

Yes, something like the snippet below could work. Needs one little extra piece of command line magic, though. I'm assuming you are working on a bash-like command line:

# get orfs with flanks as before
seqkit subseq --up-stream 200 --downstream 200 --gtf genome.gff genome.fa > all-orfs-with-flanks.ffn
# get orfs w/o flanks
seqkit subseq --gtf genome.gff genome.fa > all-orfs-without-flank.ffn
# grep only sequences that end with TAA and get their ids
seqkit grep -rsp 'TAA$'  all-orfs-without-flank.ffn | perl -ne 's/^>// && print' > orfs-with-TAA-stop-ids.txt
# extract those sequences from the orfs with flanks
seqkit grep -f orfs-with-TAA-stop-ids.txt all-orfs-with-flanks.ffn > orfs-with-TAA-stop-and-flank.ffn
ADD REPLY
0
Entering edit mode

Many thanks! Yes, I use bash. I will play with these commands and let you know.

ADD REPLY
0
Entering edit mode

Hey, thackl All commands except the the last one worked perfectly well. After playing a bit with the sequences it seems to me that there is a subtle matching problem between the sequence ids on the ffn file and the ones in the list orfs-with-TAA-stop-ids.txt I manually copied/pasted a couple of sequences from the all-orfs-with-flank.ffn file into a new file and the corresponding ids into another file and then the 'seqkit grep -f' command worked out. The funny thing is that the ids in the original all-orfs-with-flank.ffn and orfs-with-TAA-stop-ids.txt files look identical. Any idea how to fix this?

ADD REPLY
1
Entering edit mode

Hmm, you're right. seqkit uses coordinates for ids instead of the gene_id from the gff... Hadn't thought about that. I don't have an easy fix for that atm. Need to think about it

ADD REPLY
2
Entering edit mode

OK, this worked for me on a dummy example. Minor fixes to old code & strip the flank annotations from IDs. Only problem, the final file doesn't contain "gene names" but id as chr:start-end...

seqkit subseq --up-stream 200 --down-stream 200 --gtf genome.gff genome.fna > all-orfs-with-flanks.ffn
seqkit subseq --gtf genome.gff genome.fna > all-orfs-without-flank.ffn
seqkit grep -rsp 'TAA$'  all-orfs-without-flank.ffn | perl -ne 's/^>//;s/ $//; print' > orfs-with-TAA-stop-ids.txt
perl -pe 's/>(\S+_\d+-\d+:.).*/>$1/' all-orfs-with-flanks.ffn |
 seqkit grep -f orfs-with-TAA-stop-ids.txt > orfs-with-TAA-stop-and-flank.ffn

ADD REPLY
0
Entering edit mode

It worked! If you would manage to find out how to get the gene names please let me know. Thanks a lot!

ADD REPLY

Login before adding your answer.

Traffic: 1604 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6