Question

retrieve multiple bacterial genes and flanking regions

5.5 years ago

benys ▴ 20

Hi there

I would like to retrieve the DNA sequences plus ~200 bp flanking regions of hundreds of bacterial genes from a single species. How could this task be automated? Any help would be appreciated!

gene • 1.8k views

ADD COMMENT • link updated 5.5 years ago by thackl ★ 3.0k • written 5.5 years ago by benys ▴ 20

score 1 · Answer 1 · 2019-06-11

5.5 years ago

thackl ★ 3.0k

Have a look at https://github.com/shenwei356/seqkit and its subseq command: "get subsequences by region/gtf/bed, including flanking sequences". All you need is a FASTA file of your genome of interest and a GFF file (gene annotations). The command could look like this:

seqkit subseq --up-stream 200 --down-stream 200 --gtf genome.gff genome.fa
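The command above extracts every feature listed in the annotation. If only a few hundred specific genes are wanted, one option is to subset the GFF first and pass the smaller file to seqkit. The sketch below uses plain awk on toy data. The file names (toy.gff, wanted-genes.txt, subset.gff) and the assumption that genes are identified by a `locus_tag=` attribute on CDS rows are illustrative only, so adjust them to whatever your annotation actually contains.

```shell
# Toy annotation (hypothetical data) standing in for genome.gff.
printf '##gff-version 3\n' > toy.gff
printf 'chr1\tdemo\tCDS\t10\t99\t.\t+\t0\tID=cds1;locus_tag=ABC_0001\n' >> toy.gff
printf 'chr1\tdemo\tCDS\t200\t350\t.\t-\t0\tID=cds2;locus_tag=ABC_0002\n' >> toy.gff
printf 'chr1\tdemo\tCDS\t400\t520\t.\t+\t0\tID=cds3;locus_tag=ABC_0003\n' >> toy.gff

# One wanted locus_tag per line (hypothetical list).
printf 'ABC_0001\nABC_0003\n' > wanted-genes.txt

# Keep only CDS rows whose locus_tag appears in the list.
awk -F'\t' '
  NR == FNR { want[$1] = 1; next }     # first file: load the wanted tags
  /^#/ { next }                         # skip GFF comment lines
  $3 == "CDS" && match($9, /locus_tag=[^;]+/) {
    tag = substr($9, RSTART + 10, RLENGTH - 10)   # text after "locus_tag="
    if (tag in want) print
  }
' wanted-genes.txt toy.gff > subset.gff
```

subset.gff could then take the place of genome.gff in the seqkit subseq call.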

ADD COMMENT • link 5.5 years ago by thackl ★ 3.0k

Thank you thackl. I will definitely give it a try.

ADD REPLY • link 5.5 years ago by benys ▴ 20

Hi again, I've played with seqkit a bit and managed to get the sequences as you suggested. Suppose one would like to get only ORFs that have a TAA stop codon, but still carry the 200 bp flank. Could this be done with seqkit?

ADD REPLY • link 5.5 years ago by benys ▴ 20

Yes, something like the snippet below could work. Needs one little extra piece of command line magic, though. I'm assuming you are working on a bash-like command line:

# get orfs with flanks as before
seqkit subseq --up-stream 200 --down-stream 200 --gtf genome.gff genome.fa > all-orfs-with-flanks.ffn
# get orfs w/o flanks
seqkit subseq --gtf genome.gff genome.fa > all-orfs-without-flank.ffn
# grep only sequences that end with TAA and get their ids
seqkit grep -rsp 'TAA$'  all-orfs-without-flank.ffn | perl -ne 's/^>// && print' > orfs-with-TAA-stop-ids.txt
# extract those sequences from the orfs with flanks
seqkit grep -f orfs-with-TAA-stop-ids.txt all-orfs-with-flanks.ffn > orfs-with-TAA-stop-and-flank.ffn
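Whether a feature's coordinates include the stop codon depends on the annotation, so before filtering on 'TAA$' it may be worth checking what the extracted ORFs actually end with. The following is a minimal sketch with awk on a toy file. The sequences and the file names toy-orfs.ffn and stop-codon-tally.txt are made up for illustration, with toy-orfs.ffn standing in for all-orfs-without-flank.ffn.

```shell
# Toy ORF file (hypothetical sequences); the first record is wrapped over two lines.
cat > toy-orfs.ffn <<'EOF'
>chr1_10-21:+
ATGAAACCC
TAA
>chr1_30-38:-
ATGCCCTAG
>chr1_50-58:+
ATGGGGTAA
EOF

# Join wrapped sequence lines per record, then tally the final three bases.
awk '
  /^>/ {
    if (seq != "") tally[toupper(substr(seq, length(seq) - 2))]++
    seq = ""
    next
  }
  { seq = seq $0 }
  END {
    if (seq != "") tally[toupper(substr(seq, length(seq) - 2))]++
    for (codon in tally) print codon "\t" tally[codon]
  }
' toy-orfs.ffn | sort > stop-codon-tally.txt
```

If the tally is dominated by triplets that are not stop codons, the features probably exclude the stop codon and the 'TAA$' pattern would need rethinking.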

ADD REPLY • link 5.5 years ago by thackl ★ 3.0k

Many thanks! Yes, I use bash. I will play with these commands and let you know.

ADD REPLY • link 5.5 years ago by benys ▴ 20

Hey thackl, all commands except the last one worked perfectly well. After playing a bit with the sequences, it seems to me that there is a subtle matching problem between the sequence IDs in the ffn file and the ones in the list orfs-with-TAA-stop-ids.txt. I manually copied a couple of sequences from all-orfs-with-flanks.ffn into a new file and the corresponding IDs into another file, and then the 'seqkit grep -f' command worked. The funny thing is that the IDs in the original all-orfs-with-flanks.ffn and orfs-with-TAA-stop-ids.txt files look identical. Any idea how to fix this?

ADD REPLY • link 5.5 years ago by benys ▴ 20

Hmm, you're right. seqkit uses coordinates for the IDs instead of the gene_id from the GFF... I hadn't thought about that. I don't have an easy fix for that at the moment. I need to think about it.

ADD REPLY • link 5.4 years ago by thackl ★ 3.0k

OK, this worked for me on a dummy example. It includes minor fixes to the old code and strips the flank annotations from the IDs. The only remaining problem: the final file doesn't contain gene names, just coordinate-based IDs (chr_start-end:strand)...

# get ORFs with 200 bp flanks on both sides
seqkit subseq --up-stream 200 --down-stream 200 --gtf genome.gff genome.fna > all-orfs-with-flanks.ffn
# get ORFs without flanks
seqkit subseq --gtf genome.gff genome.fna > all-orfs-without-flank.ffn
# IDs (header lines only, first word) of ORFs that end with TAA
seqkit grep -rsp 'TAA$' all-orfs-without-flank.ffn | perl -ne 'print "$1\n" if /^>(\S+)/' > orfs-with-TAA-stop-ids.txt
# strip the flank annotations from the IDs, then keep only the TAA ORFs
perl -pe 's/>(\S+_\d+-\d+:.).*/>$1/' all-orfs-with-flanks.ffn |
 seqkit grep -f orfs-with-TAA-stop-ids.txt > orfs-with-TAA-stop-and-flank.ffn
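To see why the ID stripping is needed, here is a small self-contained illustration on toy headers. The header shape used for the flanked file (a `_us:N_ds:N` suffix appended to the coordinate-based ID) is an assumption based on the behaviour described in this thread, so check it against your own output. All file names are hypothetical.

```shell
# Assumed header shape with flanks: chr_start-end:strand plus a flank suffix.
printf '>chr1_10-21:+_us:200_ds:200 \nACGT\n' > toy-with-flanks.ffn
# ID as it appears in the file without flanks.
printf 'chr1_10-21:+\n' > toy-ids.txt

# Extract the first word of each header and compare literally, whole line.
sed -n 's/^>\([^ ]*\).*/\1/p' toy-with-flanks.ffn > flank-ids.txt
grep -c -x -F -f toy-ids.txt flank-ids.txt > raw-matches.txt || true

# Drop the assumed flank suffix, then compare again.
sed 's/_us:[0-9]*//; s/_ds:[0-9]*//' flank-ids.txt > stripped-ids.txt
grep -c -x -F -f toy-ids.txt stripped-ids.txt > stripped-matches.txt || true
```

On this toy input the first comparison finds no match and the second finds one, which mirrors the behaviour reported above: the IDs look the same at a glance but differ by the flank suffix.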

ADD REPLY • link 5.4 years ago by thackl ★ 3.0k

It worked! If you manage to find out how to get the gene names, please let me know. Thanks a lot!

ADD REPLY • link 5.4 years ago by benys ▴ 20
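One possible way to recover gene names, sketched here under several assumptions: the coordinate-based IDs have the form chr_start-end:strand, their start, end, and strand equal columns 4, 5, and 7 of the GFF, and the GFF carries a `Name=` attribute (yours may use `gene=` or `locus_tag=` instead). The awk script builds a lookup from the GFF and appends the name to each matching FASTA header. All file names and data below are hypothetical.

```shell
# Toy inputs: one GFF row and one header already reduced to chr_start-end:strand.
printf 'chr1\tdemo\tCDS\t10\t21\t.\t+\t0\tID=cds1;Name=dnaA\n' > toy-names.gff
printf '>chr1_10-21:+\nACGTACGTATAA\n' > toy-final.ffn

awk -F'\t' '
  NR == FNR {                                  # first file: the GFF
    if ($0 !~ /^#/ && match($9, /Name=[^;]+/))
      name[$1 "_" $4 "-" $5 ":" $7] = substr($9, RSTART + 5, RLENGTH - 5)
    next
  }
  /^>/ {                                       # second file: FASTA headers
    split($0, parts, " ")
    id = substr(parts[1], 2)                   # first word without ">"
    if (id in name) { print ">" id " " name[id]; next }
  }
  { print }                                    # sequences and unmatched headers
' toy-names.gff toy-final.ffn > toy-final-named.ffn
```

Headers without a matching GFF row are passed through unchanged, so nothing is lost if some coordinates fail to match.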