Position-specific sequence retrieval from a whole-genome sequence
8.0 years ago
psiwach29 ▴ 10

From the complete genome sequence of E. coli, I am extracting the 100 nucleotides upstream and the 50 nucleotides downstream of each TSS. The TSS positions on the forward and reverse strands are available. The sequence from NCBI is the forward strand (5' to 3', left to right), so retrieving the upstream (left of the TSS) and downstream (right of the TSS) sequences is straightforward. For the reverse strand, my reasoning is as follows: take the complement of the forward strand. It runs 3' to 5' from left to right, because positions do not change (i.e. the first nucleotide from the 5' end of the forward strand pairs with the first nucleotide from the 3' end of the complementary strand). So for the upstream sequence we take the sequence to the right of the TSS, and for the downstream sequence we take the sequence to the left of the TSS. *I need to know whether I am proceeding in the right way.*

genome sequence gene • 1.6k views
8.0 years ago

Your strategy is right. There are some tools that can help you achieve this.
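
For readers who want to see the strategy from the question spelled out, here is a minimal Python sketch. It is not part of SeqKit: the function names `revcomp` and `tss_window` are hypothetical, the TSS is assumed to be a single 1-based position in forward-strand coordinates, the TSS base itself is assumed to be included in the output, and the genome is treated as linear (windows are clipped at the ends rather than wrapped around a circular chromosome). For a reverse-strand TSS the window is taken to the right (upstream) and left (downstream) on the forward strand and then reverse-complemented, so the result reads 5' to 3'.

```python
def revcomp(seq):
    """Reverse complement of a DNA string, preserving case; N stays N."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def tss_window(genome, tss, strand, upstream=100, downstream=50):
    """Return upstream flank + TSS base + downstream flank, read 5'->3'.

    genome : forward-strand sequence (5'->3', left to right)
    tss    : 1-based TSS position in forward-strand coordinates (assumption)
    strand : "+" or "-"
    """
    if strand == "+":
        # Upstream is to the left, downstream to the right.
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # Upstream is to the right, downstream to the left.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")

    # Clip to the sequence ends (linear-genome simplification).
    start = max(start, 1)
    end = min(end, len(genome))

    window = genome[start - 1:end]  # convert 1-based inclusive to a slice
    return window if strand == "+" else revcomp(window)


if __name__ == "__main__":
    toy = "actgnACTGN"
    print(tss_window(toy, 6, "+", upstream=3, downstream=2))
    print(tss_window(toy, 6, "-", upstream=3, downstream=2))
```

For a real run you would load the genome string from the FASTA file and call `tss_window(genome, pos, strand)` with the default 100/50 flanks for each TSS.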

Here is a solution using the subseq command (see usage) of SeqKit, which provides executable binaries for Windows/Linux/Mac: just download the .tar.gz file, decompress it, and run.

Example:

Sequence:

$ cat seq.fa 
>seq
actgnACTGN

GTF file; note that tss1 is on the negative strand.

$ cat f.gtf 
seq     test    CDS     4       6       .       +       .       gene_id "cds1"; transcript_id "cds1"; 
seq     test    TSS     5       7       .       -       .       gene_id "tss1"; transcript_id "tss1";

1) Retrieving TSS sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS  seq.fa
>seq_5-7:- tss1
GTn

2) Retrieving TSS sequences along with up- and/or down-stream sequences

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --down-stream 2 seq.fa 
>seq_5-7:-_us:3_ds:2 tss1
NCAGTnca

~~Here's a bug: the sequence header does not include the down-stream information ("ds"). I'll fix this soon.~~ Fixed in v0.4.2.

3) Retrieving only the up- or down-stream sequence

$ ./seqkit subseq --gtf f.gtf --feature TSS --up-stream 3 --only-flank seq.fa 
>seq_5-7:-_usf:3 tss1
NCA

$ ./seqkit subseq --gtf f.gtf --feature TSS --down-stream 2 --only-flank seq.fa 
>seq_5-7:-_dsf:2 tss1
ca

SeqKit also supports BED files, but only the chromosome, position and strand information are used.

Thanks a lot. It helped.
