small sequence ectraction
1
1
Entering edit mode
9.4 years ago

Can anyone tell me how can I do it

I have two files. file1 = 1.fasta asd is like

>123.1
tgcgtgctagctgacctgcgtgcagctgc
>123.2
tcgtcgatcgacgtgcagctgactgcttgct
>123.3
acccggtgcggggggtcgatcgacgtc

file2 = result.ods file in ubuntu and contains data as:

id                seq_start_pos          seq_end_pos
123.1           10                                15
123.2            11                               18
123.3              8                               16

and I want a script which can generate output like:

seq_is         small_seq                   large_seq
123.1             ctgac                        ctagctgacctgc
123.2            cgtgcagc                 tcgacgtgcagctgac
123.3            cggggggt                 ggtgcggggggtcgat

This is the proper format of result which I want. actually small seq is region between 15-10 =5 bp in seq file 1 and large seq is 4-4 bp up/downstream of the region 10-15 .

Basically I want to extract small_seq along with 4-4 bp upstream ad downstream in excel file

If anyone can provide me script for extracting this region, I shall be highly thankful to him/her.

RNA-seq • 1.7k views
ADD COMMENT
1
Entering edit mode

have you tried anything from your end ? You can play around with bedtools getfasta and command paste to achieve what you want or if you know some python, its easy with pyfaidx

ADD REPLY
0
Entering edit mode

I ran this commad biy it is not working

awk '
BEGIN           {print "sequence id\textracted region small\textracted region big upstream and downstream"
                }
NR==FNR &&
FNR>1           {CNT[$1]++
                 S[$1,CNT[$1]]=$2
                 E[$1,CNT[$1]]=$3
                 next
                }
                {split ($1, T, " ")
                }
T[1] in CNT     {i=T[1]
                 $1=x
                 for (j=1; j<=CNT[T[1]]; j++)
                        print RS i "\t" substr ($0,S[i,j],E[i,j]-S[i,j]+1) "\t" substr ($0, S[i,j]-100, E[i,j]-S[i,j]+201)
                }
' file2 RS=\> FS='\n' OFS= file1
ADD REPLY
1
Entering edit mode
9.4 years ago

regions.bed

123    5    10
125    8    13

test.fasta

>123
AGTCGCTCGGCGGCTAGTCGCTCGGCGGCT
CGGCTGCGGCCGGCTGCGGCTGCGGCCGGCTG
>125
AGGATCGGCTCGGCGGAGGATCGGCTCGGCGG
CTCGGCTGCGCTCGGCTCGGCTGCGCTCGG

Using bedtools:

bedtools getfasta -fi test.fasta -bed regions.bed -fo small.fasta
awk '{ print $1"\t"$2-4"\t"$3+4}' regions.bed | bedtools getfasta -fi test.fasta -bed - -fo long.fasta
paste small.fasta long.fasta | paste - - | awk '{ gsub(">.*:",":",$2); print $1$2"\t"$3"\t"$4}'

Using pyfaidx:

from pyfaidx import Fasta
sq = Fasta("test.fasta")
flank=4
with open("regions.txt") as regions:
    for line in regions:
        cord=line.split()
        print sq[cord[0]].name,sq[cord[0]][int(cord[1]):int(cord[2])],sq[cord[0]][int(cord[1])-flank:int(cord[2])+flank]

Note: When adding the flanking regions, make sure the coordinates don't go out of length(sequence). Otherwise the programs scold you with error messages.

ADD COMMENT

Login before adding your answer.

Traffic: 1924 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6