Question

small sequence ectraction

1

Entering edit mode

9.4 years ago

harpreetmanku04 ▴ 10

Can anyone tell me how can I do it

I have two files. file1 = 1.fasta asd is like

>123.1
tgcgtgctagctgacctgcgtgcagctgc
>123.2
tcgtcgatcgacgtgcagctgactgcttgct
>123.3
acccggtgcggggggtcgatcgacgtc

file2 = result.ods file in ubuntu and contains data as:

id                seq_start_pos          seq_end_pos
123.1           10                                15
123.2            11                               18
123.3              8                               16

and I want a script which can generate output like:

seq_is         small_seq                   large_seq
123.1             ctgac                        ctagctgacctgc
123.2            cgtgcagc                 tcgacgtgcagctgac
123.3            cggggggt                 ggtgcggggggtcgat

This is the proper format of result which I want. actually small seq is region between 15-10 =5 bp in seq file 1 and large seq is 4-4 bp up/downstream of the region 10-15 .

Basically I want to extract small_seq along with 4-4 bp upstream ad downstream in excel file

If anyone can provide me script for extracting this region, I shall be highly thankful to him/her.

RNA-seq • 1.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.4 years ago by harpreetmanku04 ▴ 10

1

Entering edit mode

have you tried anything from your end ? You can play around with bedtools getfasta and command paste to achieve what you want or if you know some python, its easy with pyfaidx

ADD REPLY • link updated 9.4 years ago by Matt Shirley 10k • written 9.4 years ago by GouthamAtla 12k

0

Entering edit mode

I ran this commad biy it is not working

awk '
BEGIN           {print "sequence id\textracted region small\textracted region big upstream and downstream"
                }
NR==FNR &&
FNR>1           {CNT[$1]++
                 S[$1,CNT[$1]]=$2
                 E[$1,CNT[$1]]=$3
                 next
                }
                {split ($1, T, " ")
                }
T[1] in CNT     {i=T[1]
                 $1=x
                 for (j=1; j<=CNT[T[1]]; j++)
                        print RS i "\t" substr ($0,S[i,j],E[i,j]-S[i,j]+1) "\t" substr ($0, S[i,j]-100, E[i,j]-S[i,j]+201)
                }
' file2 RS=\> FS='\n' OFS= file1

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

see my answer here: A: How to extract a region of protein or small domain '150-300' from a long sequenc

ADD REPLY • link 9.4 years ago by Pierre Lindenbaum 164k

Ram · Answer 1 · 2015-08-07

regions.bed

123    5    10
125    8    13

test.fasta

>123
AGTCGCTCGGCGGCTAGTCGCTCGGCGGCT
CGGCTGCGGCCGGCTGCGGCTGCGGCCGGCTG
>125
AGGATCGGCTCGGCGGAGGATCGGCTCGGCGG
CTCGGCTGCGCTCGGCTCGGCTGCGCTCGG

Using bedtools:

bedtools getfasta -fi test.fasta -bed regions.bed -fo small.fasta
awk '{ print $1"\t"$2-4"\t"$3+4}' regions.bed | bedtools getfasta -fi test.fasta -bed - -fo long.fasta
paste small.fasta long.fasta | paste - - | awk '{ gsub(">.*:",":",$2); print $1$2"\t"$3"\t"$4}'

Using pyfaidx:

from pyfaidx import Fasta
sq = Fasta("test.fasta")
flank=4
with open("regions.txt") as regions:
    for line in regions:
        cord=line.split()
        print sq[cord[0]].name,sq[cord[0]][int(cord[1]):int(cord[2])],sq[cord[0]][int(cord[1])-flank:int(cord[2])+flank]

Note: When adding the flanking regions, make sure the coordinates don't go out of length(sequence). Otherwise the programs scold you with error messages.