Question

Extract a specific subsequence from a fasta file

0

Entering edit mode

7.9 years ago

PAn ▴ 20

Hi,

I have a 14 bp long sequence and I need to extract 10 bp +/- of this sequence from a list of fatsa sequences. Is there a tool that I can use to get the sequence string specific trimmed reads?

The substring is GCACGAAGTTTTGC a sample read from the fasta file looks like -

>58_1_uni0133|DBLa
TGTAAGCTTAGTCACAAATTCCATACTAATATAACATATGAATACGAGAGAGATCCTTGT
CATGGAAGAAAAGAAAATCGTTTTGATGAAAATGAGGAATTTGAATGTGGAACTAAAATA
CGTGATTATAATAAAAAAGATTCTGGTACAGCATGTGCACCATTCAGAAGACAAAATATG
TGTGATAAAAATTTAGAATATTTGATCAATAAAAACACAGAAAATACTGATGATTTGTTA
GGAAATGTATTGGTTACAGCAAAATATGAAGGTGAATCTATTGTTGCGAAGCATCCACAT
AAAGACAATTCACAAGTATGTACTGCACTTGCACGAAGTTTTGCAGATATAGGAGATATT
GTAAGAGGAAGAGATATGTTTTTACCTAATAAGGATGATAAAGTACAAAAAGGACTACAA
GTAGTTTTCGAGAAAATAAATAATGGATTGAAGAAAATAGGAATTAATGCTTATAATGAT
GGATCTGGAAATTATTCTAAATTAAGAGAAGTTTGGTGGAATGTGAATAGAGACCAGGTA
TGGAGAGCTATAACATGTTCAGCACCAGGTGATGTTAATTATTTTAGAAAAATTTCAGGA
GACACTAGGACCTTTGAAAA

and the fasta file has around 170 reads with variable length of 600 to 800 bp. I tried to find tools but has no success. Is there a better alternative than writing a code?

Thanks! Ankita

fatsa sequence extract • 3.4k views

ADD COMMENT • link updated 7.9 years ago by IP ▴ 770 • written 7.9 years ago by PAn ▴ 20

1

Entering edit mode

7.9 years ago

IP ▴ 770

Even though you would need to code a little, this will help you, this how I would proceed:

First of all, I will identify the location of your sequences in the fasta file using a script posted in an old post: Finding a sequence in a fasta file that can be accesed here https://github.com/dariober/bioinformatics-cafe/tree/master/fastaRegexFinder

This will output a bed file of the locations of the sequence in your fasta. Then, programming just a little, you could add and subtract 10 to the locations in your bed file, and finally you can use getfasta from bedtools to get the sequences you want

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD COMMENT • link 7.9 years ago by IP ▴ 770

1

Entering edit mode

Okay @Pierre Lindenbaum answer is going to work much faster than mine

ADD REPLY • link 7.9 years ago by IP ▴ 770

score 5 · Accepted Answer · 2017-02-01

5

Entering edit mode

7.9 years ago

Pierre Lindenbaum 164k

using awk:

awk  '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | awk -F '\t' '{x=index($2,"GCACGAAGTTTTGC");if(!x) next; B=(x<10?1:x-10);E=x+10+14;printf("%s\n%s\n",$1,substr($2,B,E-B));}' 

>58_1_uni0133|DBLa
TACTGCACTTGCACGAAGTTTTGCAGATATAGGA