Extract a specific subsequence from a fasta file
2
0
Entering edit mode
7.9 years ago
PAn ▴ 20

Hi,

I have a 14 bp long sequence and I need to extract 10 bp +/- of this sequence from a list of fatsa sequences. Is there a tool that I can use to get the sequence string specific trimmed reads?

The substring is GCACGAAGTTTTGC a sample read from the fasta file looks like -

>58_1_uni0133|DBLa
TGTAAGCTTAGTCACAAATTCCATACTAATATAACATATGAATACGAGAGAGATCCTTGT
CATGGAAGAAAAGAAAATCGTTTTGATGAAAATGAGGAATTTGAATGTGGAACTAAAATA
CGTGATTATAATAAAAAAGATTCTGGTACAGCATGTGCACCATTCAGAAGACAAAATATG
TGTGATAAAAATTTAGAATATTTGATCAATAAAAACACAGAAAATACTGATGATTTGTTA
GGAAATGTATTGGTTACAGCAAAATATGAAGGTGAATCTATTGTTGCGAAGCATCCACAT
AAAGACAATTCACAAGTATGTACTGCACTTGCACGAAGTTTTGCAGATATAGGAGATATT
GTAAGAGGAAGAGATATGTTTTTACCTAATAAGGATGATAAAGTACAAAAAGGACTACAA
GTAGTTTTCGAGAAAATAAATAATGGATTGAAGAAAATAGGAATTAATGCTTATAATGAT
GGATCTGGAAATTATTCTAAATTAAGAGAAGTTTGGTGGAATGTGAATAGAGACCAGGTA
TGGAGAGCTATAACATGTTCAGCACCAGGTGATGTTAATTATTTTAGAAAAATTTCAGGA
GACACTAGGACCTTTGAAAA

and the fasta file has around 170 reads with variable length of 600 to 800 bp. I tried to find tools but has no success. Is there a better alternative than writing a code?

Thanks! Ankita

fatsa sequence extract • 3.4k views
ADD COMMENT
5
Entering edit mode
7.9 years ago

using awk:

awk  '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta | awk -F '\t' '{x=index($2,"GCACGAAGTTTTGC");if(!x) next; B=(x<10?1:x-10);E=x+10+14;printf("%s\n%s\n",$1,substr($2,B,E-B));}' 

>58_1_uni0133|DBLa
TACTGCACTTGCACGAAGTTTTGCAGATATAGGA
ADD COMMENT
0
Entering edit mode

This works perfectly!

ADD REPLY
1
Entering edit mode
7.9 years ago
IP ▴ 770

Even though you would need to code a little, this will help you, this how I would proceed:

First of all, I will identify the location of your sequences in the fasta file using a script posted in an old post: Finding a sequence in a fasta file that can be accesed here https://github.com/dariober/bioinformatics-cafe/tree/master/fastaRegexFinder

This will output a bed file of the locations of the sequence in your fasta. Then, programming just a little, you could add and subtract 10 to the locations in your bed file, and finally you can use getfasta from bedtools to get the sequences you want

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD COMMENT
1
Entering edit mode

Okay @Pierre Lindenbaum answer is going to work much faster than mine

ADD REPLY

Login before adding your answer.

Traffic: 2046 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6