Question

Extract User Defined Region From An Fasta File

11

Entering edit mode

14.1 years ago

Palu ▴ 290

I have a file with multi fasta format sequence

>lyrata FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVVSEVIPQYTGSQGTKCLLRQYKPAISNKVSSLHAFGWNSSKNIVSIRLFSLSWG
FNLSEIPESTTKGAANKKTKGAGGRKKK

>Arabidopsis thaliana FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVASEVIPNILDHRVPNACSGSINQPSAIKSPLCMLLDGTPPITLLALGCSLYPGL
PTFRFTWLEEIPESTTKGAANKKTKGAGGRKKK

>Brachypodium distachyon - II FEN1
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
KKHNDDCKKLLRLMGVPVVEAPGEAEAQCASLCKNHKAYAVASEDMDSLTFGSLRFLRHI
TDLSFKRSPVTEFEVPKVLEELGLTMDQFIDLCILSGCDYCENIKGIGGQRALKLIRQHG
CIEEVVQNLNNRFTVPEDWPYQEVRTLFKEPNVCTEIPDFQWTSVDKEGIVNFLAIENSF
SSDRVEKAVQKIKAARDRYSPGRVKLLTPVANLSGSITKKEPECVLGSSGQGLKSWSALQ
VCRSSSSDFRYRSYYSSKPLVLGRQSGFHGIPHAFSLI

and i want to extract sequence of defined region. these parameters are present in other file

lyrata            1      108 
lyrata             147    234 
Arabidopsis          1    108 
Arabidopsis        147    234 
Brachypodium         1     93 
Brachypodium       132    219

I have asked similar question before and got an awk solution but it gives error. Please help me. i am a window user.

sequence fasta • 80k views

ADD COMMENT • link updated 8.0 years ago by tuomastik ▴ 30 • written 14.1 years ago by Palu ▴ 290

1

Entering edit mode

HI, thanks for answering my previous and this question.. I found a very good solution of my problem ans love to share with ther have same problem. visit this page. it's working good

ADD REPLY • link 14.1 years ago by Palu ▴ 290

0

Entering edit mode

This is not a discussion forum, It's Q&A. Youve posted this as an 'answer' which doesn't make sense. To reply to another answer use the comment button underneath it, like I am here.

ADD REPLY • link 14.1 years ago by Karl ▴ 350

6

Entering edit mode

14.1 years ago

Christian ★ 3.1k

BioPerl is your friend:

use Bio::DB::Fasta;
my $db = Bio::DB::Fasta->new('myfile.fasta');
my $seq = $db->seq('lyrata', 1 => 108);

See also the documentation for Bio::DB::Fasta

ADD COMMENT • link updated 6.0 years ago by Ram 45k • written 14.1 years ago by Christian ★ 3.1k

3

Entering edit mode

8.0 years ago

tuomastik ▴ 30

It's also possible to use bedtools getfasta, which:

extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

ADD COMMENT • link 8.0 years ago by tuomastik ▴ 30

1

Entering edit mode

14.1 years ago

Vitis ★ 2.6k

the command-line blast package will do. first use formatdb with -o option to build a database of your fasta, then use fastacmd with -s -L options to get the regions.

ADD COMMENT • link 14.1 years ago by Vitis ★ 2.6k

0

Entering edit mode

could u give full command example for that...

thanks

ADD REPLY • link 10.0 years ago by Kumar ▴ 170

0

Entering edit mode

-o option showing unknown argument error

ADD REPLY • link 10.0 years ago by Kumar ▴ 170

1

Entering edit mode

10.0 years ago

daniel ▴ 30

Just thought I'd give my C++ library, Bioio, a mention here. Although primarily a C++ API for FASTA/Q I/O, I also include a small command line application demo fasta.cpp which can be used to retrieve indexed FASTA subsequences, e.g.

./fasta human_g1k_v37.fasta X:10000-20000

Like samtools, the result is output to stdout, but unlike samtools only the sequence itself is output, which may be more useful if piping to other applications etc.

ADD COMMENT • link updated 6.0 years ago by Ram 45k • written 10.0 years ago by daniel ▴ 30

Ram · Accepted Answer · 2011-07-22

27

Entering edit mode

14.1 years ago

Fred ▴ 790

You can also use the samtools package. Samtools allows you to index a fasta file. Here is an excerpt from the manual:

samtools faidx ref.fasta [region1[...]] : Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

Once you indexed your reference, you can query it.

For example:

samtools faidx reference.fasta
samtools faidx reference.fasta lyrata:1-108

Samtools is designed to handle nucleotides, so I am not sure that it will work with protein sequences.

To do it (in perl) for several positions (given in a file formated like in your question):

open(POSITIONS,"positions.txt");
while(<POSITIONS>){
    chomp;
    my ($seqName,$begin,$end) = split(/\s/);
    open(SAMTOOLS,"/path/to/samtools faidx /path/to/reference.fasta $seqName:$begin-$end |");
    while(my $line = <SAMTOOLS>){
    print $line;
    }
    close(SAMTOOLS);
}
close(POSITIONS);

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.1 years ago by Fred ▴ 790

1

Entering edit mode

can it take batch query present in a file??

ADD REPLY • link 14.1 years ago by Palu ▴ 290

0

Entering edit mode

you can always use a small perl script with a loop to read in your query one by one and split out the output one by one

ADD REPLY • link 14.1 years ago by Vitis ★ 2.6k

0

Entering edit mode

I edited my post to take into account your comment, and thus do it for several coordinates.

ADD REPLY • link 14.1 years ago by Fred ▴ 790

0

Entering edit mode

Hi, can u please provide me a script for extracting region as follow from a file:

>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT

region to extract from 150 to 160 20-20 up/down stream

Kindly help me to extract this region. with perl script or awk command.

Thanku.

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 10.0 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

Hi,

I have a genome sequence of different chromosome file and a region file. I need to fetch sequence according region position.

kindly help if you got solution already using perl or any other.

file 1:

5087 5287
20254 23084
28710 28811
30468 30566

File 2:

>Pt 1154478:1
ATGGGCGAACGACGGGAATTGAACCCGCGATGGTGAATTCACAATCCACTGCCTTAATCC
ACTTGGCTACATCCGCCCCTACGCTACTATCTATTCTTTTTTGTATTGTCTAAAAAAAAA
AAAAAATACAAATTTCAATAAAAAATAAAAAAAGGTAGCAAATTCCACCTTATTTTTTTT
CTAATAAAAAATATATAGTAATTTTTTATTATTTATTATTATTATTTATTATTAATATAA
TAAATAAAGTAAAATATGATACTCTATAAAAATTTGCTCATTTTTATAGAAAAAAACGAG
TAATATAAGCCCTCTTTCTTATTTAAAGAAGGCTTATATTGCTCGTTTTTTACTAAACTA
GATCTAGACTAACACTAACGAATTATCCATTTGTAGATGGAGCCTCAACAGCAGCTAGGT
CTAGAGGGAAGTTGTGAGCATTACGTTCATGCATAACTTCCATACCAAGGTTAGCACGGT
TAATAATATCAGCCCAAGTATTAATAACACGTCCTTGACTATCAACTACTGATTGGTTGA

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 9.8 years ago by Kumar ▴ 170

0

Entering edit mode

Please help me. i am a window user.

Samtools is not a great solution for a Windows user. You might use pyfaidx, since it works well on Windows.

faidx --delimiter=' ' --bed regions.bed input.fasta

The above command will split your fasta identifiers on spaces, so that you can use the regions you provided to query just the species names.

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 10.0 years ago by Matt Shirley 10k

0

Entering edit mode

pyfaidx showing command not found

ADD REPLY • link 10.0 years ago by Kumar ▴ 170

0

Entering edit mode

If you"ve installed pyfaidx using pip install --user pyfaidx then the script will be at $HOME/.local/bin/faidx and if installed using sudo pip install pyfaidx the script will be at /usr/local/bin/faidx

ADD REPLY • link updated 6.0 years ago by Ram 45k • written 10.0 years ago by Matt Shirley 10k

0

Entering edit mode

Can you do this with Rsamtools or another Bioconductor packages?

ADD REPLY • link 7.7 years ago by biomagician ▴ 410