Extract User Defined Region From An Fasta File
5
11
Entering edit mode
13.4 years ago
Palu ▴ 290

I have a file with multi fasta format sequence

>lyrata FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVVSEVIPQYTGSQGTKCLLRQYKPAISNKVSSLHAFGWNSSKNIVSIRLFSLSWG
FNLSEIPESTTKGAANKKTKGAGGRKKK

>Arabidopsis thaliana FEN1
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
KEDIEKYSKRTVKVTKQHNDDCKRLLRLMGVPVVEATSEAEAQCAALCKSGKVYGVASED
MDSLTFGAPKFLRHLMDPSSRKIPVMEFEVAKILEELQLTMDQFIDLCILSGCDYCDSIR
GIGGQTALKLIRQHGSIETILENLNKERYQIPEEWPYNEARKLFKEPDVITDEEQLDIKW
TSPDEEGIVQFLVNENGFNIDRVTKAIEKIKTAKNKSSQGRLESFFKPVANSSVPAKRKG
NLICLVASEVIPNILDHRVPNACSGSINQPSAIKSPLCMLLDGTPPITLLALGCSLYPGL
PTFRFTWLEEIPESTTKGAANKKTKGAGGRKKK

>Brachypodium distachyon - II FEN1
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
KKHNDDCKKLLRLMGVPVVEAPGEAEAQCASLCKNHKAYAVASEDMDSLTFGSLRFLRHI
TDLSFKRSPVTEFEVPKVLEELGLTMDQFIDLCILSGCDYCENIKGIGGQRALKLIRQHG
CIEEVVQNLNNRFTVPEDWPYQEVRTLFKEPNVCTEIPDFQWTSVDKEGIVNFLAIENSF
SSDRVEKAVQKIKAARDRYSPGRVKLLTPVANLSGSITKKEPECVLGSSGQGLKSWSALQ
VCRSSSSDFRYRSYYSSKPLVLGRQSGFHGIPHAFSLI

and i want to extract sequence of defined region. these parameters are present in other file

lyrata            1      108 
lyrata             147    234 
Arabidopsis          1    108 
Arabidopsis        147    234 
Brachypodium         1     93 
Brachypodium       132    219

I have asked similar question before and got an awk solution but it gives error. Please help me. i am a window user.

sequence fasta • 77k views
ADD COMMENT
1
Entering edit mode

HI, thanks for answering my previous and this question.. I found a very good solution of my problem ans love to share with ther have same problem. visit this page. it's working good

ADD REPLY
0
Entering edit mode

This is not a discussion forum, It's Q&A. Youve posted this as an 'answer' which doesn't make sense. To reply to another answer use the comment button underneath it, like I am here.

ADD REPLY
27
Entering edit mode
13.4 years ago
Fred ▴ 790

You can also use the samtools package. Samtools allows you to index a fasta file. Here is an excerpt from the manual:

samtools faidx ref.fasta [region1[...]] : Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.

Once you indexed your reference, you can query it.

For example:

samtools faidx reference.fasta
samtools faidx reference.fasta lyrata:1-108

Samtools is designed to handle nucleotides, so I am not sure that it will work with protein sequences.

To do it (in perl) for several positions (given in a file formated like in your question):

open(POSITIONS,"positions.txt");
while(<POSITIONS>){
    chomp;
    my ($seqName,$begin,$end) = split(/\s/);
    open(SAMTOOLS,"/path/to/samtools faidx /path/to/reference.fasta $seqName:$begin-$end |");
    while(my $line = <SAMTOOLS>){
    print $line;
    }
    close(SAMTOOLS);
}
close(POSITIONS);
ADD COMMENT
1
Entering edit mode

can it take batch query present in a file??

ADD REPLY
0
Entering edit mode

you can always use a small perl script with a loop to read in your query one by one and split out the output one by one

ADD REPLY
0
Entering edit mode

I edited my post to take into account your comment, and thus do it for several coordinates.

ADD REPLY
0
Entering edit mode

Hi, can u please provide me a script for extracting region as follow from a file:

>gi|546867469
TAGCTACATCAGTTGAAGATGGGGTTGCTGGGTAGCTGTCAGGCATAGGAGTCCCATCTTTCTCGGAAAG
AATGACAATAAGGTTGTTAAAGGGCAATCAGTCAAGGACAGGGGGAGCCACTCAAATCAGCACATGTGGC
ATGACAGTGATGCAGCATTCCCCTCAACCACCAATTGCTCCCAGGAATAGCACCTGGCTGGTCTGGCATC
TCGGGATGAGGGCTAAGGCACAGTGCCATTAGTGCCAAGAACCAAAGATCTGAGTCTTTGCAGACCCTAA
AAATTAAATCCTTTCTGTGCCCCGGGTGGAGGGGGAGAGTTCAAGTCACTTTATAATATTTGCCTCATAT
ACATATCACCTGTCCCCTTCCACCTCCTTTTTCTTGCAGTCCTAATTATCCAAGTAGCATCTACAACTCT

region to extract from 150 to 160 20-20 up/down stream

Kindly help me to extract this region. with perl script or awk command.

Thanku.

ADD REPLY
0
Entering edit mode

Hi,

I have a genome sequence of different chromosome file and a region file. I need to fetch sequence according region position.

kindly help if you got solution already using perl or any other.

file 1:

5087 5287
20254 23084
28710 28811
30468 30566

File 2:

>Pt 1154478:1
ATGGGCGAACGACGGGAATTGAACCCGCGATGGTGAATTCACAATCCACTGCCTTAATCC
ACTTGGCTACATCCGCCCCTACGCTACTATCTATTCTTTTTTGTATTGTCTAAAAAAAAA
AAAAAATACAAATTTCAATAAAAAATAAAAAAAGGTAGCAAATTCCACCTTATTTTTTTT
CTAATAAAAAATATATAGTAATTTTTTATTATTTATTATTATTATTTATTATTAATATAA
TAAATAAAGTAAAATATGATACTCTATAAAAATTTGCTCATTTTTATAGAAAAAAACGAG
TAATATAAGCCCTCTTTCTTATTTAAAGAAGGCTTATATTGCTCGTTTTTTACTAAACTA
GATCTAGACTAACACTAACGAATTATCCATTTGTAGATGGAGCCTCAACAGCAGCTAGGT
CTAGAGGGAAGTTGTGAGCATTACGTTCATGCATAACTTCCATACCAAGGTTAGCACGGT
TAATAATATCAGCCCAAGTATTAATAACACGTCCTTGACTATCAACTACTGATTGGTTGA
ADD REPLY
0
Entering edit mode

Please help me. i am a window user.

Samtools is not a great solution for a Windows user. You might use pyfaidx, since it works well on Windows.

faidx --delimiter=' ' --bed regions.bed input.fasta

The above command will split your fasta identifiers on spaces, so that you can use the regions you provided to query just the species names.

ADD REPLY
0
Entering edit mode

pyfaidx showing command not found

ADD REPLY
0
Entering edit mode

If you"ve installed pyfaidx using pip install --user pyfaidx then the script will be at $HOME/.local/bin/faidx and if installed using sudo pip install pyfaidx the script will be at /usr/local/bin/faidx

ADD REPLY
0
Entering edit mode

Can you do this with Rsamtools or another Bioconductor packages?

ADD REPLY
6
Entering edit mode
13.4 years ago
Christian ★ 3.1k

BioPerl is your friend:

use Bio::DB::Fasta;
my $db = Bio::DB::Fasta->new('myfile.fasta');
my $seq = $db->seq('lyrata', 1 => 108);

See also the documentation for Bio::DB::Fasta

ADD COMMENT
3
Entering edit mode
7.4 years ago
tuomastik ▴ 30

It's also possible to use bedtools getfasta, which:

extracts sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file.

ADD COMMENT
1
Entering edit mode
13.4 years ago
Vitis ★ 2.6k

the command-line blast package will do. first use formatdb with -o option to build a database of your fasta, then use fastacmd with -s -L options to get the regions.

ADD COMMENT
0
Entering edit mode

could u give full command example for that...

thanks

ADD REPLY
0
Entering edit mode

-o option showing unknown argument error

ADD REPLY
1
Entering edit mode
9.3 years ago
daniel ▴ 30

Just thought I'd give my C++ library, Bioio, a mention here. Although primarily a C++ API for FASTA/Q I/O, I also include a small command line application demo fasta.cpp which can be used to retrieve indexed FASTA subsequences, e.g.

./fasta human_g1k_v37.fasta X:10000-20000

Like samtools, the result is output to stdout, but unlike samtools only the sequence itself is output, which may be more useful if piping to other applications etc.

ADD COMMENT

Login before adding your answer.

Traffic: 1713 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6