Extract Fasta Sequences Sub Sets

11.8 years ago

prp291 ▴ 70

I have a FASTA file with several sequences, like this:

>AT1G01250.1 | Symbols: | Integrase-type DNA-binding superfamily protein | chr1:104731-105309 REVERSE LENGTH=192
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL

>AT1G03800.1 | Symbols: ERF10, ATERF10 | ERF domain protein 10 | chr1:957261-957998 REVERSE LENGTH=245
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRPWGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSATPVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAVAGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLDLNASP

I have another file with coordinates, like this:

AT1G01250   45  102
AT1G03800   65  109

Now I want to extract the sequences from the first file using the coordinates given in the second file. For example, I want to extract the portion of >AT1G01250 from position 45 to position 102. Any help will be greatly appreciated. I am a Windows user.

fasta • 16k views

ADD COMMENT • link updated 9.3 years ago by Stephane Plaisance ▴ 460 • written 11.8 years ago by prp291 ▴ 70

cross posted on SO: http://stackoverflow.com/questions/19159119

ADD REPLY • link 11.8 years ago by Pierre Lindenbaum 166k

11.8 years ago

Nicolas Rosewick 11k

Install Linux.
A little Perl script might be appropriate: load the FASTA file (with BioPerl) and the position file, cut out the substring, and write the result to a file.
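
The steps outlined here (load the FASTA file and the position file, cut out the substring, write the result) can also be sketched in Python rather than Perl. This is only a minimal sketch under some assumptions: coordinates are taken as 1-based and inclusive, IDs in the position file are matched against the FASTA name up to the first dot, and `read_fasta`, `extract` and `write_fasta` are hypothetical helper names, not part of any library.

```python
import sys


def read_fasta(lines):
    """Return {id: sequence}; the id is the header text up to the first dot or space."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">AT1G01250.1 | Symbols: ..." becomes "AT1G01250"
            # (assumption: the ".1" suffix is dropped to match the position file)
            words = line[1:].split()
            name = words[0].split(".")[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def extract(records, position_lines):
    """Yield (header, subsequence) for each 'id start end' line (1-based, inclusive)."""
    for line in position_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] not in records:
            continue  # skip blank lines and unknown ids
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        yield "%s:%d-%d" % (name, start, end), records[name][start - 1:end]


def write_fasta(pairs, out, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines at `width`."""
    for header, seq in pairs:
        out.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # usage (hypothetical script name): python cut_fasta.py in.fasta positions.txt out.fasta
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as positions:
        records = read_fasta(fasta)
        with open(sys.argv[3], "w") as out:
            write_fasta(extract(records, positions), out)
```

Python runs on Windows without any extra setup beyond the interpreter itself, and opening files in text mode takes care of Windows line endings.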

ADD COMMENT • link 11.8 years ago by Nicolas Rosewick 11k

Serious question, why are people up voting this answer? No offense, but it does not come close to answering the question and is technically incorrect. This is especially confusing because there are multiple answers to the question. Is it because people think "install linux" is funny? I don't think that is very helpful and I wish we as a community would use votes more judiciously to promote responses that are correct.

ADD REPLY • link 11.8 years ago by SES 8.6k

The question is serious, and so is the answer. It answers a meta-question that is implicit in the OP's post. Namely, that person is in a research context where he/she will have to work with FASTA files or other genetics-oriented formats and do some manipulations that require scripts or custom approaches based on ongoing investigation of the data. In such a context, the meta-question is: What tools should I use to explore such data sets in a flexible and powerful manner? The answer is to install and use Linux or another UNIX-compatible system (e.g., Mac OS X) and learn some programming skills (R, Python, Perl...). I do agree, however, that this is not an answer to the OP per se, and we could encourage such suggestions to be posted as comments instead, where their value would still be high without distracting from the OP getting a usable answer.

ADD REPLY • link 11.8 years ago by Eric Normandeau 11k

Certainly, this response answers a question, but not the question, which you seem to agree with. That is not the issue though, and I don't have a problem with the answer itself. The issue is that there are actual working answers to the question and this response has the most up votes. I don't feel that it's too idealistic to think that the most helpful answers are the ones that should be promoted and rewarded.

ADD REPLY • link 11.8 years ago by SES 8.6k

You don't need linux to use Perl or Bioperl.

ADD REPLY • link 11.8 years ago by SES 8.6k

I know but for bioinformatics yes ;)

ADD REPLY • link 11.8 years ago by Nicolas Rosewick 11k

Yes, I know that, but I am looking for that Perl script, if anyone can help me with it.

ADD REPLY • link 11.8 years ago by prp291 ▴ 70

Yes, but next week you will be looking for another Perl script, and then another, and another. I think you would do yourself a big favour if you take the time to learn Linux and either Python or Perl. If you think bioinformatics will be important in your project, make this priority number 1. In the long run, you will a) save time b) do better science c) explore more interesting questions and d) have a better career. ;)

ADD REPLY • link 11.8 years ago by Eric Normandeau 11k

11.8 years ago

Pierre Lindenbaum 166k

Using Windows and your browser: if your set of sequences is not too big, you can try the following HTML5 page (adapted from "How to compute the nucleotide composition of large number of sequences (preferably with an online tool)"):

	<html xmlns='http://www.w3.org/1999/xhtml'>

	<body>
	<form>
	<label for="ZmlsZXMKfiles">Select a FASTA file</label>:<input type="file" id="ZmlsZXMKfiles" multiple="true"/>
	<label for="e0503f587a8d79efiles">Select a BED file</label>:<input type="file" id="e0503f587a8d79efiles" multiple="true"/>
	</form>
	<pre id="ac921c19f1aff"></pre>
	<script type="application/ecmascript">

	var all_fasta={};
	var all_bed=[];

	function findSubSequences()
	{


	var pre=document.getElementById('ac921c19f1aff');
	while(pre.hasChildNodes()) pre.removeChild( pre.firstChild );

	if(all_bed.length==0) return;
	var i=0;
	for(i=0;i < all_bed.length;i++)
	{
	if(!(all_bed[i].name in all_fasta)) continue;
	var seq=all_fasta[ all_bed[i].name ];
	var start=all_bed[i].start;
	var end=all_bed[i].end;
	if(start< 0 || start> end || end > seq.length) continue; // 0-based start, end excluded (as in BED)
	pre.appendChild(document.createTextNode(">"+all_bed[i].name+":"+start+"-"+end));
	pre.appendChild(document.createElement("br"));
	pre.appendChild(document.createTextNode(seq.substring(start,end)));
	pre.appendChild(document.createElement("br"));
	}

	}


	function readingData(evt)
	{
	}

	function endReadFasta(e)
	{

	if(e.target.result==null) return;
	var lines=e.target.result.split(/[\n\r]/);
	var dna="";
	var title="";
	var i=0;
	var line;
	all_fasta={};
	for(;;)
	{
	if(i==lines.length || (line=lines[i].replace(/^\s+|\s+$/g,""))[0]=='>')
	{
	if(dna.length > 0)
	{
	all_fasta[title]=dna;
	}
	if(i===lines.length) break;
	title=line.substring(1);
	dna="";
	}
	else
	{
	dna+=line;
	}
	++i;
	}
	findSubSequences();
	}

	function endReadBed(e)
	{
	if(e.target.result==null) return;

	all_bed=[];

	var lines=e.target.result.split(/[\n\r]/);
	var i=0;

	for(i=0;i < lines.length;i++)
	{
	if(lines[i].length==0) continue;
	var tokens=lines[i].split(/\t/,4);
	if(tokens.length < 3) continue;
	all_bed.push({"name":tokens[0],"start": parseInt(tokens[1]),"end": parseInt(tokens[2])});

	}

	findSubSequences();
	}

	function handleFileSelectBed(evt)
	{
	var files = evt.target.files; // FileList object
	if(files.length==0) return;

	for(var i=0;i < files.length;++i)
	{
	var reader = new FileReader();
	reader.onprogress=readingData;
	reader.onloadend=endReadBed;
	reader.readAsText(files[i]);
	}
	}

	function handleFileSelectFasta(evt)
	{

	var files = evt.target.files; // FileList object

	if(files.length==0) return;


	for(var i=0;i < files.length;++i)
	{
	var reader = new FileReader();
	reader.onprogress=readingData;
	reader.onloadend=endReadFasta;
	reader.readAsText(files[i]);
	}
	}

	document.getElementById('ZmlsZXMKfiles').addEventListener('change', handleFileSelectFasta, false);
	document.getElementById('e0503f587a8d79efiles').addEventListener('change', handleFileSelectBed, false);
	</script>

	</body>
	</html>

(file: biostars82788.html)

	gi|27592135	1	10
	gi|84131965	7	16
	gi|83671954	5	20

(file: intervals.bed)

	>gi|27592135
	GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACT
	TCACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCC
	TGGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
	TATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGGC
	AGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAG
	CTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCACATGGCTCTGTTTGATCCCAG
	AAGTGATGACTACTTAGTGGTAAAAACACATTTCCAGACACACAACTTCAGAAAATGAGTGCAAGCTTCA
	AGTCTGCCCTTTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTGGCTCAATTCT
	TGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTCTCACGCCCTCTGAATCAGTGCCTTGAGGTG
	CCTTCAGAAGGCTTGTGATGGTTAGNNNTNGCATTTTGGTT

	>gi|84131965
	TTTTCTGCGTCTCTTCTCCCCAAGCTGCTCAGTAAAACAAGGTTTGCCCCTTGAACTCACCGGTATCCTG
	TGCAGCTGCAGCCCAAGCCACGTGTGGCAAAGGGCAGGTGGCCTCCAAGGCGCCGTGCAAACCCAGAGCC
	CATGGGTCAAGGGTGTGGAGGGCAGCGCTGTGTGGCTGGTTGGGGCGGTCAATGTGGCCCCTCAGGGACA
	CCCCGCACCTGTGAGAGGGGGTGAGCCGGGAGTGGGGGGAGGCGGAGCGCACGGCGACCCACCTCGCGAC
	TGCCTGATGGACTGTGTTCTCTCCCCAGAGACTGATGGAGAGGCAGAGACGGAAGGCGGACATCGAGAAG
	GGGCTGCAGTTCATTCAGTGAGTGTGCGGTGGGGCGGGCCGGGCGGGTCCTCTGAGGCGGGCAAACTCCA
	CTCCCCACCCCGCCCCGCCCCGCGCATGCGCTCTCCTGCTCTCCAGGCTCCCGGGAGTGTTGGGGAGCGA
	CCAGGTGGGACACGCTAGAAGGGGGTTGGAGTGGCAGAGCCAGGGGTGCCGCTAACCCTGAGCCATGAAC
	CCACACGCCCAGGGACTCCCAGCTCCTGGTTTCCCTTTGGTCTTGTCCCATATGTCATTTTGCTCTCACC
	CTCTACGTTGGCTCCTACACTCCCTGTTAAT

	>gi|83671954
	GCTGGTACCGGTCCGGAATTCCCGGGATATCGTCGACCCACGCGTCCGCTGGGAACCTCGCACCATGCTG
	GCCTCGTGTCTCCTCGCCGCGCTGGTGCTCTCGCTGGTGGCGGACGCTTCGCACTTCGACCGCACCGTGT
	CCCACAGTCGCATCCGCGGCCGGACACAAGGGGCCAACGTGTGCGCCGTGCAGAAGGTGGTGGACACGGA
	GAAGAAATACTACTCCAACTGCAAGCAGTGGTACCAGAAGCAGATCTGCGGAAAGAAAACGATTGTGACC
	TATGAGTGCTGTCCTGGCTACGTCAGAGTGGACGGCGTCGATGGCTGTTCGGCAATCACCCCCATCGTGA
	ACGTTTACGAGACGCTGGAACCCATCGAGGCCACTCTCACGCAGAAATACTCGAACCAGTCGGGACTGCG
	GCCTGAGATCGAGGGGCCTGGCGCTCACACCATGTTCGCTCCCAGCAATGAGGCGTGGTTAGAGCTACCT
	AAGGAAGTTCGCGACTCGCTCACGACCAACGTGAACATCGAGCTGCTCAACGCTCTGCGCTACCACATGG
	TGGACCGGCGGCTGCTCACGAGCGACCTCAAGGACGGCACGGTGCTCACCTCCATGTACGAGAAGCAGAA
	GCTCTACGTCAACCACTACCCCAACGGGATCATCACCATGAACTGCGCCCGTCTCATCCGCCCCAACCAC
	CTGGCGACCAACGGCGTGGTGCACGTGATCGACCGCGTCGTCGTGCCCGTGTCAAACCAGATCGGGCGAT
	GTCATCAGCTACGACGAGGACATGGAGTCGATGCCGGGCGGGCAGTGGAGGCCTCCAGGACTGATGCCAT
	TTGCTTCAACTTCGGGAGGGGGCCCATCCACTCTCTTTCGGTGCCCCA

	>gi|72201761
	GCAATCAATGTGCCATTTACTGTGGTTATTTTTAATTATGGTTTTTGTAAAGACCCCGACTTCAGATATT
	ACTTGACTTCCTGAATGCGAAAAACTTCCTTGAGGACTCATGAAGGCGCTTGCTCAAAGTGCCACGTATC
	TGATGTGATCAGAAATGGACACTGATCATCGTGGGTTAATCTACTGAATGTAAAGAAGGGATTTAAATAT
	GAAGACTGTTACAGAACTAGATTGTACATAACATTAGAGCAAGATTTTGACTAGAGTTGCATGTTTTTGT
	AAATAACATTGTACAGCCAGGTTTAGTGGGAACGAGCTGGAGTTCTCTTTCCTCAGTCACACATGAAATA
	ACTTGGGGCTGGTCAACAGATGGGCACTGGCCGTTTCCTATATGCTGCTGTTCAAATGATGGCTTTGAGG
	ATTTTGTTTATATTTGGATCAAGTGCAAGTTTATGTACAACTTTGTACATAATCGTATCATTATTGTA

	>gi|72201725
	CCTCTCGCTATTGGTCATGACCCGAGTACTCTTGGGGGTGGCACTGAACAGGTTAATGCGTTGACTTACA
	CCAGTGTCGAGAGGTTTGGAGGAGTGGGGAAAGAAAGTGGAGGAGCGTGTTCTCGTTTGTGTGATCCAAG
	CTGAGCAATGGCAGAAGGAACGGTTCAGTTACCGCGTGACTGGAACTAACGCAAGTTCTCAAGTCCGTAA
	GTTATAGGAGCGGAATTAGGCCATTTGGCCCATCAAGTCTCCTCCACCATTCATTCATGGCTGACCTATC
	TCTCCCTCCTAACCCCATTCTCCTGCCTTCTCCCCATAACCACTGACACCCGTACTAATCAATCTGCCAA
	TCTCCGCCTT

	>gi|71192985
	GCAATCAATGTGCCATTTACTGTGGTTATTTTTAATTATGGTTTTTGTAAAGACCCCGACTTCAGATATT
	ACTTGACTTCCTGAATGCGAAAAACTTCCTTGAGGACTCATGAAGGCGCTTGCTCAAAGTGCCACGTATC
	TGATGTGATCAGAAATGGACACTGATCATCGTGGGTTAATCTACTGAATGTAAAGAAGGGATTTAAATAT
	GAAGACTGTTACAGAACTAGATTGTACATAACATTAGAGCAAGATTTTGACTAGAGTTGCATGTTTTTGT
	AAATAACATTGTACAGCCAGGTTTAGTGGGAACGAGCTGGAGTTCTCTTTCCTCAGTCACACATGAAATA
	ACTTGGGGCTGGTCAACAGATGGGCACTGGCCGTTTCCTATATGCTGCTGTTCAAATGATGGCTTTGAGG
	ATTTTGTTTATATTTGGATCAAGTGCAAGTTTATGTACAACTTTGTACATAATCGTATCATTA

	>gi|71192714
	CCTCTCGCTATTGGTCATGACCCGAGTACTCTTGGGGGTGGCACTGAACAGGTTAATGCGTTGACTTACA
	CCAGTGTCGAGAGGTTTGGAGGAGTGGGGAAAGAAAGTGGAGGAGCGTGTTCTCGTTTGTGTGATCCAAG
	CTGAGCAATGGCAGAAGGAACGGTTCAGTTACCGCGTGACTGGAACTAACGCAAGTTCTCAAGTCCGTAA
	GTTATAGGAGCGGAATTAGGCCATTTGGCCCATCAAGTCTCCTCCACCATTCATTCATGGCTGACCTATC
	TCTCCCTCCTAACCCCATTCTCCTGCCTTCTCCCCATAACCACTGACACCCGTACTAATCAATCTGCCAA
	TCTCCGCCTT

	>gi|63103499
	ATGTCACAGGGCAGTCTTAATCTGTGTGAAGGGGGCCTAAAGGCCCATTCATACCTCACGTAAAAGACGG
	ATACGTGTGGAGTGTTTGATACGTTCTAACCGTCGATTTCGTCCGTATTTTGACACGAAAATTGGGAGCT
	TACGATTACGGACGAAACGGAGCAATACTACCGTAAGAGGTGGGGGCGCTATTGAGTTTGTAGTACAAAA
	TGTCAACAAAATCACGAAGAAGATTAGAATTCTGGGCTTTAAACAATGGGGGTTTTGAGGAACGACTTTC
	GGAGATTGTCCGCAACTACCCACATTTATATGATGAGTCGTGTCCGGGGCACAGGGACAAACAAAAAGTT
	ATGAATAGCCTTCAGGAAATCGGGAGGCTCTGGCCATGACAGGGGATGTCGTAAAGTCCAAGTGGGCCGC
	AATTAGGGAGCGCCCAGCACTGCCACCTACAGGAAATATTGGTTATTGCGCACCTCAACGGACGGAGGTA
	TGAAGGAGTACGGACAAAAATACGGACGTGTAGGCCAAACGTAGGGTATGAATGGGCCTTAAGATGTGTT
	AAAGAACCGACAGCAGCCCCATCCGTGCTGCTTGATGGCACGTTGCTTCGGTCGTTACAGAAAAGCAAGG
	GGAGGCGCAC

	>gi|63000787
	ATGTCACAGGGCAGTCTTAATCTGTGTGAAGGGGGCCTAAAGGCCCATTCATACCTCACGTAAAAGACGG
	ATACGTGTGGAGTGTTTGATACGTTCTAACCGTCGATTTCGTCCGTATTTTGACACGAAAATTGGGAGCT
	TACGATTACGGACGAAACGGAGCAATACTACCGTAAGAGGTGGGGGCGCTATTGAGTTTGTAGTACAAAA
	TGTCAACAAAATCACGAAGAAGATTAGAATTCTGGGCTTTAAACAATGGGGGTTTTGAGGAACGACTTTC
	GGAGATTGTCCGCAACTACCCACATTTATATGATGAGTCGTGTCCGGGGCACAGGGACAAACAAAAAGTT
	ATGAATAGCCTTCAGGAAATCGGGAGGCTCTGGCCATGACAGGGGATGTCGTAAAGTCCAAGTGGGCCGC
	AATTAGGGAGCGCCCAGCACTGCCACCTACAGGAAATATTGGTTATTGCGCACCTCAACGGACGGAGGTA
	TGAAGGAGTACGGACAAAAATACGGACGTGTAGGCCAAACGTAGGGTATGAATGGGCCTTAAGATGTGTT
	AAAGAACCGACAGCAGCCCCATCCGTGCTGCTTGATGGCACGTTGCTTCGGTCGTTACAGAAAAGCAAGG
	GGAGGCGCA

	>gi|56844770
	CTACGAATGGCCTAAAGAAGCCGTGGAAATCCAGAAAGTGTAATTATTTGTGCCAATTAAGTCTTTGACT
	AGAATATTTGATCTCAAAGTCTGGTAACCCTAATCGCAAGTCCAAACATACTGCCTTCGGAAGTGATTAT
	TGCATTGCTTTAAAAAAACAGTACAATTACTGTTCCAAAGCAAGTCGCTCTCATTTATTAATTTCAAACC
	AACAAATGTTCTTCAGTGACAGGTGCACAGCAGTAATCCCGTGAGAGTTTATTGCATTTTTTCTTATCAT
	ATTATTTTTTATCTTGCAGGAATTATTTAGCGTTTCATTTTTTTAATTAATGTTTACTCATTTTAAACAA
	ATATTGCTTGCCATTTGAACAATTTGAATAGGTATGTTATCTCTGCATTACATTACTATTATGACAAAGA
	TCAAAAGACAGGTATTGAAAAGGCATTTAATTGTCTGGAGGCTCTGGGTTCATATTCACTGAACTGTATA
	ATCTTACACAAGACACATAGCTATGGCGTCATTAGCATTTTGCACAAAATAAGTTAAATTCCTTTTCTAA
	ACATGGTAACCATTGTCTCTGGAGTCATGTTTATAATGCCACACTTTTAGTTTGGATTTCTGTTTTTCTT
	GGGTCTACTAGTCTGCCTTTGGGAATAAGGAGTCTAATTTAGCACTGTAAATAGTGGATTGATGCCGGTC
	TCTCGGAATCTAAGCTAAAACTGTGCCCGTATCTAAA

	>gi|56844653
	CTACGAATGGCCTAAAGAAGCCGTGGAAATCCAGAAAGTGTAATTATTTGTGCCAATTAAGTCTTTGACT
	AGAATATTTGATCTCAAAGTCTGGTAACCCTAATCGCAAGTCCAAACATACTGCCTTCGGAAGTGATTAT
	TGCATTGCTTTAAAAAAACAGTACAATTACTGTTCCAAAGCAAGTCGCTCTCATTTATTAATTTCAAACC
	AACAAATGTTCTTCAGTGACAGGTGCACAGCAGTAATCCCGTGAGAGTTTATTGCATTTTTTCTTATCAT
	ATTATTTTTTATCTTGCAGGAATTATTTAGCGTTTCATTTTTTTAATTAATGTTTACTCATTTTAAACAA
	ATATTGCTTGCCATTTGAACAATTTGAATAGGTATGTTATCTCTGCATTACATTACTATTATGACAAAGA
	TCAAAAGACAGGTATTGAAAAGGCATTTAATTGTCTGGAGGCTCTGGGTTCATATTCACTGAACTGTATA
	ATCTTACACAAGACACATAGCTATGGCGTCATTAGCATTTTGCACAAAATAAGTTAAATTCCTTTTCTAA
	ACATGGTAACCATTGTCTCTGGAGTCATGTTTATAATGCCACACTTTTAGTTTGGATTTCTGTTTTTCTT
	GGGTCTACTAGTCTGCCTTTGGGATAAAGGAGTCTAAATTTAGCACTGTAATAGTGGATTGATGCCGGTC
	TCTCGGAATCTAAAGCTAAACTGTGCCCGTATCTA

(file: sequences.fasta)
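
To see which characters the page returns, the interval check in `findSubSequences` can be mirrored in a few lines of Python. This is only an illustrative sketch, not part of the original gist: `sub_sequence` is a hypothetical helper, and it assumes BED-style coordinates (0-based start, end excluded), which is what JavaScript's `substring(start, end)` gives. The sketch accepts an interval that ends exactly at the last base. The example uses the first interval from the BED file and the first bases of the first FASTA record.

```python
def sub_sequence(seq, start, end):
    """Return seq[start:end], or None when the interval does not fit the sequence."""
    if start < 0 or start > end or end > len(seq):
        return None
    return seq[start:end]


# first bases of gi|27592135 and the interval "1 10" from the BED file
print(sub_sequence("GGAAGGGCTGCCCCACC", 1, 10))  # prints GAAGGGCTG
```

Nine bases come back, not ten, because the end coordinate itself is excluded.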

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 11.8 years ago by Pierre Lindenbaum 166k

Thank you very much for your great help

ADD REPLY • link 11.8 years ago by prp291 ▴ 70

11.8 years ago

Devon Ryan 105k

The following will work in R, which you presumably already have installed if you're doing anything with bioinformatics. This assumes the fasta file is called "blah.fa" and the regions are stored in a file called "regions.txt". I should note that I changed the first column of the regions so that they match the fasta file (e.g., change AT1G01250 to AT1G01250.1), though you could do this in R or do the matching differently.

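library(seqinr)

# Read the protein sequences; names(f) holds the first word of each header (e.g. "AT1G01250.1")
f <- read.fasta("blah.fa", seqtype="AA")
# Columns: sequence ID, start, end (1-based, inclusive)
regs <- read.table("regions.txt", header=FALSE, stringsAsFactors=FALSE)

getAAseqs <- function(x) {
    idx <- which(names(f) == x[1])
    # apply() passes each row as character, so convert the coordinates back to integers
    paste(f[[idx]][as.integer(x[2]):as.integer(x[3])], collapse="")
}
subseqs <- apply(regs, 1, getAAseqs)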

It's not pretty, but it'll work.

ADD COMMENT • link 11.8 years ago by Devon Ryan 105k

11.8 years ago

dariober 15k

Hi- Let's try this one. It's python so it should work on Windows with any line terminator

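#!/usr/bin/env python

import sys
import re

FASTA= sys.argv[1]
BED= sys.argv[2]

# Read the whole FASTA file into a dict: sequence name -> sequence.
# Text mode already handles Windows and Unix line endings in Python 3
# (the old 'U' flag was removed in Python 3.11).
fasta= open(FASTA)
fasta_dict= {}
for line in fasta:
    line= line.strip()
    if line == '':
        continue
    if line.startswith('>'):
        seqname= line.lstrip('>')
        # Keep only the part of the name before the first dot
        seqname= re.sub(r'\..*', '', seqname)
        fasta_dict[seqname]= ''
    else:
        fasta_dict[seqname] += line
fasta.close()

bed= open(BED)
for line in bed:
    line= line.strip().split('\t')
    if len(line) < 3:
        continue  # skip blank or incomplete lines
    outname= line[0] + ':' + line[1] + '-' + line[2]
    print('>' + outname)
    # BED-style coordinates: 0-based start, end not included
    s= int(line[1])
    e= int(line[2])
    print(fasta_dict[line[0]][s:e])
bed.close()
sys.exit()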

Save it as getFasta.py and execute it as:

python getFasta.py in.fasta in.bed > out.fasta

Note that it strips everything after the dot from the names of the FASTA sequences, so that they match the names in the example coordinates file. The FASTA file is read into memory, which is not very clever, but it is quick if the BED file has millions of intervals to extract.
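
If memory is a concern, the roles can be swapped: keep the intervals in memory and stream the FASTA file one record at a time. The following is a rough sketch of that idea, not a tested replacement for the script above. `iter_fasta` and `extract_streaming` are hypothetical names, and the sketch keeps the same conventions (names cut at the first dot, 0-based start, end excluded).

```python
from collections import defaultdict


def iter_fasta(lines):
    """Yield (name, sequence) one record at a time; the name is cut at the first dot."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].split(".")[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def extract_streaming(fasta_lines, bed_lines):
    """Yield (header, subsequence) while holding only one FASTA record in memory."""
    intervals = defaultdict(list)
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 3:
            intervals[fields[0]].append((int(fields[1]), int(fields[2])))
    for name, seq in iter_fasta(fasta_lines):
        for start, end in intervals.get(name, []):
            yield "%s:%d-%d" % (name, start, end), seq[start:end]
```

Both arguments can be open file handles, for example `extract_streaming(open("in.fasta"), open("in.bed"))`. The trade-off is that the output follows the order of the FASTA file rather than the order of the BED file.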

ADD COMMENT • link 11.8 years ago by dariober 15k

I tried this solution also. It's working fine

ADD REPLY • link 11.8 years ago by prp291 ▴ 70

11.8 years ago

arnstrm ★ 1.9k

This is the dirty way to do it on the bash command line. First, convert the FASTA file to tabular form (with just the ID and the sequence):

cut -d " " -f 1 sequences.fa | tr -s "\n" "\t"| sed -s 's/>/\n/g' > sequences.tab

Then use the coordinates file to cut the desired portion of the sequence

while read id start end; do \
g=$(grep "$id" sequences.tab | cut -f 2 | cut -c $start-$end);\
echo ">$id";\
echo $g;\
done<coordinates.txt

This will output:

>AT1G01250
YHGVRKRRWGKWVSEIREPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPE
>AT1G03800
AAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFP

I know this is not the best method, but I feel it is easier doing it with bash.
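
The output above treats the coordinates as 1-based and inclusive, because that is how `cut -c` counts characters. As a small check in Python (a sketch only; `cut_1based` is a hypothetical helper), the same 58 residues come out of the first sequence when the start is shifted down by one before slicing, whereas a plain slice `seq[45:102]` would start one residue later.

```python
def cut_1based(seq, start, end):
    """Return residues start..end, counting from 1 and including both ends."""
    return seq[start - 1:end]


# AT1G01250.1 from the question
seq = ("MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEIREPRKKSRIW"
       "LGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQVAAAKAANAVKIIKMGDDDVAGI"
       "DDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVDGDLYQYQFMACL")

print(cut_1based(seq, 45, 102))
print(len(cut_1based(seq, 45, 102)))  # 58 residues = 102 - 45 + 1
```

Which convention is right depends on how the coordinates file was produced, so it is worth checking one interval by hand before running the whole file.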

ADD COMMENT • link 11.8 years ago by arnstrm ★ 1.9k

I wish I could give this a vote for effort, but there is nothing "easier" in your answer. The OP is using Windows, so you're off-base, and you're taking advantage of features (newlines where OP will have carriage returns, no whitespace in the sequence names) that are not guaranteed. In addition, you'll lose all line wrapping from the original file. You mention that you know this is not the best method, so I hope you can take this as constructive criticism and don't get too discouraged :)

ADD REPLY • link 11.8 years ago by Matt Shirley 10k

Sorry, I missed the "windows" part! But yes, this solution works (if you're in a UNIX environment AND using the above sequences). I called it "easier" because it only takes two one-liners.

ADD REPLY • link 11.8 years ago by arnstrm ★ 1.9k

11.8 years ago

always_learning ★ 1.2k

Here is a Perl solution for this; it can be run on Windows easily.

#user/bin/perl 

open (IN1,"$ARGV[0]");
open (IN2, "$ARGV[1]");
while(<IN1>){
if( /(^>.[\d|\w].*\.).*(=.*$i)/){
$fir = $1;
$last = $2;
$fir =~ s/[\>|\.]//g;
$last =~ s/[=|\d|\s]//g;
$val{$fir} = $last;
             }
                                   }
 while(<IN2>){
 @bed = split("\t", $_);
  $string = substr ($val{$bed[0]}, $bed[1], $bed[2]-$bed[1]);
   print "$string\n";
                }

Run this code as perl code-name.pl test.fasta bed.txt.

ADD COMMENT • link 11.8 years ago by always_learning ★ 1.2k

This code won't even compile and it needs a bit of effort to be a working script. There are many problems, starting with the first line. I recommend testing things before posting, and coding in a modern style so your script will work on other computers (and with recent versions of Perl). If you use modern pragmas (lexical variables, enable strictures and warnings, perform tests on arguments, etc.) you will save a lot of time debugging.

ADD REPLY • link 11.8 years ago by SES 8.6k

Yes!! Thanks for the recommendation. But this code does work on a Unix machine :) :) I didn't write it in the most optimal way and didn't add warnings and so on, but the aim was only to give workable code. Thanks again!!

ADD REPLY • link 11.8 years ago by always_learning ★ 1.2k

Interesting that you say it works. What version of Perl are you using, out of curiosity? I don't see how this could even compile (unless your Perl is fairly old).

ADD REPLY • link 11.8 years ago by SES 8.6k

It's Perl v5.10.1 :)

ADD REPLY • link 11.8 years ago by always_learning ★ 1.2k

Thanks for the response. With 5.14+ strictures and warnings are enabled by default so this type of code will just blow up with a bunch of warnings. With 5.10 it will compile, you are correct. Though, this is a bad thing because it will fail silently, which is something to avoid.

ADD REPLY • link 11.8 years ago by SES 8.6k

Yes, I graciously accept that I should add warnings and exception handling, but I didn't because I only wanted to give workable code :) Thanks again.

ADD REPLY • link 11.8 years ago by always_learning ★ 1.2k

9.8 years ago

caritogandini ▴ 40

Hi, I've noticed that the post is old, but anyway, here is a solution using the BioPerl module Bio::DB::Fasta. It needs a FASTA file with all your sequences and a text file with the exact ID, start and stop separated by tabs.

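#!/usr/bin/perl -w

use strict;
use Bio::DB::Fasta;

#Usage: extract_substring.pl file.fasta coordinates.txt (where: id, start, stop) > out.fasta

my $fasta = $ARGV[0];
my $query = $ARGV[1];
my ($id,$start,$stop);

# Index the FASTA file so that subsequences can be fetched by ID
my $db = Bio::DB::Fasta -> new($fasta);

open (IN1, '<', $query) or die "Cannot open $query: $!";
  while (<IN1>) {
    chomp;                # drop the newline so it does not end up in $stop
    next unless /\S/;     # skip blank lines
    ($id,$start,$stop) = split "\t";
    # subseq takes 1-based, inclusive coordinates
    my $subseq = $db->subseq($id,$start,$stop);
    print ">", $id, "_", $start, "_", $stop, "\n";
    print $subseq, "\n";
  }
close IN1;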

ADD COMMENT • link 9.8 years ago by caritogandini ▴ 40

9.8 years ago

Matt Shirley 10k

I am a Windows user.

You might try using pyfaidx: it works well on Windows.

ADD COMMENT • link 9.8 years ago by Matt Shirley 10k

9.3 years ago

Stephane Plaisance ▴ 460

1. Install samtools.

2. Create the index for your genome of interest (if it is a genome, of course):

samtools faidx genome.fa

3. Use a command of this type to extract the region:

samtools faidx genome.fa chr2:10000-12000

It goes much faster than Perl, as it uses an index to retrieve data from the FASTA file.

;-)
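
The speed comes from the index: `samtools faidx` writes a small `.fai` file recording, for each sequence, its length, the byte offset of its first base, and the number of bases and bytes per line, so a region can be read with a single seek instead of scanning the whole file. The Python sketch below illustrates that idea only. It is not samtools' code, `build_index` and `fetch` are hypothetical names, and it assumes every sequence line in a record (except possibly the last) has the same length. Coordinates are 1-based and inclusive, as in the region string above.

```python
def build_index(path):
    """Return {name: (length, offset, linebases, linewidth)}, like the columns of a .fai file."""
    index = {}
    with open(path, "rb") as handle:
        name = None
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # the first base of this record starts right after the header line
                index[name] = [0, handle.tell(), 0, 0]
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2] = bases      # bases per line
                    entry[3] = len(line)  # bytes per line, including the line ending
                entry[0] += bases
    return {key: tuple(value) for key, value in index.items()}


def byte_position(entry, pos0):
    """Byte offset in the file of the 0-based base position pos0."""
    _, offset, linebases, linewidth = entry
    return offset + (pos0 // linebases) * linewidth + pos0 % linebases


def fetch(path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) using a single seek."""
    entry = index[name]
    if not 1 <= start <= end <= entry[0]:
        raise ValueError("region outside sequence")
    first = byte_position(entry, start - 1)
    last = byte_position(entry, end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first + 1)
    # the chunk may span several lines, so drop the line endings
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With the index built once, `fetch("genome.fa", index, "chr2", 10000, 12000)` reads only the bytes that cover the region, however large the file is.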

ADD COMMENT • link 9.3 years ago by Stephane Plaisance ▴ 460

Note that the OP said "I am a Windows user". AFAIK samtools does not play nicely on Windows. However, pyfaidx works great.

pip install pyfaidx
faidx genome.fa --bed regions.bed

This seems to be what the OP wanted.
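
One detail to watch when preparing `regions.bed`: BED intervals conventionally use a 0-based start and an end that is not included, while the coordinates in the question read as 1-based positions. Assuming that reading is right, and assuming the IDs need the `.1` suffix to match the FASTA names (both are assumptions, not something the thread settles), a conversion could look like the following Python sketch. `to_bed_line` is a hypothetical helper.

```python
def to_bed_line(line, suffix=".1"):
    """Turn 'ID start end' (1-based, inclusive) into a tab-separated BED line."""
    name, start, end = line.split()[:3]
    # BED: the start moves down by one, the end stays the same
    return "%s%s\t%d\t%d" % (name, suffix, int(start) - 1, int(end))


for row in ["AT1G01250   45  102", "AT1G03800   65  109"]:
    print(to_bed_line(row))
```

The printed lines can be saved as `regions.bed` and passed to the `faidx` command shown above.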

ADD REPLY • link 9.3 years ago by Matt Shirley 10k