Question

How To Use Coordinates In Order To Extract Sequences In Fasta File?

3

Entering edit mode

11.7 years ago

srpatel144 ▴ 30

I have obtained a .coords file from MUMmer after aligning my assembly to the reference genome.

I have been asked to write a script that outputs 2 fasta files,

1 which will contain the sequences that were aligned to the reference and 1 which will contain the sequences that were not aligned to the sequence.

I'm not sure how to go about this. How am I supposed to use the coordinates for the sequences that aligned in order to do this?

Also, won't there be parts of sequences that don't align? What/how do I go about collecting those pieces and putting it in the 2nd fasta file?

genome alignment fasta • 26k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 11.7 years ago by srpatel144 ▴ 30

0

Entering edit mode

Error: cannot construct subsequence with negative offset or length < 1

Why do I get this error? Can somebody help me?

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by syxbestmayer ▴ 20

0

Entering edit mode

Probably because your end < start, but what are you trying to do?

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by Zaag ▴ 870

0

Entering edit mode

I have found my error, it is my question of format. Thank you!

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by syxbestmayer ▴ 20

Ram · Answer 1 · 2013-09-13

5

Entering edit mode

11.7 years ago

Aaronquinlan 12k

There is also bedtools getfasta. It uses Erik's code under the hood.

$ cat test.fa
>chr1
AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

$ cat test.bed
chr1 5 10

$ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

$ cat test.fa.out
>chr1:5-10
AAACC

Docs: http://bedtools.readthedocs.org/en/latest/content/tools/getfasta.html

And it is wrapped in pybedtools as well: http://pythonhosted.org/pybedtools/autodocs/pybedtools.BedTool.sequence.html?highlight=fasta

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 11.7 years ago by Aaronquinlan 12k

1

Entering edit mode

I wish we could once and for all decide between 0 and 1-based numbering and then whatever we choose, use that universally. Here, I would have said that 5-10 is AAAACC because there's no zeroth nucleotide. I would say that most biologists would have thought the same..

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 9.6 years ago by 5heikki 11k

0

Entering edit mode

The fasta header coordinates are definitely confusing, but here 0-based makes sense, as BED files are using 0-based half-open coordinates. "Use that universally" won't ever be practical because there are too many standards already in use.

ADD REPLY • link 9.6 years ago by Matt Shirley 10k

Ram · Answer 2 · 2013-09-12

I think samtools faidx feature exactly does what you need. If you give a reference genome and coordinates as input it will extract the sequence. Check samtools See below:

faidx samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format.

You will have to first create the fasta indexes of the reference genome fasta file and then use this command.

score 2 · Answer 3 · 2013-09-13

2

Entering edit mode

11.7 years ago

Matt Shirley 10k

This is a shameless plug, but I just uploaded a pure python implementation of faidx to GitHub. I'm calling it pyfaidx. It will also work for you. You might also check out:

FastaHack
brentp's fastahack-python, which inspired my implementation

Both of which require compiled C code, but aren't actually much faster than the pure python version.

ADD COMMENT • link 11.7 years ago by Matt Shirley 10k

0

Entering edit mode

nice! I have a Fai pure-python one here: https://gist.github.com/brentp/1842507 that I never polished up.

ADD REPLY • link 11.7 years ago by brentp 24k