How To Extract Specific Coordinates From Multifasta File
5
6
Entering edit mode
13.7 years ago
Florianino ▴ 70

Hi, I have a genome file in multifasta format

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT

...

and I would like to extract specific coordinates from specific sequences. Coordinates are stored in a three columns file table

chr1 2 4
chr2 4 6
chr2 3 6
...

Ideally the output would compile the extracted sequences in a single file indicating the chromosome and coordinates in the header:

>chr1_2_4
TAT
>chr2_4_6
CGT
...

Could anyone help me with this?

Thanks a lot in advance Florianino

extraction fasta coordinates • 12k views
ADD COMMENT
8
Entering edit mode
13.7 years ago

Here is my solution. Please have the seqinr package installed and don't forget to chmod +x

#!/usr/bin/env Rscript

library(seqinr)

seqs <- read.fasta("a001.fasta")
locs <- read.delim("a001.coords", header=F, sep=" ") # Tabs? Put sep="\t"
outf <- file("a001-results.fasta", 'w')

doitall <- function(x) {
  seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- seqs[[seq_id]]
  seqv   <- seq[start:stop]
  header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
  cat(file=outf, header)
  cat(file=outf, toupper(paste(sep="", collapse="", seqv)), "\n")
}
apply(locs, 1, doitall)
close(outf)
==> a001.coords <==

chr1 2 4
chr2 4 6
chr2 3 6
==> a001.fasta <==

>chr1
ATATATACCGA

>chr2
TGGCGTATTAT
==> a001-results.fasta <==

>chr1_2_4
TAT 
>chr2_4_6
CGT 
>chr2_3_6
GCGT
ADD COMMENT
0
Entering edit mode

Hey @Aleksandr, how should the fasta headers should be? Exactly what the 'a001.coords' col1 says, right?

ADD REPLY
6
Entering edit mode
13.7 years ago

If you want to work with a plain text FASTA file, the fastaFromBed program in BEDTools will do this for you. The two inputs are your genome file in FASTA format and the coordinates in BED format. I would suggest using the unreleased version in the repository, as it uses Erik Garrison's nice FastaHack library for speed. If you already heave a .fai file from samtools, it will use that. Otherwise, it will create an index the first time, and reuse it thereafter.

fastaFromBed -fi in.fasta -bed regions.bed -fo out.regions.fa
ADD COMMENT
2
Entering edit mode
ADD COMMENT
1
Entering edit mode
13.7 years ago
Florianino ▴ 30

Hi thanks a lot guys, I'm very new in the field, I really appreciate your help, bla bla bla

I have installed bedtool and tried fastafromBED but it looks like when I ask for positions 1 to 25, it gives me 2 to 25 instead in the output. How come?

For the other solution, I had to install R. I think there may a problem with the script because I use tab delimited file for coords and there are only three fields (BED format).

Cheers

Florianino

ADD COMMENT
1
Entering edit mode

Hi Florianino. Welcome to Biostar. I'm glad you installed R - it's the front-line tool in Bioinformatics. For tab delimited cords files, simply change sep=" " to sep="\t" on the read.delim line.

ADD REPLY
0
Entering edit mode

BED format uses zero-based, half-open coordinates, so the first 25 bases of a sequence are in the range 0-25 (those bases being numbered 0 to 24).

ADD REPLY
0
Entering edit mode

This would be better posed as a new question.

ADD REPLY
0
Entering edit mode

ok thanks, well I guess that's obvious for everyone.. BEDtools looks very handy! Florianino

ADD REPLY
0
Entering edit mode

@Florianino - I don't think this is obvious at all, see the following question inspired by your post: What Are The Advantages/Disadvantages Of One-Based Vs. Zero-Based Genome Coordinate Systems

ADD REPLY
0
Entering edit mode
13.7 years ago
Florianino ▴ 30

Ok, I'll post the BED issue right now as new question. Thanks Aleksandr, i'll try that and let you know. Thank you every one. Things are much easier when people are connected Cheers

ADD COMMENT

Login before adding your answer.

Traffic: 1905 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6