Extract DNA sequence based on a single coordinate
1
1
Entering edit mode
8.3 years ago
ThePresident ▴ 190

Hello,

I know many have asked about extracting/fetching sequences from a range of coordinates. However, I would like to extract 10bp on the left and the right side from a specific position.

Let's say I have a genomic position at 50000. I would like to extract 10bp on each side, so from 49990-50010. Any suggestions how to do it?

Thank you,

TP

extract seq R python • 4.0k views
ADD COMMENT
0
Entering edit mode

Do you need this just once or for multiple positions?

ADD REPLY
0
Entering edit mode

Yup, multiple positions (coming from a screen)

ADD REPLY
3
Entering edit mode
8.3 years ago
Brice Sarver ★ 3.8k

I'm assuming you mean from a FASTA file. The easiest way to do this is to create a BED based on what you need (remember: 0-based indexing!).

Once you're set, you'll want to use bedtools getfasta.

This is also super quick to do for multiple files or multiple sequence alignments using Biostrings in R. Something like:

library(Biostrings)
a <- readDNAStringSet("your_file.fa")
b <- subseq(a, start=start, stop=stop)
writeXStringSet(b, file="new.fa")
ADD COMMENT
0
Entering edit mode

Update: I did it with bedtools. For those who might be asking the same question, here's my procedure:

  • I manually created a minimal BED file (tab-delimited txt file) with genome name (matching fasta file header) and genomic range I want to extract
  • Run the following: bedtools getfasta -fo BED_output.txt -tab -fi REF.fasta -bed BED_input.txt
ADD REPLY
1
Entering edit mode

Although this is an easier method, one thing to take care is the co-ordinates. If you have given co-ordinates as 1 to 10 for extraction, your final sequence will have 9 nucleotides only. It will miss the first one (the same has been also shown in their example). It took me some time to figure this out and had to do complete reanalysis.

ADD REPLY
0
Entering edit mode

This is not a bug. bed files have 0-based half open intervals. This means, we start counting our position with 0 (instead of 1) and exclude the upper limit of the interval.

See also: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems

fin swimmer

ADD REPLY

Login before adding your answer.

Traffic: 1149 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6