R: How to loop to extract all sequences within specific positions?
0
0
Entering edit mode
7.9 years ago
francois ▴ 80

In R Studio: I want to extract all sequences from a chromosome within -2000 and +500 of the TSS.

I have already computed the start/end positions of all the sequences I want (let's only consider the plus strand for now):

promoters_plus <- data.frame (chr = 'chr21', prom_start = plus$start - 2000, prom_end = plus$start + 500, strand = '+', id = plus$id)

It am able to extract specific sequences manually like this (for example with the 5th row):

start = promoters_plus$prom_start [5]
stop = promoters_plus$prom_end [5]
subseq (chr21, start, stop)

But I do not know how to correctly loop to get all of them (ideally in a dataframe)

I have tried naively doing this:

for (i in 1:nrow (promoters_plus)) {
  start = promoters_plus$prom_start[i]
 stop = promoters_plus$prom_end[i]
  seq[i] = subseq (chr21, start, stop)
}

But it returns this error:

Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) : 
  subscript contains NAs or out-of-bounds indices

Can you help?

R sequence • 3.2k views
ADD COMMENT
0
Entering edit mode

What is chr21 ? Is it a GRanges object ?

ADD REPLY
0
Entering edit mode

No, a DNAString object (the sequence of chromsome 21)

ADD REPLY
0
Entering edit mode

This would be so easy using bedtools getfasta, so if you don't absolutely want to use R for it...

ADD REPLY
0
Entering edit mode

Hi,

I'm facing a similar issue. Were you able to resolve the error?

ADD REPLY
0
Entering edit mode

You could use samtools faidx-indexed FASTA files with a BED-to-FASTA script. For an example, see: A: Searching promoter consensus in a gene

ADD REPLY

Login before adding your answer.

Traffic: 2437 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6