how to retrieve the nucleotide/base in a certain position using any programming language like R
3
2
Entering edit mode
8.9 years ago
xrao ▴ 40

Hello,

I have a long list of genome positions like below, and I would like to extract the corresponding nucleotide/base (A, T, G, or C) in those positions using R or bioconductor or other tools. I know that using UCSC genome browser can do that, but I cannot do that individually for each position. If you have a good way to achieve the goal, please let me know. Many thanks!

chr start
1  114242392
1  114242
2  7485484
..
..
..
..
..
..
..

Thousands of lines like the above

Thanks

gene genome • 7.7k views
ADD COMMENT
0
Entering edit mode

very helpful. Thank you both so much!

ADD REPLY
4
Entering edit mode
8.9 years ago
komal.rathi ★ 4.1k

I think there is a shorter way of doing this but I use the below:

library(Rsamtools)
library(BSgenome)

# read data
dat <- read.table(text = 'chr start
                  chr1  114242
                  chr1  114242
                  chr2  7485484', header=T,stringsAsFactors=F)

# if you have a genome fasta file you can import it using FaFile
# the format of chromosomes should match that in the fasta file
# i.e. 1, 2, 3 in both fasta and dat or chr1, chr2, chr3 in both
fasta_file <- FaFile(file='hg19.fasta')
gr1 <- GRanges(dat$chr,IRanges(start=as.numeric(dat$start), end=as.numeric(dat$start)))
refbase <- getSeq(fasta_file, gr1)
refbase <- as.data.frame(refbase)$x
dat$REF <- refbase
ADD COMMENT
3
Entering edit mode
8.9 years ago

Something like:

samtools faidx hg19.fasta "1:20000-20000"
samtools faidx hg19.fasta `cat test.txt | grep -v "^chr" | awk '{ print $1":"$2"-"$2 }'`

I don't know if it works with stdin. You can explore more.

ADD COMMENT
1
Entering edit mode
8.9 years ago
h.mon 35k

See GenomicRanges. Its documentation is really helpful.

ADD COMMENT

Login before adding your answer.

Traffic: 1974 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6