Load the libraries required:

Question

coverage vector file formats in R

0

Entering edit mode

3.8 years ago

wrab425 ▴ 50

I did a whole series of alignments of bam files to my genome using tools in the following R libraries in Bioconductor

library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
library(limma)
library(GenomicAlignments)

I used the following commands to create coverage vectors:

Cov_1 <- coverage(alignGR_1)
smoothed the vectors and plotted them out.:
plot(smoothCov_1_a, col="red", type="l", ylab="Number of reads", xlab="bp from range start")

All worked well.

I exported the coverage vectors so that I could analyse them with respect to sequence features but when I look inside them I did not find numbers but unicode characters. Seems like a silly mistake but I am lost as to what it might be?

ChIP-Seq vectors file_formats coverage R • 2.5k views

ADD COMMENT • link updated 3.7 years ago by Dunois ★ 2.9k • written 3.8 years ago by wrab425 ▴ 50

1

Entering edit mode

What was the function call you invoked to export the data?

ADD REPLY • link 3.8 years ago by Dunois ★ 2.9k

0

Entering edit mode

Thanks Dunois; I established that I had been working with GenomicRanges Rle object and needed to convert it to a numerical vector which using as.numeric() and after that I simply used the write() command to recover it as a simple text vector outside R.

ADD REPLY • link 3.8 years ago by wrab425 ▴ 50

0

Entering edit mode

I don't think as.numeric() would have worked had you passed it the Rle object directly. Did you get a NAs introduced by coercion warning when you tried?

I think the right way would probably be as.numeric(decode(x)) or something to that effect (where x is the Rle object). Or potentially as.numeric(as.vector(x)). I think this would depend on what is contained in the object. If you could give us an indication of that, it'd be helpful.

I suggest reading through the help documentation for Rle objects (?GenomicRanges::Rle). Specifically the From Rle to other objects subsection under the section Coercion.

ADD REPLY • link 3.8 years ago by Dunois ★ 2.9k

0

Entering edit mode

Passing the Rle object to as.numeric() worked fine with no warnings or NAs introduced by coercion . I looked inside the Rle object and it was just a list of unicode characters. The output of as.numeric() is just a list of numbers. I am happy to send you anything you would like to see; please just let me know!

ADD REPLY • link 3.8 years ago by wrab425 ▴ 50

0

Entering edit mode

That sounds a bit odd. To the best of my knowledge, as.numeric() cannot handle Unicode characters. I can't really tell what's going on. Could you share some anonymized code and dummy data perhaps? Would be nice if you could cover everything including the smoothening function you mentioned in the OP.

ADD REPLY • link 3.8 years ago by Dunois ★ 2.9k

1

Entering edit mode

Load the libraries required:

library(GenomicRanges) library(rtracklayer) library(Rsamtools) library(limma) library(GenomicAlignments)

Create a BamFileList object containing details of the BAM files we wish to analyse

dataDir <- "directory_name" bamFiles <- dir(file.path(getwd(), dataDir), pattern="*filtered.sorted.bam$", full.name=T) names(bamFiles) <- gsub(".bam","",basename(bamFiles))

bfList <- BamFileList(bamFiles)

Extract the sam header for a single bam file and examine the structure of the list using the str() function

samHeader_1 <- scanBamHeader(path(bfList["File_name"])) str(samHeader_1,max.level=2)

samHeader_2 <- scanBamHeader(path(bfList["File_name"])) str(samHeader_2,max.level=2)

Explore the information provided in the sam header.

samHeader_1[[1]]$targets

samHeader_2[[1]]$targets

Extract information about chromosome 2 coordinates to set up filtering parameters using the ScanBamParam() function.

chrdat_1 <- samHeader_1[[1]]$targets["File_name"]

chrrange_1 <- GRanges(seqnames=names(chrdat_1),ranges=IRanges(1,chrdat_1)) param_1 <- ScanBamParam(which=chrrange_1)

chrdat_2 <- samHeader_2[[1]]$targets["File_name"]

chrrange_2 <- GRanges(seqnames=names(chrdat_2),ranges=IRanges(1,chrdat_2)) param_2 <- ScanBamParam(which=chrrange_2)

Having selected an area of interest, read in read data from a single BAM file using the parameters set using the readGAlignments() function. Convert this to a GenomicRanges object and inspect the data contained within. Calculate the average read length of the selected reads (in case some reads were trimmed).

alignDat_1 <- readGAlignments(path(bfList["File_name"]),param=param_1) alignGR_1 <- granges(alignDat_1) seqlevels(alignGR_1) <- "File_name"

alignDat_2 <- readGAlignments(path(bfList["File_name"]),param=param_2) alignGR_2 <- granges(alignDat_2) seqlevels(alignGR_2) <- "File_name"

Use the coverage() function to create coverage vectors.

strand(alignGR_1) Cov_1 <- coverage(alignGR_1)

strand(alignGR2) Cov_2 <- coverage(alignGR_2)

Extract coverage values for the region of interest.

coord <- c(123301:129401)

the fllowing are normalizations

Cov_1_a<- Map("/", Cov_1, 201035.1129) Cov_2_a<- Map("/", Cov_2 , 257977.1543)

subCov_1_a <- Cov_1_a$File_name[coord] subCov_2_a <- Cov_2_a$File_name[coord]

enter code here

ADD REPLY • link 3.8 years ago by wrab425 ▴ 50

0

Entering edit mode

Thank you for sharing the code with me. I think you just pasted the code above (instead of inside) the code markdown area denoted by a `` and \``` respectively: the place where it saysenter code here` in your post right now.

So I ran some "dummy" data through your code, and it seems to work fine as far as I can tell. The data I used is from here. The coverage() function worked. I did not get any Unicode characters in the vector.

As far as I understand, coverage() produces a RleList of Rle elements (so a list of run vectors; see here). The accessor functions runLength() and runValue() could be used with an Rle object to create a data.frame of the run vector like so:

df <- data.frame(lens = runLength(x), vals = runValue(x))

df
#   lens vals
# 1 7007    0
# 2    2    1

The data.frame could then be exported to file with an appropriate file name and all that jazz. I have a small function down here that automates this for every Rle element in a RleList produced by coverage():

expcovs <- function(x, outpath = NULL){

  if(is.null(outpath)){ outpath <- getwd() }

  locrles <- names(x)
  locrles

  for(i in 1:length(locrles)){

    cur <- x[[locrles[i]]]

    locdf <- data.frame(lens = cur@lengths, vals = cur@values, stringsAsFactors = FALSE)
    locdf$samp <- locrles[i]

    outfile <- paste0(outpath, "/", locrles[i], "_coveragevector.csv")
    cat("Writing out coverage vectors for ", locrles[i], " to ", outfile, "!!\n")
    write.table(locdf, outfile, sep = ",", quote = TRUE, row.names = FALSE)
    cat("Done!!\n")

  }

}

#Example function call.
#Cov_1 has only one element, so only one file written out.
expcovs(Cov_1, outpath = "/some/path")
# Writing out coverage vectors for  Test1-ready  to  /some/path/Test1-ready_coveragevector.csv !!
# Done!!

Is this what you were hoping to do?

ADD REPLY • link 3.8 years ago by Dunois ★ 2.9k

1

Entering edit mode

Thanks Dunois, I have been away writing my paper so have not had the chance to thank you. I ran the as.numeric and it came out as a file in which the numbers were clustered in groups of five. I pulled these apart using a PERL script and got what I needed but it was messy. I know need these vectors for GEO so I shall try what you suggest. I have many files to crack through and will let you know how it goes. William

ADD REPLY • link 3.7 years ago by wrab425 ▴ 50

0

Entering edit mode

No worries, and good luck. I hope it goes well!! Looking forward to hearing that it all worked out!!

ADD REPLY • link 3.7 years ago by Dunois ★ 2.9k

0

Entering edit mode

I am having trouble getting the formatting and hope that you can see the key lines. If you need bam file then just ask but they are huge.

ADD REPLY • link 3.8 years ago by wrab425 ▴ 50

0

Entering edit mode

I looked at https://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/GenomicRanges/html/GRanges-class.html and there was no mention of a need for decode(x) under the entry for coercion.

ADD REPLY • link 3.8 years ago by wrab425 ▴ 50

1

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
code_formatting