I need to read point mutations from the 1000Genome data set (total of 84,801,880 rows), and check for overlap with another set of chromosome ranges using the GenomicRanges Package in R.
To do so I usually run the following code:
> x0 = read.csv ("NameOfFile1.csv") #read data from the first file (1000Genome in this case)
> x1 = read.csv ("NameOfFile2.csv") #read data from the second file
> library(GenomicRanges)
> gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
> gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
> hits = findOverlaps(gr0, gr1)
> hits
However, the first file is too big, and it's not possible to convert it to CSV. I tried converting it to txt file, but still I cannot read it in R. I used the following command:
> x0 = read.table ("NameofFile1.txt",header=FALSE,sep=",",stringsAsFactors = FALSE,quote = "")
Error: cannot allocate vector of size 1000.0 Mb
Is there any other way to check for overlaps between the two files? Your help is highly appreciated!
There are better ways than R to do this as others have pointed out. For R solution, you may try
fread()
fromdata.table
package that is blazing fast for reading large files. It tries to guess the delimiter and header automatically. It will give you an object of classdata.table
, which is very similar todata.frame
though quirky sometimes. You may set the argumentdata.table = F
to get the usualdataframe
object after the read is complete.I tried fread( ), it's much faster than read.csv. Thanks for suggesting that. I would like to use it with the same code I wrote above to read my file, but first I need to add headers to a BED file. Do you have an idea how I can do this?
try
header=F
explicitly in fread, and then add the headers in your code.Use BEDOPS
bedops
orbedmap
to retrieve overlaps or associations between genomic intervals. It will scale to whole-genome scale datasets.This is very easy to use. However, it requires that both files have genomic intervals. In my case, one of the files is the 1000Genome (point mutations=chr# and position; not an interval), and the other file has genomic intervals. Is there a way around to use bedops in this case?
Use
awk
to preprocess the 1000Genomes data into intervals. For examples, if the first three columns are chromosome, position, and ID, then you can do something like this:Then use
pointMutations.bed
with your other intervals file withbedops
orbedmap
.Thanks! do u know how I can add a header to the file?
To which file? Headers get ignored (removed) when doing set operations, in any case. Is there a reason you need headers?
I used the command you suggested to preprocess the 1000Genomes data into intervals. So the file I have now has 3 columns, and I want to add the following header: (chr \t start \t stop) and then use the GenomicRanges package (use the code provided in my original question). This would be the perfect way to achieve what I want.
If your input is VCF, then you can skip
awk
and usevcf2bed
(also in the BEDOPS kit):Or if you don't want an intermediate file:
Or likewise for
bedmap
:The
<(vcf2bed < pointMutations.vcf)
part is what is called a process substitution. It creates a stream of BED-formatted intervals on the fly for consumption bybedops
orbedmap
, as if it was a regular file.Do you have a 32 bit machine or a 64 bit machine?
I have a 64 bit machine
Have a look at all the
import
commands of the rtracklayer package.