Hi, I'm struggling with something basic, so please help me out. I want to count the lines of fastq.gz files from RNA-seq results in RStudio. If possible, please let me know which function to use.
If you are limited by memory this should work:
library(ShortRead)
file <- 'your_file.fastq.gz'
## set your stream chunk size - raise n and readerBlockSize if you have more memory, lower them if you have less
f <- FastqStreamer(file, n=100, readerBlockSize=1000)
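## initialize the read counter
totalLength <- 0
## each call to yield() returns the next chunk of up to n reads; an empty chunk ends the loop
while (length(fq <- yield(f)) ) {
  totalLength <- totalLength + length(fq)
}
close(f)
## totalLength is the number of reads; assuming the usual four lines per FASTQ record, lines = 4 * totalLength
print(totalLength)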
I don't know if this could be considered a solution "in R", since it effectively relies on system commands:
fastq <- 'reads.fastq.gz'
n_lines <- as.integer(system(sprintf('gzip -cd %s | wc -l', fastq), intern= TRUE))
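Since this approach depends on external programs, it may help to check that they are on the PATH before calling them. The following is only a sketch: `reads.fastq.gz` is the same placeholder path as above, and the fallback from `gzip -cd` to `gunzip -c` is an assumption about what might be installed, not something R guarantees. `shQuote()` is added so that a path containing spaces does not break the command.

```r
fastq <- 'reads.fastq.gz'  # placeholder path, as in the answer above

# Sys.which() returns "" for any program that is not found on the PATH
tools <- Sys.which(c("gzip", "gunzip", "wc"))
have_tools <- nzchar(tools[["wc"]]) &&
  (nzchar(tools[["gzip"]]) || nzchar(tools[["gunzip"]]))

# Prefer gzip if present, otherwise fall back to gunzip;
# shQuote() protects paths with spaces or shell metacharacters
decompress <- if (nzchar(tools[["gzip"]])) "gzip -cd" else "gunzip -c"
cmd <- sprintf("%s %s | wc -l", decompress, shQuote(fastq))

if (have_tools && file.exists(fastq)) {
  # as.numeric() avoids integer overflow on very large line counts
  n_lines <- as.numeric(system(cmd, intern = TRUE))
  print(n_lines)
} else {
  message("Skipping: decompressor, wc, or input file not found")
}
```

If the message is printed, the system-command route is probably not usable on that machine as written.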
library(Biostrings)
fq <- readDNAStringSet('your_file.fastq.gz',format='FASTQ')
length(fq)
Try this to decrease memory costs:
library(ShortRead)
file <- 'your_file.fastq.gz'
## set your stream chunk size - raise n if you have more memory, lower it if you have less
f <- FastqStreamer(file, n=100)
## initialize length
totalLength <- 0
while (length(fq <- yield(f)) ) {
  totalLength <- totalLength + length(fq)
}
close(f)
print(totalLength)
You can use the readFastq function from the microseq package. It handles gzipped FASTQ files and returns a tibble, where the number of rows is the number of reads in the FASTQ file. If you need to count lines instead:
(n * k) + n
where n is the number of rows and k is the number of columns (readFastq returns a tibble with 3 columns).
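As a worked example of that formula, here is a small sketch using a hand-made data frame as a stand-in for the tibble. The column names `Header`, `Sequence` and `Quality` are an assumption for illustration; only the count of three columns comes from the answer above. The extra `+ n` accounts for the `+` separator line of each record, so the result equals four lines per read.

```r
# Stand-in for the table returned by readFastq (hypothetical contents)
fq_tbl <- data.frame(
  Header   = c("read1", "read2"),
  Sequence = c("ACGT", "TTGA"),
  Quality  = c("IIII", "IIII")
)

n <- nrow(fq_tbl)  # number of reads
k <- ncol(fq_tbl)  # number of columns (3 here)

# Lines in the FASTQ file: k stored fields per read plus the "+" line
n_lines <- (n * k) + n
print(n_lines)
```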
Thank you @Hood. Could you please tell me how to get the tibble from that function? I tried the following (I'm not sure I did it right):
fdta <- readFastq(fq.file)
Error in fread(in.file, header = F, sep = "\t", data.table = F, quote = "") : Opened 25.67GB (27558532840 bytes) file ok but could not memory map it. This is a 64bit process. There is probably not enough contiguous virtual memory available.
It showed this error. Am I doing this right?
The advantage of this solution is that you're not actually reading the fastq file into R, which could make R choke if the file is very large.
Thank you @dariober and @Friederike. However, it seems 'gzip' doesn't work in R;
it shows this error...
Try gunzip -c instead:
n_lines <- as.integer(system(sprintf('gunzip -c %s | wc -l', fastq), intern= TRUE))
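If neither `gzip` nor `gunzip` can be called from R, one alternative is to stay in base R and read the compressed file in chunks through a `gzfile()` connection. This is a minimal sketch, not a tuned implementation: `count_lines_gz` and its `chunk_size` argument are hypothetical names introduced here, and the tiny example file is generated on the fly purely for demonstration. Only `chunk_size` lines are held in memory at a time, though it will likely be slower than the command-line route on large files.

```r
# Count lines in a gzipped text file by reading fixed-size chunks
count_lines_gz <- function(path, chunk_size = 1e6) {
  con <- gzfile(path, open = "r")
  on.exit(close(con))
  n_lines <- 0  # kept as a double to avoid integer overflow on huge files
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0) break  # end of file reached
    n_lines <- n_lines + length(chunk)
  }
  n_lines
}

# Small made-up FASTQ file (two reads, eight lines) to try the function
tmp <- tempfile(fileext = ".fastq.gz")
out <- gzfile(tmp, open = "w")
writeLines(c("@read1", "ACGT", "+", "IIII",
             "@read2", "TTGA", "+", "IIII"), out)
close(out)

n_lines <- count_lines_gz(tmp)
n_reads <- n_lines / 4  # assumes the usual four lines per record
print(n_lines)
print(n_reads)
```

For a real file, the call would be `count_lines_gz('reads.fastq.gz')`, with `chunk_size` lowered if memory is tight.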