Hi,
We have generated a set of RNA-seq samples from blood tissue. These are human paired-end samples with 100M reads and with read length of 150bp. I want to know if it's possible to reduce the number of reads from 100M to 50M in the sample(s) from the existing RNA-Seq fastq files generated the facility. Please let me know most appropriate tool and methods to perform.
We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.
Thank you, Toufiq
Updated question:
Thank you all for the suggestions and advise. Let me rephrase the question once again.
We have generated a set of RNA-seq samples from blood tissue (non globin depleted). These are human paired-end samples with read length of 150bp with 100 million read depth. After the alignment against hg19 genome, the alignment range is between 84-91% for different samples and the total corresponding reads for each sample range between 19 to 28 million reads. Quantification process provided a rough estimate of read counts corresponding to the globin genes (HBA1, HBA2, HBB) between 30 - 62% which would later filtered before the normalization process. In this scenario, I would like to know if we sequence the samples to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth. We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.
As per the suggestions, sub-sampling total reads from each sample provided similar results. Is total reads obtained after sequencing each sample (fastq file) and read depth are similar concepts?
From seqtk (also proposed below by genomax):
Thank you. What is -s100 parameter? Is it the read length?
My data files are compressed in the fastq.gz format. Is this supported by seqtk? Are there any commands and arguments?
Thank you all for the suggestions and advise. Let me rephrase the question once again.
We have generated a set of RNA-seq samples from blood tissue (non globin depleted). These are human paired-end samples with read length of 150bp with 100 million read depth. After the alignment against hg19 genome, the alignment range is between 84-91% for different samples and the total corresponding reads for each sample range between 19 to 28 million reads. Quantification process provided a rough estimate of read counts corresponding to the globin genes (HBA1, HBA2, HBB) between 30 - 62% which would later filtered before the normalization process. In this scenario, I would like to know if we sequence the samples to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth. We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.
As per the suggestions, sub-sampling total reads from each sample provided similar results. Is total reads obtained after sequencing each sample (fastq file) and read depth are similar concepts?
and
Does that mean your total dataset had 100 M reads and there are ~5 samples with reads ranging from 19-28M? Or you actually have 100M reads for each sample. Otherwise something does not add up. Note: Word
depth
is not being used appropriately here. Clarification of context dependent explanation ofdepth
can be found in this thread: What Is The Sequencing 'Depth' ?Yes if you are thinking about % of globin reads. Because you would be sampling randomly the percentage of globin reads in 50M set should remain at the same relative level as 100M set.
Does that mean your total dataset had 100 M reads and there are ~5 samples with reads ranging from 19-28M? There are 13 samples. Each sample was sequenced for 100 million read depth. However, the fastq files generated shows varying total number of reads for each sample as mentioned below for instance,
Sample 1: 20M Sample 2: 28 M Sample 3: 25 M Sample 4: 19M . . . . . . . . . Sample 13: 22M
Or you actually have 100M reads for each sample. I do not have 100 million reads for each sample, however, each sample was sequenced for 100 million read depth by the facility and the fastq files for each sample has varying total reads. Does total reads obtained after sequencing each sample (fastq file) and read depth in general are similar or different concepts?
Yes if you are thinking about % of globin reads. Because you would be sampling randomly the percentage of globin reads in 50M set should remain at the same relative level as 100M set.
Generally, if we sequence the same set of sample again to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth.
Would you mind taking a look at the "What is the sequencing depth" post from biostars that I linked above? That should help clarify
depth
and the context that is needed when you refer todepth
. Are you thinking in these terms: A: What Is The Sequencing 'Depth' ?This does not have a logical meaning. What are you basing the calculation on?
Thank you. It was helpful.
Does sub-sampling of a particular unaligned fastq file and aligned sam/bam file produce similar outputs?
Probably not. Even if you have unaligned reads in your aligned files. Whole idea of
random
sampling is to be just that. If you sub-sampled the original data and the aligned file with a specific seed you would get the same output but they would likely not be the same across the two file types.If you need to do any sub-sampling you should probably do it at the fastq read level.
Thank you. In seqtk, what is -s100 parameter? Is it the read length?
My data files are compressed in the fastq.gz format. Is this supported by seqtk? Are there any commands and arguments?
It is the
seed
parameter. If you set it to the same number you would be guaranteed to get the same output for multiple runs.No additional arguments needed for gziped files. seqtk handles both FASTA and FASTQ files which can be optionally compressed by gzip.