Reduce number of reads from 100M to 50M in the sample from the existing RNA-Seq fastq files
4
0
Entering edit mode
5.8 years ago

Hi,

We have generated a set of RNA-seq samples from blood tissue. These are human paired-end samples with 100M reads and with read length of 150bp. I want to know if it's possible to reduce the number of reads from 100M to 50M in the sample(s) from the existing RNA-Seq fastq files generated the facility. Please let me know most appropriate tool and methods to perform.

We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.

Thank you, Toufiq


Updated question:

Thank you all for the suggestions and advise. Let me rephrase the question once again.

We have generated a set of RNA-seq samples from blood tissue (non globin depleted). These are human paired-end samples with read length of 150bp with 100 million read depth. After the alignment against hg19 genome, the alignment range is between 84-91% for different samples and the total corresponding reads for each sample range between 19 to 28 million reads. Quantification process provided a rough estimate of read counts corresponding to the globin genes (HBA1, HBA2, HBB) between 30 - 62% which would later filtered before the normalization process. In this scenario, I would like to know if we sequence the samples to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth. We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.

As per the suggestions, sub-sampling total reads from each sample provided similar results. Is total reads obtained after sequencing each sample (fastq file) and read depth are similar concepts?

RNA-Seq sequence fastq reads • 5.1k views
ADD COMMENT
2
Entering edit mode

From seqtk (also proposed below by genomax):

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq 
seqtk sample -s100 read2.fq 10000 > sub2.fq
ADD REPLY
0
Entering edit mode

Thank you. What is -s100 parameter? Is it the read length?

My data files are compressed in the fastq.gz format. Is this supported by seqtk? Are there any commands and arguments?

ADD REPLY
0
Entering edit mode

Thank you all for the suggestions and advise. Let me rephrase the question once again.

We have generated a set of RNA-seq samples from blood tissue (non globin depleted). These are human paired-end samples with read length of 150bp with 100 million read depth. After the alignment against hg19 genome, the alignment range is between 84-91% for different samples and the total corresponding reads for each sample range between 19 to 28 million reads. Quantification process provided a rough estimate of read counts corresponding to the globin genes (HBA1, HBA2, HBB) between 30 - 62% which would later filtered before the normalization process. In this scenario, I would like to know if we sequence the samples to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth. We are currently in the doing technical validation to check if the RNA-seq analysis and results would be comparable or different with 100M or 50M reads including alignment, quantification, normalization and differential expression analysis.

As per the suggestions, sub-sampling total reads from each sample provided similar results. Is total reads obtained after sequencing each sample (fastq file) and read depth are similar concepts?

ADD REPLY
1
Entering edit mode

These are human paired-end samples with read length of 150bp with 100 million read depth.

and

total corresponding reads for each sample range between 19 to 28 million reads

Does that mean your total dataset had 100 M reads and there are ~5 samples with reads ranging from 19-28M? Or you actually have 100M reads for each sample. Otherwise something does not add up. Note: Word depth is not being used appropriately here. Clarification of context dependent explanation of depth can be found in this thread: What Is The Sequencing 'Depth' ?

I would like to know if we sequence the samples to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth.

Yes if you are thinking about % of globin reads. Because you would be sampling randomly the percentage of globin reads in 50M set should remain at the same relative level as 100M set.

ADD REPLY
0
Entering edit mode

Does that mean your total dataset had 100 M reads and there are ~5 samples with reads ranging from 19-28M? There are 13 samples. Each sample was sequenced for 100 million read depth. However, the fastq files generated shows varying total number of reads for each sample as mentioned below for instance,

Sample 1: 20M Sample 2: 28 M Sample 3: 25 M Sample 4: 19M . . . . . . . . . Sample 13: 22M

Or you actually have 100M reads for each sample. I do not have 100 million reads for each sample, however, each sample was sequenced for 100 million read depth by the facility and the fastq files for each sample has varying total reads. Does total reads obtained after sequencing each sample (fastq file) and read depth in general are similar or different concepts?

Yes if you are thinking about % of globin reads. Because you would be sampling randomly the percentage of globin reads in 50M set should remain at the same relative level as 100M set.

Generally, if we sequence the same set of sample again to 50 million read depth from 100 million read depth, do we get the still get similar results as obtained with 100 million read depth.

ADD REPLY
0
Entering edit mode

Does total reads obtained after sequencing each sample (fastq file) and read depth in general are similar or different concepts?

Would you mind taking a look at the "What is the sequencing depth" post from biostars that I linked above? That should help clarify depth and the context that is needed when you refer to depth. Are you thinking in these terms: A: What Is The Sequencing 'Depth' ?

Each sample was sequenced for 100 million read depth.

This does not have a logical meaning. What are you basing the calculation on?

ADD REPLY
0
Entering edit mode

Thank you. It was helpful.

Does sub-sampling of a particular unaligned fastq file and aligned sam/bam file produce similar outputs?

ADD REPLY
0
Entering edit mode

Does sub-sampling of a particular unaligned fastq file and aligned sam/bam file produce similar outputs?

Probably not. Even if you have unaligned reads in your aligned files. Whole idea of random sampling is to be just that. If you sub-sampled the original data and the aligned file with a specific seed you would get the same output but they would likely not be the same across the two file types.

If you need to do any sub-sampling you should probably do it at the fastq read level.

ADD REPLY
0
Entering edit mode

Thank you. In seqtk, what is -s100 parameter? Is it the read length?

My data files are compressed in the fastq.gz format. Is this supported by seqtk? Are there any commands and arguments?

ADD REPLY
0
Entering edit mode

Options: -s INT RNG seed [11]

It is the seed parameter. If you set it to the same number you would be guaranteed to get the same output for multiple runs.

ADD REPLY
0
Entering edit mode

My data files are compressed in the fastq.gz format. Is this supported by seqtk? Are there any commands and arguments?

No additional arguments needed for gziped files. seqtk handles both FASTA and FASTQ files which can be optionally compressed by gzip.

ADD REPLY
2
Entering edit mode
5.8 years ago
GenoMax 147k

You can use reformat.sh from BBMap suite (look at sampling options) or seqtk (sample option) from Heng Li to sub-sample data randomly.

Data quality can be sample/library dependent so I am not sure you can generalize this type of analysis if you expect to get more human blood samples.

ADD COMMENT
0
Entering edit mode
5.8 years ago
5heikki 11k

This outputs every second read from a fastq file:

cat in.fq | paste - - - - - - - - | awk 'BEGIN{FS="\t";OFS="\n"}{print $1,$2,$3,$4}' > out.fq
ADD COMMENT
0
Entering edit mode
5.8 years ago

Reads in the fastq have contained in their names the tile origin of each read. In this read, it's tile 1106

SNL154:328:CCBTGACXX:1:1106:3995:2249

You could use awk or grep to filter for reads on only half the tiles.

ADD COMMENT
0
Entering edit mode
5.8 years ago
chen ★ 2.5k

Suggest you to use fastp ( https://github.com/OpenGene/fastp ).

fastp -i R1.fq.gz -I R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz -A -Q -L -G --reads_to_process=50000000

-A -Q -L -G options are used to disable quality trimming and adapter trimming, since fastp performs quality and adapter trimming by default.

fastp has a prebuilt binary so that you can use it in a very easy way.

ADD COMMENT
0
Entering edit mode

So, I understand that fastp could be used for the downsampling and subsampling data with compressed fastq files.

ADD REPLY
0
Entering edit mode

Yes, but the downsampling is not random. fastp just uses the begining reads.

ADD REPLY

Login before adding your answer.

Traffic: 1561 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6