Hi,
I need to analyze down-sampled versions of a couple of full RNA-Seq data sets (paired-end FASTQ samples; one example is ERR188044). The sub-sampling method should work the same way for all samples. In the end I want to compare, for example, how 5% of the full data differs from 10%, 20% and 40% of it. The final graph will show how the amount of data affects the result.
My questions are: How can I obtain the data in these four forms? Should I first download the full data and then downsample, or can I download down-sampled data directly? How do I sub-sample the data to keep only a few specific chromosomes? How do I sub-sample the data to keep only a given percentage of all the paired-end reads?
What do you suggest I do?
Your advice is appreciated.
Thanks.
Unless you are doing a 2-pass alignment, I'd say that reads are aligned independently. Wouldn't it then be easier and more efficient to downsample the read counts table instead? See for example subsample.
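To illustrate what downsampling a counts table could look like, here is a minimal Python sketch. It is not the subsample function mentioned above, just an assumed stand-in: each counted read is kept independently with probability equal to the target fraction (binomial thinning). The gene names and counts are made up, and the function name thin_counts is hypothetical.

```python
import random


def thin_counts(counts, fraction, seed=0):
    """Keep each counted read independently with probability `fraction`.

    counts   : dict mapping gene name -> integer read count
    fraction : target fraction of the full data, between 0 and 1
    seed     : seed for the random generator, so a run can be repeated
    """
    rng = random.Random(seed)
    thinned = {}
    for gene, n in counts.items():
        # One Bernoulli draw per read; simple but slow for very deep data.
        thinned[gene] = sum(rng.random() < fraction for _ in range(n))
    return thinned


# Made-up counts, for illustration only.
full = {"geneA": 1000, "geneB": 50, "geneC": 0}

# The fractions asked about in the question: 5%, 10%, 20% and 40%.
for frac in (0.05, 0.10, 0.20, 0.40):
    print(frac, thin_counts(full, frac, seed=1))
```

Changing the seed gives an independent replicate at the same fraction. This only makes sense if counts are all you need downstream.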
If you just need counts, then yes. It's not clear to me that that's exactly what's going on here, though.
Thanks, Devon! Could you please explain what you mean by subsampling "typically a few times"? I understand that, for the result to be robust, it's better to subsample several times, but I don't know how to decide how many times, or how to do it using seqtk. For instance, if I need to downsample 10 M PE reads to 2 M PE reads, should I subsample 500,000 PE reads, say, 4 times and then merge them together? But then I have a problem: with seqtk this will lead to repeated reads, because every time I subsample randomly from the same original file. Could you please recommend anything to look into to get more ideas on how to decide? Thank you!
2 or 3 times per read number should be fine to produce a smooth enough curve. So if you start with 10 million reads, then produce 2 or 3 datasets each of 1, 3, 5 and 7 million reads. You can just rerun seqtk with a different seed each time, since otherwise you'll end up with the same subsampled reads again and again.
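As a rough sketch of that scheme, the Python snippet below only builds and prints the seqtk command lines (it does not run them). It assumes seqtk's usual `seqtk sample -s<seed> <in.fq> <number>` form, and it gives both mate files of a pair the same seed so the pairs stay in sync. The input and output file names, the starting seed of 100 and the function name seqtk_commands are all hypothetical.

```python
import shlex


def seqtk_commands(r1, r2, depths, replicates=3, base_seed=100):
    """Build seqtk shell commands for paired-end subsampling.

    Both mates of a pair get the same seed so that the same records
    are picked from each file and the read pairs stay matched.
    """
    commands = []
    seed = base_seed
    for depth in depths:
        for rep in range(1, replicates + 1):
            for mate, path in (("1", r1), ("2", r2)):
                # Hypothetical output naming scheme.
                out = f"sub_{depth}_rep{rep}_{mate}.fastq"
                commands.append(
                    f"seqtk sample -s{seed} {shlex.quote(path)} {depth}"
                    f" > {shlex.quote(out)}"
                )
            seed += 1  # a new seed for every replicate and depth
    return commands


# Hypothetical input file names for the sample mentioned in the question.
for cmd in seqtk_commands("ERR188044_1.fastq", "ERR188044_2.fastq",
                          [1_000_000, 3_000_000, 5_000_000, 7_000_000]):
    print(cmd)
```

Each replicate is drawn directly from the full file at the target size, so there is no need to merge smaller subsamples and no risk of duplicated reads from merging.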