Hi
I am dealing with whole-genome data (Illumina, paired-end, 2 x 150 bp) for some bacterial species. The requested amount of data was ~3 Gb per sample, but for 1 out of 10 samples we received far too much, around 15 Gb. I don't know the wet-lab side, so I am not sure how this happened in the first place. What I want to do now is scale this 15 Gb down to ~3 Gb without losing information.
I have thought about a few approaches, listed below, but I would appreciate suggestions.
1. Random selection
Shuffling the reads and randomly selecting a subset (down to ~3 Gb)? However, that could be a bad idea, as I may (or may not!) end up with reads concentrated in a few specific regions and lose information (coverage) elsewhere. (A minimal subsampling sketch follows this list.)
2. Removing duplicates
Removing duplicate reads using a sequence-match strategy. However, here too I may end up losing coverage across some regions, which could hamper the genome assembly. (A deduplication sketch also follows the list.)
3. Mapping reads to a reference and extracting the mapped reads
I have already tried this, but ~95% of the reads map to the genome, so extracting the mapped reads would barely reduce the data.
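For option 1, a minimal sketch, assuming "Gb" here means gigabases of sequence and using seqtk (file names are placeholders): at 2 x 150 bp, ~3 Gb is roughly 10 million read pairs (3e9 bases / 300 bases per pair). Using the same random seed on both files keeps the mates in sync.

    # ~3 Gb at 2 x 150 bp is about 10 million read pairs (3e9 / 300).
    # The identical seed (-s100) on both files keeps R1/R2 pairs synchronized.
    seqtk sample -s100 sample_R1.fastq.gz 10000000 | gzip > sub_R1.fastq.gz
    seqtk sample -s100 sample_R2.fastq.gz 10000000 | gzip > sub_R2.fastq.gz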
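For option 2, a rough sketch of alignment-free duplicate removal with clumpify.sh from BBMap (file names are placeholders; the flags are recalled from the BBMap documentation, so check clumpify.sh --help on your install):

    # Remove duplicate read pairs without mapping to a reference.
    clumpify.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out1=dedup_R1.fastq.gz out2=dedup_R2.fastq.gz dedupe=t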
You may ask why I want to reduce the data size at all: I want to keep the samples consistent, i.e. roughly 3 Gb of data for every sample.
I am curious as to why you think #1 would not work. Your data file should have no particular order in it, so a random subset should be effectively random across the genome and the coverage should stay even.
With reformat.sh from BBMap you have plenty of sampling parameters to work with (see the "Sampling parameters" section of its help output).
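For example, a sketch only (file names are placeholders and the flag names below are recalled from BBMap's documentation, so confirm them against reformat.sh --help):

    # Downsample both mates together to ~3 Gb of bases; a fixed seed makes it reproducible.
    # samplereadstarget= could be used instead to aim for a read count.
    reformat.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out1=sub_R1.fastq.gz out2=sub_R2.fastq.gz \
        samplebasestarget=3000000000 sampleseed=13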