Way too much data generated for 1 WGS sample
3
0
Entering edit mode
6.8 years ago

Hi

I am dealing with whole genome data (Illumina, paired end , 2 X 150bp) for some bacterial species. The required data was ~ 3Gb, however for 1 out of 10 samples, we have got way too much data i.e. around 15 Gb. i don't know the wetlab part, how it happened at the first place. What I want to do now is to scale down this 15 Gb to 3Gb without loosing information.

I have thought upon various ways as mentioned below, however, I need suggestions

1. Random selection

shuffling and randomly selecting (subsetting to 3 gb) the reads ? However, that could be a bad idea as I may (or may not!) end up with reads corresponding to few specific regions. That way I loose information (coverage?)

2. Removing duplicates

Removing reads duplicates using sequence match strategy. However, again I may end loosing a good coverage across regions which could hamper the genome assembly process.

3. Mapping reads to reference and extract the mapped reads

Already tried this method, howeever, 95 % of the reads maps to the genome. Extracting those reads would not help reduce the data.

You may ask, why I want to reduce the data size. I want to maintain the integrity across all the samples i.e. I want the data to be around 3 gb for all samples.

data ngs wgs bacteria • 2.0k views
ADD COMMENT
1
Entering edit mode

I am curious as to why you think #1 would not work. Your data file should have no order in it and so the end result should be random.

With reformat.sh from BBMap you have plenty of sampling parameters to work with:

Sampling parameters:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
ADD REPLY
4
Entering edit mode
6.8 years ago
GenoMax 148k

You can use bbnorm.sh from BBMap suite to do read normalization.

ADD COMMENT
0
Entering edit mode

that sounds like a fantastic idea. I want to understand how bbnorm works. Although, there is some information on the offical page, I am still confused. Can you point me to any online explanation for the same?

ADD REPLY
0
Entering edit mode

+ 1 for that. Worked wonders!

ADD REPLY
1
Entering edit mode
6.8 years ago

What you are looking for is 'computational normalization'. These normally do some form of k-mer counting, iterate over all reads, and discard all reads which consist of k-mers that were seen n times.

There are several software packages to do this, I've made good experiences with khmer's normalize-by-median.py: http://khmer.readthedocs.io/en/latest/user/guide.html

I believe some assemblers use the copy number to infer how many copies of a region there should be (not really a problem with bacterial genomes I think?!?) so be careful.

Alternatively, ABYSS v2.0 has a similar inbuilt function also using a Bloom filter, see abyss/konnector2 : Usage Scenario and https://github.com/bcgsc/abyss#assembling-using-a-bloom-filter-de-bruijn-graph

ADD COMMENT
0
Entering edit mode

thanks pal! Let me give it a try!!

ADD REPLY
0
Entering edit mode
6.8 years ago
mkulecka ▴ 380

Maybe you should try combination of approaches 1 and 3 - allign reads to the reference and then subsample the bam file (for example with samtools view -s option).

ADD COMMENT

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6