Hi All,
I'm looking for advice and also to know what people usually do to account for differences between different sequenced experiments. For example if you have 2 ChIP-seq/RNA-seq/... experiments, one with 60M reads and the other with 40M reads, how do you obtain:
- BAM/SAM/BED file with the same number of reads?
- A normalized WIGGLE/BEDGRAPH/BIGWIG signal track (i.e. RPM units)?
For the first question I searched a bit and found that "samtools view" have the option "-s" which you could indicate the fraction of templates to subsample from the raw mapped file, like this:
samtools view -bh -s 0.66 bam.file > subsampled.bam.file
For the second question I was thinking to take a wiggle file and divide values at each bp by total number of reads and then multiply by 1e6, like this:
awk '{if($1 !~ /^fixed/){OFS="\t"; printf "%0.2f\n",($1/60000000)*1e6}else{print $0}}' raw.fixedStep.wig.file > normalized.wiggle.file
Do you think this methods are good to account for read differences and to normalize signal tracks?
Thanks!
For #1, why do you need that? It would be easier to use a library normalization and then weight samples accordingly in the followup analysis (this is how most RNAseq analysis packages work).