I am looking to infer large-scale somatic copy number alterations (per sample) from a set of low-coverage samples.
I see a lot of noise that is difficult to distinguish from real copy number alterations. My attempts at mappability filtering and deduplication have not yielded any significant improvement.
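For reference, my current deduplication and mapping-quality filtering looks roughly like this (the MAPQ cutoff of 30 and the thread counts are my own arbitrary choices, and the pipeline assumes paired-end data):

```bash
# Documented samtools markdup workflow: collate by name, add mate score
# tags, coordinate-sort, then mark and remove (-r) duplicates.
samtools collate -@ 4 -O sample.raw.bam \
  | samtools fixmate -m - - \
  | samtools sort -@ 4 - \
  | samtools markdup -r - sample.dedup.bam

# Keep only primary, mapped, non-duplicate reads with MAPQ >= 30
# (0x504 = unmapped + secondary + duplicate).
samtools view -b -q 30 -F 0x504 -o sample.filtered.bam sample.dedup.bam
```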
There are some practical challenges with these samples:
- I have no normal reference, but the toolkit I am using has built-in mappability filtering, panel-of-normals support, and GC correction.
- The samples are severely degraded, so the read distribution is very heterogeneous: there are spikes and empty valleys that inflate the overall spread of read counts and may be mistaken for CNAs.
- Median depth of coverage varies between 0.5× and 2.5× per sample.
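A per-sample median like the ones above can be computed along these lines (GNU datamash is just one convenient way to take a median; slow but simple):

```bash
# Per-base depth at every position (-a includes zero-coverage positions,
# so the empty valleys count toward the median); column 3 is the depth.
samtools depth -a sample.filtered.bam | datamash median 3
```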
1. This kind of data would benefit greatly from an aligner that produces more homogeneous (less sparse) coverage than bwa mem does. Are there any preferred aligners for a severe-degradation scenario?
2. Could masking repetitive regions improve read alignment in this context? (A sketch of what I mean follows this list.)
3. A poster on this forum suggested using digital normalization (tools like bbnorm, ROCKS and khmer). If I understand this correctly, tools of this kind are designed with "proper" coverage in mind, and running them on anything below 16× is suboptimal. Does it make sense to use these tools on samples with uneven coverage and median depth down to 0.5×? Are there parameters that can be tuned to account for this kind of data? (The invocation I was considering is also sketched below.)
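To make question 2 concrete, the kind of masking I had in mind is hard-masking annotated repeats in the reference before re-indexing; this is only a sketch, and the repeats.bed file (e.g. a RepeatMasker track) is a placeholder:

```bash
# Hard-mask repeat intervals with N so reads can no longer align there,
# then rebuild the BWA index against the masked reference.
bedtools maskfasta -fi ref.fa -bed repeats.bed -fo ref.masked.fa
bwa index ref.masked.fa
```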
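And for question 3, the bbnorm invocation I was considering looks roughly like this; the target= and min= values are guesses on my part (min=0 to avoid discarding low-depth reads outright), not settings I know to be appropriate:

```bash
# Digital normalization with BBNorm: cap coverage peaks at target=20 and,
# given the very low median depth, keep low-depth reads (min=0).
bbnorm.sh in=sample.fq.gz out=sample.norm.fq.gz target=20 min=0 k=31
```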
Please let me know if any other metrics or information should be added to this post to clarify the challenges.
Sometimes getting more data is going to be the only answer. No bioinformatics magic is going to make the data usable if there is not enough of it in the first place.
That said, if your data comes from degraded DNA, perhaps look into alignment methods developed for such material (e.g., ancient/Neanderthal data).
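For example, one parameter set that comes up repeatedly in the ancient-DNA literature uses bwa aln with seeding effectively disabled and relaxed gap/edit-distance settings; treat the exact values as a starting point rather than a recommendation:

```bash
# bwa aln settings often cited for ancient/degraded DNA: -l 1024 disables
# seeding, -n 0.01 relaxes the edit-distance threshold, -o 2 allows two
# gap opens. samse is for single-end reads, typical of degraded libraries.
bwa aln -t 8 -l 1024 -n 0.01 -o 2 ref.fa sample.fq.gz > sample.sai
bwa samse ref.fa sample.sai sample.fq.gz | samtools sort -o sample.bam -
```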