If the reference genome is very big (as it is for many plant species), we'd like to first split the reference into smaller chunks, for example chromosome by chromosome, then map the FASTQ reads to each chromosome reference separately and merge the results.
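Roughly this kind of workflow is what I mean (just a sketch with placeholder file names, assuming bwa and samtools):

```bash
# Extract each chromosome, index and map against it separately, then merge.
samtools faidx genome.fa                          # FASTA index; lists sequence names
for chr in $(cut -f1 genome.fa.fai); do
    samtools faidx genome.fa "$chr" > "$chr".fa   # pull out one chromosome
    bwa index "$chr".fa                           # index that chunk alone
    bwa mem "$chr".fa reads.fq | samtools sort -o aln."$chr".bam -
done
samtools merge -f merged.bam aln.*.bam            # combine per-chromosome BAMs
```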
My worry is that this would completely change the alignment picture compared to running against one complete genome: a single read could potentially map MANY times. For example, a read that is unique to chr1 will map to chr1 with the highest mapping score when the complete genome is used as the reference, but when we map against each chromosome reference separately, that same read could also align to similar sequences on the other chromosomes, bringing in many false positives.
So after we map to the separate chromosome references and merge the results, are there any tools to recalculate the mapping scores? Maybe dedup tools?
But to me, dedup usually means finding alignments with the same sequence + start + end + orientation and removing potential PCR duplicates. So is there another type of "dedup" that retains only the best mapping for each read and removes the other, lower-scoring mappings?
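Something like this untested sketch is what I have in mind (placeholder file names; it assumes single-end reads and that the aligner writes an AS alignment-score tag, as bwa does):

```bash
# Name-sort the merged BAM so all alignments of a read are adjacent,
# then keep only the record with the highest AS tag for each read name.
samtools sort -n -o merged.byname.bam merged.bam
samtools view -h merged.byname.bam | awk '
    BEGIN { FS = OFS = "\t" }
    /^@/  { print; next }                      # pass the header through
    {
        score = -1000000
        for (i = 12; i <= NF; i++)             # optional tags start at column 12
            if ($i ~ /^AS:i:/) { split($i, t, ":"); score = t[3] + 0 }
        if ($1 != name) {                      # new read: flush the previous best
            if (name != "") print best
            name = $1; bestscore = score; best = $0
        } else if (score > bestscore) {        # same read, higher score wins
            bestscore = score; best = $0
        }
    }
    END { if (name != "") print best }
' | samtools view -b -o best.bam -
```

(For paired-end data the key would have to include the mate flag, not just the read name.)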
thx
Mapping against a reduced reference is always going to cause problems: mapping quality is computed relative to whatever reference the aligner sees, so a read that is multi-mapping genome-wide can look unique within a single chromosome, and those inflated MAPQ values can't simply be recalculated after merging.
bwa should be able to handle large genomes without having to split them.
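For reference, something along these lines works on multi-gigabase genomes (placeholder file names; `-a bwtsw` explicitly selects the indexing algorithm meant for large genomes, which recent bwa versions pick automatically based on genome size):

```bash
bwa index -a bwtsw genome.fa                      # index the whole genome once
bwa mem -t 8 genome.fa reads.fq | samtools sort -o aln.bam -
samtools index aln.bam
```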