But it cannot be that simple: if the two reads of a pair have been mapped to two distinct chromosomes, I'm afraid some operations could lose information about the pair. So I suppose I should create one extra BAM file to hold those pairs.
Which of the following operations can safely be run on one chromosome at a time?
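A minimal sketch of that routing, assuming the alignments are available as SAM text lines (for example from `samtools view -h`). The function name `route_sam_records` and the bucket name `cross_chromosome` are made up for illustration. It relies only on the SAM columns RNAME (3rd) and RNEXT (7th), where `=` in RNEXT means "same reference as RNAME" and `*` means no mate information is available.

```python
from collections import defaultdict


def route_sam_records(sam_lines, cross_bucket="cross_chromosome"):
    """Group SAM alignment lines by chromosome.

    Records whose mate is on another reference go to one extra bucket,
    so both reads of such a pair end up together in the same place.
    """
    buckets = defaultdict(list)
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        rname, rnext = fields[2], fields[6]
        # RNEXT "=" means same reference as RNAME; "*" means no mate info.
        if rnext in ("=", "*") or rnext == rname:
            buckets[rname].append(line)
        else:
            buckets[cross_bucket].append(line)
    return dict(buckets)


# Toy records (hypothetical read names and coordinates).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r2\t65\tchr1\t100\t60\t4M\tchr2\t500\t0\tACGT\tFFFF",
    "r2\t129\tchr2\t500\t60\t4M\tchr1\t100\t0\tACGT\tFFFF",
]
routed = route_sam_records(example)
for bucket, records in sorted(routed.items()):
    print(bucket, len(records))
```

In this sketch unmapped reads with no mate information land in a bucket named `*`; whether they belong in their own file or with the extra file is a choice left open here.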
MarkDuplicates
GATK: Indel Realignment
GATK recalibration
ValidateSamFile
FixMateInformation
Do you have any experience with splitting BAMs? Is it worth it?
I look at it this way: if the reads of a pair are mapped to different chromosomes, then that pair would not be useful for many of the analyses anyhow. Behavior like that is likely due to errors, to some sort of structural variation in the genome, or to a combination of both - but if the study is not designed to interpret structural variants, then losing some of them may not matter.
I think indel realignment is relatively "local", so it should be fine to run it chromosome by chromosome. Recalibration needs a lot of reads to estimate the error rate, but if you have high-coverage data for the whole chromosome, that should be plenty. However, the file with all the pairs mapping to different chromosomes might not be very representative and/or might be enriched in mismatches. Most of those pairs are chimeric artifacts.
Can you clarify why you want to split the bams? Is I/O your limiting factor when running parallel analysis?
No, I'm just thinking about how I could speed up the analysis on our new cluster, and whether someone has already done this.