I have both DNA-seq and bisulfite-seq (BS-seq) data for several inbred genomes of a crop plant. I aligned the DNA to the reference genome and called high quality SNPs. I then took those SNPs and used them to modify the reference, I did this recursively a few times. I ended up with customized references for each inbred genome that were position identical but had divergent sequences. This was done in the hope of increasing alignment rates of BS-seq data and reducing false methylation calls.
The challenge comes with the hybrid of my inbreds. I have two genomes for each and had to align the same BS-seq reads to both. I now need to figure out which aligned reads likely come from each genome. But I am a little stumped as how to compare and then filter two bam files. I want to identify those reads that mapping better in one of the genomes and then remove it from the other's alignment. Are there any tools out there that my googling may have missed?
Does bamUtil diff do this? I can't tell from reading it.
As of now, BBMap is not capable of handling all bisulfite-altered reads. However, the package does contain a neat tool bbsplit.sh) which will take an input fastq of reads and output multiple fastqs, based on which reference fasta they best match. Seal does the same thing, but uses kmer-matching rather than alignment, so it has lower sensitivity.
Hey, thanks for your answer I will look into those functions but I fear they are not what I need. I'd strongly prefer not to align again if I could help it, these are very large fastq files. Plus all the information that is needed is present in the alignments I already have. I don't think what I want to do is a difficult problem, just an odd one, unfortunately.