I'm a newcomer to the sequencing world. From another lab, I have a large (full human genome) mate-pair BAM file produced by the following steps:
1. Trimming of reads to 30 bp for Phred scores under 20 (software unknown).
2. Alignment against GRCh37 with BWA 0.5.8a.
3. De-duplication with GATK 1.0.4.
4. Local realignment around known indels and base score recalibration with GATK 1.0.4.
5. Picard's FixMateInformation (version unknown).
I want to realign the reads against GRCh38 using newer software; in other words, I want to undo steps 1–5, or at least 2–5.
Will SAMtools bamtofastq handle this correctly? Specifically, it seems that de-duplication using an alignment against GRCh37 (step 3) may have permanently changed the BAM by removing reads that might be aligned differently against GRCh38. Since the only command from GATK I could find for de-duplication is MarkDuplicates, which doesn't delete any reads, I will assume this was used. Are there any other steps that would be an issue, and is bamtofastq the right way to do this?
I understand these steps algorithmically but don't know how the data in BAM format is actually altered.
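To make the question concrete, here is a minimal sketch of what I understand a BAM-to-FASTQ conversion has to undo for each record, based on the SAM specification: reads aligned to the reverse strand (FLAG bit 0x10) are stored reverse-complemented with reversed qualities, and recalibration overwrites the QUAL column. The sketch assumes the recalibration step kept the original qualities in an `OQ` tag, which I have not verified for this file. The function name `original_read` is my own, and it works on one text line as printed by `samtools view`.

```python
# Sketch only: recover the as-sequenced read from one SAM text line.
# Assumption: original qualities, if kept, are in an OQ:Z: tag.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def original_read(sam_line):
    fields = sam_line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]

    # Optional tags start at column 12; prefer pre-recalibration qualities.
    for tag in fields[11:]:
        if tag.startswith("OQ:Z:"):
            qual = tag[5:]
            break

    # FLAG 0x10: read was aligned to the reverse strand, so SEQ is stored
    # reverse-complemented and the qualities are stored reversed.
    if flag & 0x10:
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]

    # FLAG 0x40 / 0x80: first / second read of the pair (0 if neither is set).
    mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
    return name, mate, seq, qual
```

For example, a record with FLAG 83 (paired, proper pair, reverse strand, first in pair), SEQ `AACG`, QUAL `ABCD` and tag `OQ:Z:EFGH` would come back as read 1 with sequence `CGTT` and qualities `HGFE`. The trimming in step 1 cannot be undone this way, since the trimmed bases are not stored in the record.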
Thanks!
Did they remove duplicates or just mark them?
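One way to check, as a rough sketch: duplicate marking sets FLAG bit 0x400 on a record, so if any records carry that bit the duplicates were only marked, while a count of zero suggests they were removed (or never marked). The function name `count_duplicates` is hypothetical, and it expects text lines as printed by `samtools view`.

```python
# Sketch only: count records with the duplicate bit (0x400) set.
def count_duplicates(sam_lines):
    total = dups = 0
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines (present if samtools view was run with -h).
            continue
        flag = int(line.split("\t", 2)[1])
        total += 1
        if flag & 0x400:
            dups += 1
    return total, dups
```

You could pipe `samtools view input.bam` into a script that calls `count_duplicates(sys.stdin)`. `samtools view -c -f 1024 input.bam` or `samtools flagstat input.bam` should give the same count without any scripting.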
Thanks! I updated my post.