Hi all, newbie question here:
we have a hg19 reference file which is sorted in an unusual manner (see the order of contigs below). It was downloaded a few years ago, and we have no info as to its origin.
we have 40 bam files already aligned to this reference (are also sorted in this manner).
for the next pipleline steps, we would like to switch to the UCSC official hg19 reference. now, when trying to use gatk tools with the UCSC reference, it gives an error (see below).
we could try using gatk's ReorderSam to solve this problem, but we have 2 concerns:
1) Is there a way to identify where the original reference came from, and to verify it is indeed an acceptable reference? the weird sorting makes us suspicious.
things i've checked: - contig name are the same in both references - samtools flagstat is almost identical (few reads apart) for 2 bam files aligned against each reference
2) is using gatk's ReorderSam an acceptable solution for our situation? is it "good practice" to switch from one reference to another (UCSC) in the middle of a pipeline?
We are trying to decide if realigning 40 bams is necessary, and would appreciate your thoughts.
following is GATK error message- see order of old reference:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs.
Please see https://www.broadinstitute.org/gatk/guide/article?id=63 for more infor
mation. Error details: The contig order in reads and reference is not the same;
to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, c
hr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20,
chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_m
ann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg
9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl00019
3_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_ra
ndom, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chr
Un_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl00021
1, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_
gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random,
chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl0002
33, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl0002
41, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn
_gl000234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248
, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, c
hr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn
_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231, chrUn_gl000229
, chrM, chrUn_gl000226, chr18_gl000207_random]
##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr
7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, ch
r19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random
, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr
6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ss
to_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_
gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_rando
m, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204
_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, ch
r19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl00021
1, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl00021
6, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl00022
1, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl00022
6, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl00023
1, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl00023
6, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl00024
1, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl00024
6, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
If you don't have the original sequence data you could get the sequences out using
bamtofastq
and then realign the data. Trying to hunt down what reference was used may be an exercise in frustration/of questionable value.If you are (even if only a bit) uncertain about the quality or integrity of your reference, I would definitely re-align everything. In the worst case you encounter any trouble later in the project due to this reference but are held responsible for it, as you are the data analyst. Do yourself a favor and exclude these sources of trouble right away.
thanks genomax and Alexander. we actually do have the original data, but it would take time to realign since we have many samples. samtools flagstat returned same results for data aligned with the suspicious reference and the UCSC reference. Isn't it enough to prove that the suspicious reference is legit?
It really depends on what you want to do next. Anytime a reference is in question (or not of defined provenance) all downstream analysis may be in doubt, especially if someone needs to/tries to replicate your analysis. Realigning the data will take time now but you won't need to worry about questions down the road.