I have an interesting case I would like to share with you.
Our NGS bioinformatics pipeline called the following four variants (top of the following image), whereas Sanger sequencing gave us a single variant.
Sample: DNA from blood
Pipeline: BWA-MEM and HaplotypeCaller
I have a couple of questions. First, why is the first variant unphased when it sits on the same read as the others in many of the sequenced reads? Second, I understand the limitations of short-read technology, but this delins is small and covered by many reads, so why is there such a significant difference between the two approaches?
Post the sequences as text so that people can align them.
Your alignments also look a little weird. Sometimes a mismatch is represented as a gap symbol. Why is that?
In general, though, this is a problem with low-information, repetitive regions that have variation in them. It is a well-known problem.
What helps here is if you manually run the alignments and see how and why the math resolves them a certain way.
I would do it myself, but I don't want to type up the sequences.
Thanks for your comment.
This is one read carrying the last three variants from the table above:
Mapping = Primary @ MAPQ 60
Reference span = chr15:48,740,946-48,741,090 (-) = 145bp
Cigar = 43M3I14M3I88M
Clipping = None
Mate is mapped = yes
Mate start = chr15:48740924 (+)
Insert size = -166
Second in pair
Pair orientation = F1R2
MC = 64M3I14M3I67M
NM = 8
AS = 117
XS = 20
Hidden tags: MD, RG
Location = chr15:48,741,010
Base = A @ QV 40
Alignment start position = chr15:48740946
ATAATAATTGCATACTTACCCAAGCACATGGTTTGGTCATCATTTGTTGTTTTAAAACAAATGATGTGGCAAAGGCAATAAAAGCTTCCAACTGTGTCAATGCACTGCCCATGACTGCATATATTGGGGATTTCTTGACATTCATTACGAT
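As a quick sanity check on the read details above, the CIGAR string and the NM tag are enough to recompute the reference span, the read length, and the alignment score. This is only a sketch: the helper names are made up for illustration, and the scoring values (match 1, mismatch 4, gap open 6, gap extend 1) are assumed to be the BWA-MEM defaults, which may not be what the pipeline actually ran with.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def reference_span(ops):
    """Bases consumed on the reference (M, D, N, =, X)."""
    return sum(n for n, op in ops if op in "MDN=X")

def read_length(ops):
    """Bases consumed on the read (M, I, S, =, X)."""
    return sum(n for n, op in ops if op in "MIS=X")

def alignment_score(ops, mismatches, match=1, mismatch=4, gap_open=6, gap_extend=1):
    """Affine-gap score; the defaults are ASSUMED BWA-MEM defaults."""
    aligned = sum(n for n, op in ops if op in "M=X")
    gap_penalty = sum(gap_open + gap_extend * n for n, op in ops if op in "ID")
    return (aligned - mismatches) * match - mismatches * mismatch - gap_penalty

ops = parse_cigar("43M3I14M3I88M")          # CIGAR of the read above
indel_bases = sum(n for n, op in ops if op in "ID")
mismatches = 8 - indel_bases                 # NM = mismatches + inserted/deleted bases
print(reference_span(ops), read_length(ops), alignment_score(ops, mismatches))
```

Under those assumptions the span comes out as 145 bp and the score as 117, which agrees with the reported reference span and AS tag: the aligner paid for two 3 bp insertions plus two mismatches rather than one larger event.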
And this is the reference:
tttattttgt atatagcaaa aatactacta aaagacttag tattaaattt 48740895
tatccatatt tagaatcaaa tgaagctttc aacagcatat gaaaaaaata 48740945
ATAATAATTG CATACTTACC CAAGCACATG GTTTGGTCAT CATTTGTTTT 48740995
AAAACcagTG TGGCAAAGGC AATAAAAGCT TCCAACTGTG TCAATGCACT 48741045
GCCCATGACT GCATATATTG GGGATTTCTT GACATTCATT ACGATctgta 48741095
aataagaagc atcttaagtg agaacttaga agacaaaata taattgaata 48741145
acttacttct agctatcatt ctcaggagta atcctagctc taaac.
If this is not what you asked for or need something else, please let me know.
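For anyone who wants to see what the aligner actually did with this read, here is a rough sketch that walks the CIGAR over the read and the reference posted above and lists every difference. The function name `walk_alignment` is hypothetical, the reference is just the 145 bases starting at chr15:48740946 copied from the post and upper-cased, and only M, I and D operations are handled because that is all this CIGAR contains.

```python
import re

READ = (
    "ATAATAATTGCATACTTACCCAAGCACATGGTTTGGTCATCATTTGTTGT"
    "TTTAAAACAAATGATGTGGCAAAGGCAATAAAAGCTTCCAACTGTGTCAA"
    "TGCACTGCCCATGACTGCATATATTGGGGATTTCTTGACATTCATTACGAT"
)
# Reference chr15:48740946-48741090, copied from the post and upper-cased.
REF = (
    "ATAATAATTGCATACTTACCCAAGCACATGGTTTGGTCATCATTTGTTTT"
    "AAAACCAGTGTGGCAAAGGCAATAAAAGCTTCCAACTGTGTCAATGCACT"
    "GCCCATGACTGCATATATTGGGGATTTCTTGACATTCATTACGAT"
)
REF_START = 48740946

def walk_alignment(cigar, read, ref, ref_start):
    """Return (position, ref_allele, read_allele) for each difference.

    Insertions are reported at the reference base just before them,
    with "-" as the reference allele.
    """
    events = []
    r = q = 0  # offsets into ref and read
    for n, op in ((int(n), op) for n, op in re.findall(r"(\d+)([MID])", cigar)):
        if op == "M":
            for k in range(n):
                if read[q + k] != ref[r + k]:
                    events.append((ref_start + r + k, ref[r + k], read[q + k]))
            r += n
            q += n
        elif op == "I":
            events.append((ref_start + r - 1, "-", read[q:q + n]))
            q += n
        else:  # "D"
            events.append((ref_start + r, ref[r:r + n], "-"))
            r += n
    return events

events = walk_alignment("43M3I14M3I88M", READ, REF, REF_START)
for pos, ref_allele, read_allele in events:
    print(pos, ref_allele, read_allele)
```

On this read the walk reports a TTG insertion after 48740988, then a mismatch, an ATG insertion and another mismatch packed into 48741001-48741003. That cluster adds up to the NM of 8 and shows how one small delins can end up split into several adjacent calls in a repetitive stretch.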
The sequences are as follows:
Personally it looks to my eye like some of the reads are soft-clipped. HaplotypeCaller still uses soft-clipped bases in graph-based local assembly, but perhaps your caller did not?
The reference should not be shorter than the longest alignment.
You should include the reference from the leftmost to the rightmost aligned position (and probably 10 more bases before and after as well); otherwise we can't recreate the alignments.
Sorry, Istavan, I can only provide the sequences from the OP's post. Providing the rest is probably the OP's responsibility.
I have checked, and there are several soft-clipped reads.