In a raw pileup file, the line before an indel line (*) sometimes has multiple insertion (+4caaa) or deletion (-13GGCGCGCGTGCGC) strings in the read bases column (column 9).
How can I get rid of them? Any quick awk or perl suggestions?
2L 650 t T 48 0 34 7 ..,.,,+4caaa,+4caaa C?CCA=?
2L 650 * */* 9 0 34 7 * +CAAA 5 2 0 0 0
2L 654 A A 48 0 34 7 .$.$,.,+1g,, DBC?CCA
2L 654 * */* 19 0 34 7 * +G 6 1 0 0 0
2L 2332 g G 60 0 14 33 .,...,,.-13GGCGCGCGTGCGC.,,A,,A,,A,..........aa.. DCBBBBDBCCCCBCCCDCBDCBCCCACCBABCC
2L 2332 * */* 61 0 14 33 * -ggcgcgcgtgcgc 32 1 0 0 0
2L 3334 a A 163 0 15 49 ..$,..,,,t,.T,,,..,,,,T,-7attattt,,-7attattt,,,,,,....,,......,.,,.. BBCA>BCCCC:CCCC>ACCCCBCCCCBCCCCCCDCCCCCCCCBCBCDCC
2L 3334 * */-attattt 27 27 15 49 * -attattt 47 2 0 0 0
2L 3928 c C 32 0 0 11 ,,-4tctt,,.-4TCTT...-4TCTT.^!.^!, CCC8CCCCBCA
2L 3928 * */-tctt 157 157 0 11 * -tctt 8 3 0 0 0
Column 9 holds the read bases at that position. The extra strings (+4caaa and -13GGCGCGCGTGCGC) are insertions or deletions, which are reported again on the next line (the indel line marked with *). I need to remove only these strings, not the other nucleotides in that column.
For example, at 2L 650 the read depth is 7 (column 8), so there should be 7 characters in column 9, but there are more because of the extra +4caaa (twice), which is the only thing I need to remove. Base qualities are given in column 10.
If there is a potential SNP, column 9 will also contain a, t, g, c, A, T, G or C, apart from the reference matches (. and ,).
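One possible approach is a small scanner instead of a single regular expression. The number after + or - says exactly how many characters belong to the indel, so a greedy pattern such as [+-][0-9]+[ACGTNacgtn]+ could also swallow real bases that follow it. The sketch below is in Python rather than awk or perl, and it assumes the usual samtools pileup conventions: +N<seq> and -N<seq> for indels, and ^ followed by one mapping-quality character for a read start. The function name strip_indels is made up for illustration.

```python
import re

# A sign followed by the indel length; the sequence itself is skipped by length.
_INDEL_HEAD = re.compile(r"[+-](\d+)")


def strip_indels(bases: str) -> str:
    """Remove +N<seq> / -N<seq> tokens from a pileup read-bases string.

    Everything else (bases, '.', ',', '*', '$', '^X') is kept unchanged.
    """
    out = []
    i = 0
    while i < len(bases):
        if bases[i] == "^":
            # Read start: the next character is a mapping quality and may
            # itself be '+' or '-', so copy both characters without parsing.
            out.append(bases[i:i + 2])
            i += 2
            continue
        m = _INDEL_HEAD.match(bases, i)
        if m:
            # Skip the sign, the digits, and exactly N sequence characters.
            i = m.end() + int(m.group(1))
            continue
        out.append(bases[i])
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Column 9 of the 2L 650 line above; 7 characters remain.
    print(strip_indels("..,.,,+4caaa,+4caaa"))  # ..,.,,,
```

Read-start (^X) and read-end ($) markers are deliberately left in place here, so column 9 can still be longer than the depth, as in the 2L 654 line.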
Thanks for pointing out my error. However, as far as I understand, a few short reads were poorly aligned against the reference genome, but in the end the pileup algorithm decided that the mutation was a simple substitution. You can use pileup/'tview' to visualize the alignment at this position.
Thanks. I am trying to get a consensus sequence from the raw pileup based on the nucleotide distribution at each position (not based on quality), and these strings were causing errors in my counts.
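A per-position count of this kind could be sketched in Python as follows. This is only an illustration under assumptions: it treats . and , as the reference base from column 3, folds lower case (reverse strand) into upper case, and drops indel strings, read-start markers (^ plus one mapping-quality character) and read-end markers ($) before counting. The helper names clean_bases, base_counts and consensus_base are hypothetical, and ties are not handled specially.

```python
import re
from collections import Counter

# Non-base tokens in column 9: read start (^X), read end ($), indel header (+N / -N).
_TOKEN = re.compile(r"\^.|\$|[+-](\d+)")


def clean_bases(bases: str) -> str:
    """Keep one character per read; drop indel strings and ^X / $ markers."""
    out = []
    i = 0
    while i < len(bases):
        m = _TOKEN.match(bases, i)
        if m:
            # For an indel, also skip the N sequence characters after the digits.
            i = m.end() + (int(m.group(1)) if m.group(1) else 0)
        else:
            out.append(bases[i])
            i += 1
    return "".join(out)


def base_counts(ref: str, bases: str) -> Counter:
    """Count nucleotides at one position, ignoring strand and quality."""
    counts = Counter()
    for ch in clean_bases(bases):
        # '.' and ',' mean "same as the reference" on either strand.
        counts[ref.upper() if ch in ".," else ch.upper()] += 1
    return counts


def consensus_base(ref: str, bases: str) -> str:
    """Most frequent base at the position, or 'N' if no reads cover it."""
    counts = base_counts(ref, bases)
    return counts.most_common(1)[0][0] if counts else "N"


if __name__ == "__main__":
    # Columns 3 and 9 of the 2L 650 line above.
    print(base_counts("t", "..,.,,+4caaa,+4caaa"))  # Counter({'T': 7})
```

For a whole file, columns 3 and 9 of each non-indel line (column 3 is not *) would be passed to consensus_base, and the total of the counts can be checked against the depth in column 8.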