Hello all, I have a funny issue with BAM files where the 4th character in a lot of my reads shows up as a ".". I've never seen this before, but it's wreaking havoc with my scripts. Does anyone know what causes this or what it means? Below is an example:
@DHT4KXP1:3:1101:2235:2028#0/1
GAA.TACTGCCAAGTCATCCGTGTCATTGCCCACACCCAGATGCGCCTGCTTCCTCTGCGCCAGAAGAAGGCCCACCTGATGGAGATCCAGGTGAACGGAG
+DHT4KXP1:3:1101:2235:2028#0/1
_a_BS\ccgggegihhgfiiighfghiihhhhhiiiihiihifhiiiiihihhihihhh[dgeeeebddd_aacc_acccccbcccccccccbbccccacc
You might want to get a Geiger counter ;). Just to be sure, is it exclusive to the fourth position of the read? What is the provenance of the data?
Some versions of the SOLiD sequencer used to put dots into the colorspace sequence whenever the quality was too low to call a color. That used to break all kinds of tools.
I was thinking this as well, except the quality at that position ("B") works out to Q = 33 assuming Sanger scaling. Odd.
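For anyone who wants to check that arithmetic, here is a minimal Python sketch that decodes the quality character at the dotted position under two common ASCII offsets. The function name `phred_score` is made up for illustration, and the Phred+64 reading is only an assumption; nothing in the thread confirms which encoding the file actually uses.

```python
def phred_score(qual_char, offset):
    """Convert one FASTQ quality character to a Phred score for a given ASCII offset."""
    return ord(qual_char) - offset

# First few quality characters of the example read (raw string because of the backslash).
qual = r"_a_BS\ccgggegihh"
fourth = qual[3]  # quality character under the "." base

sanger_q = phred_score(fourth, 33)    # Sanger / Phred+33 interpretation
illumina_q = phred_score(fourth, 64)  # assumed Phred+64 interpretation

print(fourth, sanger_q, illumina_q)
```

Under Phred+33 the "B" is Q33, while under a Phred+64 reading it would be Q2, which would fit better with a base that could not be called.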
It is not just SOLiD data; I used to see this in Illumina qseq files a few years ago (when read lengths were 75-76 bp). This was very frustrating because most tools would just die, assuming the data was improperly formatted, especially with these dots at the beginning of the sequence. My assumption was that these were just bases that could not be called, so I trimmed them.
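A rough sketch of the two workarounds, in Python: either mask each "." as "N" so downstream tools accept the read, or trim the read past the dot. It assumes plain 4-line FASTQ records; the helper names (`read_fastq`, `mask_dots`, `trim_through_leading_dot`) and the `window` parameter are hypothetical, not from any existing tool.

```python
import io

def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or blank line)
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual

def mask_dots(seq):
    """Replace uncalled '.' bases with 'N', keeping the read length unchanged."""
    return seq.replace(".", "N")

def trim_through_leading_dot(seq, qual, window=10):
    """Drop everything up to and including the last '.' in the first `window` bases.

    Sequence and quality are cut at the same position so they stay in sync.
    """
    cut = seq[:window].rfind(".") + 1  # rfind returns -1 if absent, so cut is 0
    return seq[cut:], qual[cut:]

# Tiny demo record built from the start of the example read above.
record = "@read1\nGAA.TACTG\n+\n" + r"_a_BS\ccg" + "\n"
for header, seq, plus, qual in read_fastq(io.StringIO(record)):
    masked = mask_dots(seq)
    trimmed_seq, trimmed_qual = trim_through_leading_dot(seq, qual)
    print(header, masked, trimmed_seq, trimmed_qual)
```

Masking keeps read lengths intact, which matters if anything downstream expects fixed-length reads; trimming loses the first few bases but leaves only called bases.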