I downloaded mapped SAM/BAM files of modENCODE CAGE-seq data.
Looking at the SAM files, I noticed an inconsistency in how the first base (the transcription start site, TSS) is called. When the first base is a mismatch, either an "N" or an inserted "G", the TSS is shifted by one base. Below is one example where the TSS is correct and another where it is shifted by one base because of a first-base mismatch.
Correct mapping
TSS is 1450200
chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:4:4359:14455#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT caffffcaffffcffffcfdfff_efd PQ:i:16
chr3R 1450200 1450227 HWUSI-EAS1720_0021_FC63A8AAAXX:2:6:8037:15660#0/1 0 + 0 27M * 0 0 CTTTCCGTGCGGTTCGTAAAAATGACT hhhhhhhhhhfhhhhfeffhhgfdehg PQ:i:19
Incorrect TSS due to mismatch on first base
TSS is 1450199 instead of 1450200
Here, either an "N" is called at the first base or a "G" has been added by the CAGE protocol. Either way, the TSS is off by one nucleotide.
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:69:12174:14031#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC Geeedefadffffdfffdffffadfef PQ:i:1
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:84:2284:6722#0/1 0 + 0 27M * 0 0 NCTTTCCGTGCGGTTCGTAAAAATGAC F]b``bffcfcggcggfd__febbbBB PQ:i:0
chr3R 1450199 1450226 HWUSI-EAS1720_0021_FC63A8AAAXX:2:100:6796:15301#0/1 0 + 0 27M * 0 0 GCTTTCCGTGCGGTTCGTAAAAATGAC Qfffcfffdffffbfffdccadd^Wb` PQ:i:0
How can I correct these in my SAM file? Basically, if the first base is a mismatch, the TSS should be corrected. In this case, if the "N" or "G" is clipped, the TSS should be 1450200.
I looked at the CIGAR strings, but they are "27M" for all of these reads. Any help is appreciated. Thank you!
It sounds like the simplest approach would be to re-map using local alignments.
Could you please elaborate? What do you mean by local alignment? How is that different from mapping with bwa/bowtie2? I am thinking of remapping with bowtie2, but won't the first-base mismatches show up again?
Local alignment: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment
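To illustrate why local alignment helps: in bowtie2's `--local` mode a mismatching first base can be soft-clipped, so the CIGAR would read something like 1S26M instead of 27M, and the reported position then refers to the first aligned base rather than the clipped one. Here is a minimal Python sketch of reading the 5' end from a position and CIGAR string. The function names are hypothetical, the 1S26M record is an assumed outcome of remapping (not taken from the files above), and coordinates are used as given in the thread.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '1S26M' into [(1, 'S'), (26, 'M')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def leading_soft_clip(cigar, reverse=False):
    """Number of soft-clipped bases at the read's 5' end (hard clips ignored)."""
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    if not ops:
        return 0
    n, op = ops[-1] if reverse else ops[0]
    return n if op == "S" else 0


def five_prime_end(pos, cigar, reverse=False):
    """Reference coordinate of the first aligned base at the read's 5' end.

    The SAM position is the leftmost aligned base and already excludes
    soft-clipped bases, so forward-strand reads need no adjustment.
    For reverse-strand reads the 5' end is the rightmost aligned base.
    """
    if not reverse:
        return pos
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + ref_len - 1


# End-to-end alignment as in the question: the extra base is part of the match.
print(five_prime_end(1450199, "27M"))
# Assumed local alignment of the same read: first base clipped, position moves.
print(five_prime_end(1450200, "1S26M"), leading_soft_clip("1S26M"))
```

Whether a given read actually gets clipped depends on the aligner's scoring settings, so it is worth checking a few of the affected reads after remapping.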
What is the reference nucleotide at position 1450199? To me, it looks like a proper alignment with all matches. What makes you think the first position is a mismatch with N or G?
Maybe you can add correction by looking at the distribution of 5' ends and inferring the most prevalent position or a narrow region? Adding info about N positions to that correction is another approach.
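A rough sketch of the distribution idea in Python: count the 5' ends and move each read to the most frequent position within a small window. The function name, the window size of 1, and the toy counts are all made up for illustration. Note that this only works where correctly placed reads outnumber the shifted ones; among the five reads shown in the question the shifted reads are the majority, so this approach alone would pick the wrong position there.

```python
from collections import Counter


def snap_to_dominant(starts, window=1):
    """Move each 5' end to the most frequent position within +/- window.

    Ties are broken in favour of the position closest to the original one.
    """
    counts = Counter(starts)
    corrected = []
    for s in starts:
        candidates = range(s - window, s + window + 1)
        best = max(candidates, key=lambda p: (counts[p], -abs(p - s)))
        corrected.append(best)
    return corrected


# Toy input (hypothetical): five reads at 1450200, two shifted to 1450199.
toy_starts = [1450200] * 5 + [1450199] * 2
print(Counter(snap_to_dominant(toy_starts)))
```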
Also, your sequence around the TSS (CTTTCCGG) looks strange to me. What are the species and gene you are studying?
The reference nucleotide at 1450199 is also C. The insertion of a "G" at the first base is a known bias of CAGE (though it does not occur in all reads), and it is filtered out to get the correct TSS. It seems modENCODE did not correct this.
The TSS sequence (CTTTCCGG) is perfectly valid. The first nucleotide (TSS) of ribosomal protein genes is anchored by a "C" and surrounded by pyrimidines, which has been known for a long time. This example is a ribosomal protein gene from Drosophila melanogaster.
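For what it's worth, the correction I have in mind could be sketched like this in Python: if the first read base is an "N" or "G" that does not match the reference base at the mapped start, treat it as an extra base and move the TSS one position downstream. The function name is hypothetical, the reference base has to be looked up separately, and the sketch only handles plus-strand reads. A first-base "G" that matches the reference is left alone, since this simple check cannot tell whether it was added by the protocol.

```python
def corrected_tss(start, seq, ref_base, extra_bases=("N", "G")):
    """Return the TSS for a plus-strand read, skipping an unmatched first base.

    start    -- mapped start position of the read
    seq      -- read sequence
    ref_base -- reference nucleotide at `start` (looked up elsewhere)
    """
    first = seq[0].upper()
    if first in extra_bases and first != ref_base.upper():
        return start + 1  # first base is extra: TSS is the next position
    return start


# Reads from the question; the reference base at 1450199 is C.
print(corrected_tss(1450199, "NCTTTCCGTGCGGTTCGTAAAAATGAC", "C"))
print(corrected_tss(1450199, "GCTTTCCGTGCGGTTCGTAAAAATGAC", "C"))
print(corrected_tss(1450200, "CTTTCCGTGCGGTTCGTAAAAATGACT", "C"))
```

All three calls give 1450200, which is the TSS I would expect for these reads.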