Hi, I recently completed a SAGE-based thesis. Many spurious tagging events happen in SAGE. This is a bit taken from my paper.
Correctly matching DGE tags to the 3’ untranslated regions (UTRs) of the ESTs from which they originated presents a challenge. One problem was that 20 bp sequences can be found throughout the assembly, not just in the region from which they were initially created, but also in coding sequences and other gene regions. Tags not at the ends of 3’ UTRs were considered artifacts of chance. These spurious tags needed to be removed from the analysis. A worst-case example of spurious tagging is presented in Figure 5. Since actual tags are theoretically found only in the 3’ UTR of the mRNA sequences, we solved this problem by removing non-3’ UTR sequences from the GHA. We wrote a script that isolates the 3’ UTR of each sequence in the GHA based on the ORF-Predictor algorithm (Min 2008) and then removes the ORFs and 5’ UTRs. The script then matches the 20 bp DGE tags to the 3’ UTRs. If more than one tag sequence was found in a given UTR, the program output only the tag closest to the end of the sequence, which is consistent with the theoretical mode of origin of the tags. No identical tags were found within the 3’ UTRs, giving us confidence that the ultimate tag was bona fide. The error rate for tagging is set by the very low combined probability that a specific spurious match and these structural constraints co-occur. The algorithm ran for 48 hours on a laptop CPU for each DGE library. It effectively combined the two different technologies to create an analysis of the expression levels present in the hop transcriptome.
The figure doesn't copy... it looked something like this. I can't fix the line breaks; this is supposed to be one continuous sequence.
ATGGCTACGTAGCTCGATCGTACGTC GATCATCGTAGCTAGCTGACGCGG ATTCGTATGCGATCTGCATCGATCGTAGTCGTCATCGTAC GTATCGATCGGATCTGTACTATCGTGCATG GATCGTACGTAGCTTAGGTCTAGTGCT TAGCTGATGCTGATGTA
Figure 5. Example of spurious tagging in a 3' UTR. The 20 bp sequences marked in italics are examples of spurious tags (Spurious). The sequence in bold represents an actual DGE tag.
The script can be found on this site.
I extracted the UTRs with a standard ORF finder, then ran this Python script to isolate the tags that were "real". This is the only way I have found to do this.
Feel free to contact me for help. Also, if you use this, PLEASE cite.
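In case the link is unavailable, here is a minimal sketch of the tag-selection step described in the excerpt. It is not the original script. It assumes the 3' UTRs have already been extracted into a dict of id to sequence; the function names (`most_3prime_tag`, `assign_tags`) and the toy 4 bp tags are made up for illustration (the real DGE tags are 20 bp).

```python
def most_3prime_tag(utr, tags):
    """Return (tag, start) for the tag match ending closest to the 3' end of utr.

    Returns None if no tag occurs in the UTR.
    """
    best = None
    best_end = -1
    for tag in tags:
        # rfind gives the last occurrence, i.e. the one nearest the 3' end
        start = utr.rfind(tag)
        if start == -1:
            continue
        end = start + len(tag)
        if end > best_end:
            best = (tag, start)
            best_end = end
    return best


def assign_tags(utrs, tags):
    """Map each UTR id to its 3'-most tag; UTRs without any tag are skipped."""
    assigned = {}
    for utr_id, utr in utrs.items():
        hit = most_3prime_tag(utr.upper(), tags)
        if hit is not None:
            assigned[utr_id] = hit
    return assigned


# Toy data (hypothetical): short tags keep the example readable.
utrs = {"est1": "AAACCCGGGTTTACGT", "est2": "TTTTTTTT"}
tags = ["CCCG", "ACGT", "GGGG"]
print(assign_tags(utrs, tags))
```

Here `CCCG` and `ACGT` both occur in `est1`, but only `ACGT` is kept because it ends closer to the 3' end; `est2` has no tag and is dropped. Scanning every tag against every UTR like this is slow for full libraries, so an index of 20-mers would be a sensible replacement for real data.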
I don't think alternative transcripts are enough to solve the problem. According to an Ensembl.org genbank file I'm looking at, one of the genes with 3 different tags has only two different transcripts.
Could it be sequencing errors?
I suppose... although I shouldn't like to think that that kind of error would account for such a large percentage of mapped tags. I'm going to do a test and see exactly what percentage of tags come from multiply-mapped genes...
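One way that test could be sketched, assuming "multiply-mapped genes" means genes with more than one distinct tag mapped to them, and assuming the mapping is available as a dict from tag sequence to gene id. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def percent_tags_on_multi_tag_genes(tag_to_gene):
    """Percentage of mapped tags whose gene has more than one distinct tag."""
    tags_by_gene = defaultdict(set)
    for tag, gene in tag_to_gene.items():
        tags_by_gene[gene].add(tag)

    total = len(tag_to_gene)
    if total == 0:
        return 0.0

    # Count tags that belong to genes carrying two or more distinct tags
    multi = sum(len(t) for t in tags_by_gene.values() if len(t) > 1)
    return 100.0 * multi / total


# Toy mapping (hypothetical): g1 has two tags, g2 and g3 have one each.
mapping = {"t1": "g1", "t2": "g1", "t3": "g2", "t4": "g3"}
print(percent_tags_on_multi_tag_genes(mapping))
```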