Is it possible to use STAR for extracting info on presence or absence of polyA tail sequences for novel transcripts predicted by Cufflinks?
STAR is just a spliced aligner. It produces SAM/BAM files. If you want to see whether certain aligned reads in those files contain polyA stretches, you will have to isolate those reads by region (BEDTools) and grep them.
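As a rough sketch of that region-then-grep idea, the Python below scans SAM text (for example, the output of `samtools view` on the STAR BAM) and reports reads that start inside a region and contain a long run of A or T. The function name `reads_with_polya`, the 10-base threshold and the region arguments are illustrative assumptions, not part of STAR or BEDTools. T runs are checked as well because SAM stores the read sequence in reference orientation, so a tail on a minus-strand transcript shows up as T's.

```python
import re


def reads_with_polya(sam_lines, chrom, start, end, min_run=10):
    """Return names of reads whose alignment starts in chrom:start-end
    (1-based, inclusive) and whose sequence has an A or T run >= min_run.

    Hypothetical helper for illustration; thresholds are assumptions.
    """
    run = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    hits = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 0 QNAME, 2 RNAME, 3 POS (1-based), 9 SEQ
        name, rname, pos, seq = fields[0], fields[2], int(fields[3]), fields[9]
        if rname == chrom and start <= pos <= end and run.search(seq):
            hits.append(name)
    return hits
```

This only looks at where each read starts and at the raw sequence, so genomic A-rich stretches will also be reported; it is a first-pass filter rather than evidence of a tail.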
No, STAR is an aligner; it can't do that (it's used upstream of tools like Cufflinks, after all). Further, since the polyA tail is typically added post-transcriptionally, polyA sequences will often not map to begin with. You might be able to predict which transcripts are polyadenylated (there are binding motifs for some of the polyadenylation-related proteins), though I don't know how accurate that is. You're probably better off just doing an experiment to answer this if you really want reliable results.
I actually thought to use STAR instead of Tophat prior to Cufflinks, as was suggested here, because from the STAR paper (referenced in that post) it appears that it detects polyA as mismatches, in order to trim them, and then remaps without the polyA. I was hoping that there might be an option to utilize the trimmed polyA data for the novel transcripts predicted by Cufflinks. Something like a script that would take the reads in which polyA was detected for trimming purposes, and then check whether the 3' ends of the predicted novel transcripts align to those reads. I looked into polyA motifs for prediction, but would rather make use of the sequencing data if I could find a way to make it work. Thank you.
because this tool appears not
If I understood correctly: STAR will trim/mask the polyA due to mismatches, but the other part of the read, which mapped to the 3' end of the transcript, will make it into the resulting SAM/BAM file. Then BEDTools would allow extracting the entire read (untrimmed/unmasked) from the SAM/BAM file, and grep would allow checking for the presence of polyA stretches of a certain length within the reads? I have almost 2k novel transcripts predicted by Cufflinks (class code 'u') with ORFs predicted by TransDecoder, and was hoping that there might be a script that could do this analysis automatically.
The seqclean module of PASA2 appears to identify polyA sites using polyA-tail sequences identified by RNAseq. However, PASA2 requires Trinity instead of Cufflinks, and this would create issues in combining both approaches into a single paper (as there would be differences between their outputs, including the properties and number of predicted novel transcripts). I wonder if it is possible to adapt the seqclean module to STAR/Cufflinks?
I checked a few of the 'u' code transcripts in the UCSC Genome Browser with all available tracks, and at least some had nothing annotated in the regions they mapped to. I purified a specific neuronal subtype that is not abundant, and had to pool RNA from multiple preps to get enough for RNAseq. If these transcripts are unique to this neuronal subtype, others who did tissue RNAseq may have discarded them as noise, since these cells would represent a very small percentage of a tissue sample. So I believe many of these novel transcripts could be real, and that is why I would like to find a high-throughput way to detect polyA sequences. Thank you.
The plan you present in the first paragraph is doable with just a few lines, maybe even one line.
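A slightly longer sketch of that plan, for automating it over many transcripts: it assumes STAR soft-clipped the unmappable tail (CIGAR `S` operations), so the clipped bases are still present in the SEQ column, and it counts reads whose clipped end sits near a transcript's 3' end and is mostly A (plus strand) or T (minus strand). The names `clipped_tail`, `ref_end` and `polya_support`, and the window, length and purity thresholds, are hypothetical; the 3' coordinate and strand would come from the Cufflinks GTF.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_tail(cigar, seq, strand):
    """Soft-clipped bases on the side where a polyA tail would appear.

    SEQ is in reference orientation, so the tail of a '+' transcript is a
    right-hand clip and the tail of a '-' transcript is a left-hand clip.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return ""
    length, op = ops[-1] if strand == "+" else ops[0]
    if op != "S":
        return ""
    return seq[-length:] if strand == "+" else seq[:length]


def ref_end(pos, cigar):
    """1-based inclusive reference coordinate of the last aligned base."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    return pos + span - 1


def polya_support(sam_lines, chrom, three_prime_end, strand,
                  window=50, min_len=6, min_frac=0.8):
    """Count reads with an A-rich (or T-rich) soft clip near the 3' end.

    All thresholds are illustrative assumptions, not tuned values.
    """
    base = "A" if strand == "+" else "T"
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        f = line.rstrip("\n").split("\t")
        rname, pos, cigar, seq = f[2], int(f[3]), f[5], f[9]
        if rname != chrom or cigar == "*":
            continue
        # Coordinate where the aligned part meets the clipped tail.
        edge = ref_end(pos, cigar) if strand == "+" else pos
        if abs(edge - three_prime_end) > window:
            continue
        tail = clipped_tail(cigar, seq, strand)
        if len(tail) >= min_len and tail.count(base) / len(tail) >= min_frac:
            count += 1
    return count
```

Looping `polya_support` over the 3' end of each 'u' transcript would give a per-transcript count of supporting reads. The sketch ignores read pairing and library strandedness, and a transcript ending next to a genomic A-rich stretch could still need manual checking.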
Thank you very much!