Question

Identifying polyA tail sequences for predicted novel transcripts using STAR/Cufflinks

0

11.1 years ago

trakhtenberg ▴ 160

Is it possible to use STAR for extracting info on presence or absence of polyA tail sequences for novel transcripts predicted by Cufflinks?

RNA-Seq • 5.6k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by trakhtenberg ▴ 160

0

11.1 years ago

Devon Ryan 105k

No, STAR is an aligner; it can't do that (it's used upstream of tools like Cufflinks, after all). Further, since the poly(A) tail is typically added post-transcriptionally rather than encoded in the genome, polyA sequences will often not map to begin with. You might be able to predict which transcripts are polyadenylated (there are binding motifs for some of the polyadenylation-related proteins), though I don't know how accurate that is. You're probably better off just doing an experiment to answer this if you really want reliable results.

ADD COMMENT • link 11.1 years ago by Devon Ryan 105k

0

I actually thought of using STAR instead of TopHat prior to Cufflinks, as was suggested here, because from the STAR paper (referenced in that post) it appears that it detects polyA as mismatches, trims it, and then remaps the read without the polyA. I was hoping that there might be an option to use the trimmed polyA data for the novel transcripts predicted by Cufflinks. Something like a script that would take the reads in which polyA was detected for trimming purposes, and then check whether the 3' ends of the predicted novel transcripts align to those reads. I looked into polyA motifs for prediction, but would rather make use of the sequencing data if I can find a way to make it work. Thank you.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by trakhtenberg ▴ 160

1

STAR will soft-clip sequences, so I suppose you could use that (you'll need decently long reads, of course). You could certainly write a program to do what you want (likely using pysam and a pileup).
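As a rough illustration of the soft-clip idea, here is a minimal sketch that works on plain SAM text lines (e.g. the output of `samtools view`) rather than pysam, so it has no dependencies. The function names and the `min_len`/`min_frac` thresholds are made up for this example and would need tuning. It relies on the fact that SEQ in SAM/BAM is stored in reference orientation, so a tail from a plus-strand transcript shows up as soft-clipped A's at the right end of the read, and a tail from a minus-strand transcript as soft-clipped T's at the left end.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar, seq):
    """Return (left_clip, right_clip) sequences for one alignment."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases (H) are absent from SEQ, so ignore them.
    core = [(n, op) for n, op in ops if op != "H"]
    left = seq[:core[0][0]] if core and core[0][1] == "S" else ""
    right = seq[-core[-1][0]:] if len(core) > 1 and core[-1][1] == "S" else ""
    return left, right

def polya_clip(sam_line, min_len=10, min_frac=0.9):
    """Return '+' or '-' (inferred transcript strand) if the read carries a
    soft-clipped poly(A)/poly(T) stretch, else None.
    min_len and min_frac are arbitrary illustrative thresholds."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]
    if flag & 4 or cigar == "*":  # unmapped read
        return None
    left, right = soft_clips(cigar, seq)
    if len(right) >= min_len and right.count("A") / len(right) >= min_frac:
        return "+"
    if len(left) >= min_len and left.count("T") / len(left) >= min_frac:
        return "-"
    return None
```

In practice you would pipe `samtools view` output through this line by line and keep the reads (and their alignment end positions) for which `polya_clip` returns a strand.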

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by Devon Ryan 105k

0

thank you for the feedback!

ADD REPLY • link 11.1 years ago by trakhtenberg ▴ 160

Ram · Accepted Answer · 2014-08-10

2

11.1 years ago

Jeremy Leipzig 23k

STAR is just a spliced aligner. It produces SAM/BAM files. If you want to see whether certain aligned reads in those files contain polyA stretches, then you will have to isolate those reads by region (BEDTools) and grep them.

ADD COMMENT • link 11.1 years ago by Jeremy Leipzig 23k

0

If I understood correctly: STAR will trim/mask the polyA due to mismatches, but the other part of the read, which mapped to the 3' end of the transcript, will make it into the SAM/BAM file. Then BEDTools would allow extracting the entire read (untrimmed/unmasked) from the SAM/BAM file, and grep would allow checking for the presence of polyA stretches of a certain length within the reads? I have almost 2k novel transcripts predicted by Cufflinks (class code 'u') with ORFs predicted by TransDecoder, and was hoping that there might be a script that could do this analysis automatically.

The seqclean module of PASA2 appears to identify polyA sites using polyA-tail sequences identified by RNAseq. However, PASA2 requires Trinity instead of Cufflinks, and this would create issues in combining both approaches into a single paper (as there would be differences between their outputs, including the properties and number of predicted novel transcripts). I wonder if it is possible to adapt the seqclean module to STAR/Cufflinks?

I checked a few of the 'u' code transcripts in the UCSC Genome Browser with all available tracks, and at least some had nothing annotated in the regions they mapped to. I purified a specific neuronal subtype that is not abundant, and had to pool RNA from multiple preps to get enough for RNAseq. If these transcripts are unique to this neuronal subtype, others who did tissue RNAseq may have discarded them as noise, since these cells would represent a very small percentage of a tissue sample. So I believe many of these novel transcripts could be real, and that is why I would like to find a high-throughput way to detect polyA sequences. Thank you.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by trakhtenberg ▴ 160

2

The plan you present in the first paragraph is doable with just a few lines, maybe even one line:

bedtools intersect -abam staralignment.bam -b cufflinksregions.bed | samtools view - | cut -f10 | grep -P 'AAA+$'
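
The one-liner lists reads that end in A's but does not say which transcript each read supports. Because BAM stores SEQ in reference orientation, tails from minus-strand transcripts would show up as leading T's and be missed by `'AAA+$'`. If you want a per-transcript tally, something like the following sketch could sit downstream. It assumes you have already collected candidate polyA sites as (chrom, position, strand) tuples by some other means. The function names, the 0-based coordinate convention for the sites, and the 50 bp window are all assumptions for illustration.

```python
from collections import Counter

def three_prime_end(start, end, strand):
    """BED intervals are 0-based, half-open: the 3' base is end-1 on '+',
    start on '-'."""
    return end - 1 if strand == "+" else start

def count_support(transcripts, clip_sites, window=50):
    """Count polyA sites near each transcript's 3' end.

    transcripts: list of (chrom, start, end, name, strand) in BED coordinates
    clip_sites:  list of (chrom, pos, strand), pos being the 0-based last
                 aligned base next to the clipped tail (assumed convention)
    window:      max distance in bp; an arbitrary illustrative default
    """
    support = Counter()
    for chrom, start, end, name, strand in transcripts:
        tx_end = three_prime_end(start, end, strand)
        for c_chrom, pos, c_strand in clip_sites:
            same_locus = c_chrom == chrom and c_strand == strand
            if same_locus and abs(pos - tx_end) <= window:
                support[name] += 1
    return support
```

The nested loop is quadratic, which should be tolerable for a couple of thousand transcripts; for larger sets an interval index or a sorted sweep would be the better choice.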