Hello.
In an attempt to filter out DNA contamination from an RNA-seq experiment, I am looking for a way to extract ONLY aligned reads that contain Annotated splices.
While of course this will eliminate much of the data, the idea is that the only aligned reads that we can be absolutely sure originated from mRNA rather than DNA will contain an annotated splice.
I understand that STAR produces the splice output 'SJ.out.tab'- while this is somewhat useful, I need a way to find the read ID for each entry here.
In short, I would like the read ID for each read that contributes to the 'Number of splices: Annotated (sjdb) | ' field within 'Log.final.out'.
Thanks in advance!!
Cheers, Yale
Thank you very much!
Please let me know if I am interpreting this incorrectly- but I believe this will provide a list of all annotated and unannotated splices.
I am really only interested in the annotated ones- and I'm not sure how to identify those that STAR counts as annotated.
Thank you very much for your time!
-Yale
I'm afraid, as far as I am aware there is no easy way to do this for a bam file that has both in.
You could write a custom script to do this, for example using
pysam
.pysam
that allows you to return tuples of reference co-ordinates for alinged parts of the read.get_aligned_blocks()
or something - you would need to go through this. You be looking for cases where one block ended on an intron start position, and the next block ended on an intron end position.An alternative would be to run STAR telling it only to output spliced reads if they match a known junction. I can't remember what the switch is called, but it does exist. Then run it through the filter in my first answer to isolate the spliced reads.
Great, thanks for the direction.
Have a nice day!
I don't get, why you are filtering reads that start with @ (therefore are header) here? Could you please clarify that?
You need to include header lines in the output so that it can be converted back to BAM. if you just filtered lines that had an N in the cigar string, you would have no header, and wouldn't be able to convert back to valid bam.