Question

How to extract aligned reads that contain splice junctions from STAR

2

Entering edit mode

4.8 years ago

ysantos101 ▴ 20

Hello.

In an attempt to filter out DNA contamination from an RNA-seq experiment, I am looking for a way to extract ONLY aligned reads that contain Annotated splices.

While of course this will eliminate much of the data, the idea is that the only aligned reads that we can be absolutely sure originated from mRNA rather than DNA will contain an annotated splice.

I understand that STAR produces the splice output 'SJ.out.tab'- while this is somewhat useful, I need a way to find the read ID for each entry here.

In short, I would like the read ID for each read that contributes to the 'Number of splices: Annotated (sjdb) | ' field within 'Log.final.out'.

Thanks in advance!!

Cheers, Yale

RNA-Seq STAR Splice Site Splice Junctions • 4.6k views

ADD COMMENT • link updated 23 months ago by i.sudbery 20k • written 4.8 years ago by ysantos101 ▴ 20

score 2 · Answer 1 · 2020-02-23

2

Entering edit mode

4.8 years ago

i.sudbery 20k

You can find spliced reads by looking a the CIGAR string - reads with 'N' in them are spliced.

So you can do something like:

samtools view -h my_reads.bam | awk '$0 ~ /^@/ || $5 ~ /N/' | samtools view -b > filtered.bam

The first view statement converts the bam to sam and includes the header (-h). The awk statement fileters reads that start with @ (therefore are header), or contain N in the 5th column (the CIGAR string), then the second view statement converts back to bam.

ADD COMMENT • link 4.8 years ago by i.sudbery 20k

0

Entering edit mode

Thank you very much!

Please let me know if I am interpreting this incorrectly- but I believe this will provide a list of all annotated and unannotated splices.

I am really only interested in the annotated ones- and I'm not sure how to identify those that STAR counts as annotated.

Thank you very much for your time!

-Yale

ADD REPLY • link 4.8 years ago by ysantos101 ▴ 20

0

Entering edit mode

I'm afraid, as far as I am aware there is no easy way to do this for a bam file that has both in.

You could write a custom script to do this, for example using pysam.

Start by reading in the the SJ.tab.out file and go through one junction at a time.
Fetch all reads that overlap a junction, and then for each of these reads, check if it contains that splice junction
There is a function in pysam that allows you to return tuples of reference co-ordinates for alinged parts of the read .get_aligned_blocks() or something - you would need to go through this. You be looking for cases where one block ended on an intron start position, and the next block ended on an intron end position.
Once you've found such reads output them (and their mates)

An alternative would be to run STAR telling it only to output spliced reads if they match a known junction. I can't remember what the switch is called, but it does exist. Then run it through the filter in my first answer to isolate the spliced reads.

ADD REPLY • link 4.8 years ago by i.sudbery 20k

0

Entering edit mode

Great, thanks for the direction.

Have a nice day!

ADD REPLY • link 4.8 years ago by ysantos101 ▴ 20

0

Entering edit mode

I don't get, why you are filtering reads that start with @ (therefore are header) here? Could you please clarify that?

ADD REPLY • link 23 months ago by magdamaslon • 0

0

Entering edit mode

You need to include header lines in the output so that it can be converted back to BAM. if you just filtered lines that had an N in the cigar string, you would have no header, and wouldn't be able to convert back to valid bam.

ADD REPLY • link 23 months ago by i.sudbery 20k