how to indentify unannotated exon-exon junction from a bam/sam file
1
0
Entering edit mode
3.5 years ago
17318598206 ▴ 20

I have perfromed de novo transcript assembly byusing Trinity, and now i have a sam file from Trinity, so how to indentify unannotated exon-exon junction from this sam file thanks

RNA-seq junction exon-exon • 1.5k views
ADD COMMENT
0
Entering edit mode

and which means "We extract the new junction from Trinity (i.e. read [0-9]+M[0-9]+N[0-9]+M pattern in the CIGAR string output by SAM)", and how should i code

ADD REPLY
0
Entering edit mode

You can find unannotated exon-exon junctions from this kind of split reads using 'N' operation in CIGAR string. For example, a CIGAR string of '25M1000N25M' indicates that 25 bp of the read matches to a region (25M) and then there is a 1,000 bp skip (1000N) and then the last 25 bp of the read match the sequence (25M). It could imply the read mapped to the exon-exon junction.

ADD REPLY
0
Entering edit mode

so, how to perform by using 'N' operation in CIGAR string? which app can do it ?

ADD REPLY
0
Entering edit mode
3.5 years ago

I am presuming that because you are looking for "unannotated" exon-exon junctions, there exists for your organism annotated junctions and that this implies the existence of a reference genome.

My understanding is that Trinity outputs a FASTA file, not a sam file - a sam file contains an alignment to a reference genome, the purpose of Trinity to to assembly transcripts de-novo without a reference genome. Thus, if you want to annotate exon-exon junctions from your transcript assembly, you need to align your assembled transcripts to a genome using a gapped aligner. For many years the most common choice for aligning cDNA sequences (which is effectively what your de-novo assembled transcripts are) was exonerate, although I suspect there are now more modern alternatives available.

This will produce a GTF file, showing the location of the exons on the the genome. You can then use the end of each exon and the start of the next to identify the exon-exon junctions, and compare to the junctions from an annotation.

ADD COMMENT
0
Entering edit mode

"We extracted novel junctions from Trinity (i.e., reads with [0-9]+M[0-9]+N[0-9]+M pattern in the CIGAR string of SAM output), and re-mapped all RNA-seq reads to these junctions, rescuing 1,402,444 reads in three replicates."This quote comes from the paper An Ancient Transcription Factor Initiates the Burst of piRNA Production during Early Meiosis in Mouse Testes, so i do not know what should ido

ADD REPLY
0
Entering edit mode

That sentence you quote immediate follows:

The assembled RNA sequences were aligned to the mouse genome (mm9) with BLAT 18 (Kent, 2002), and the alignments with more than 95% of sequence length mapped and fewer than 1% mismatches retained.

So the authors have taken the output from Trinity and aligned it to the genome using BLAT. Presumably BLAT can output in SAM format, and that is what is being reffered to as the "sam output" in your quote.

As for how to extract junctions from a BAM file: I don't know of any mainstream tools that do this, although googling "get junctions from BAM file" does produce hits you could investigate. Normally this sort of thing would be done with a small custom script. That would:

  1. Filter the reads to keep only those with splices.
  2. Extract the alignment start.
  3. Process the CIGAR string sequentially, adding M operations to the start location until an N is encountered. Set the current location as the start and the this start + the length of the N operation as the end.

Personally I favor something like pysam for doing this in python, although I'm sure others around here would do it with plain text parsing.

The alternative would be to choose a different output format form BLAT - one that could more easily be processed.

ADD REPLY

Login before adding your answer.

Traffic: 2831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6