I have perfromed de novo transcript assembly byusing Trinity, and now i have a sam file from Trinity, so how to indentify unannotated exon-exon junction from this sam file thanks
I have perfromed de novo transcript assembly byusing Trinity, and now i have a sam file from Trinity, so how to indentify unannotated exon-exon junction from this sam file thanks
I am presuming that because you are looking for "unannotated" exon-exon junctions, there exists for your organism annotated junctions and that this implies the existence of a reference genome.
My understanding is that Trinity outputs a FASTA file, not a sam file - a sam file contains an alignment to a reference genome, the purpose of Trinity to to assembly transcripts de-novo without a reference genome. Thus, if you want to annotate exon-exon junctions from your transcript assembly, you need to align your assembled transcripts to a genome using a gapped aligner. For many years the most common choice for aligning cDNA sequences (which is effectively what your de-novo assembled transcripts are) was exonerate, although I suspect there are now more modern alternatives available.
This will produce a GTF file, showing the location of the exons on the the genome. You can then use the end of each exon and the start of the next to identify the exon-exon junctions, and compare to the junctions from an annotation.
"We extracted novel junctions from Trinity (i.e., reads with [0-9]+M[0-9]+N[0-9]+M pattern in the CIGAR string of SAM output), and re-mapped all RNA-seq reads to these junctions, rescuing 1,402,444 reads in three replicates."This quote comes from the paper An Ancient Transcription Factor Initiates the Burst of piRNA Production during Early Meiosis in Mouse Testes, so i do not know what should ido
That sentence you quote immediate follows:
The assembled RNA sequences were aligned to the mouse genome (mm9) with BLAT 18 (Kent, 2002), and the alignments with more than 95% of sequence length mapped and fewer than 1% mismatches retained.
So the authors have taken the output from Trinity and aligned it to the genome using BLAT. Presumably BLAT can output in SAM format, and that is what is being reffered to as the "sam output" in your quote.
As for how to extract junctions from a BAM file: I don't know of any mainstream tools that do this, although googling "get junctions from BAM file" does produce hits you could investigate. Normally this sort of thing would be done with a small custom script. That would:
Personally I favor something like pysam
for doing this in python, although I'm sure others around here would do it with plain text parsing.
The alternative would be to choose a different output format form BLAT - one that could more easily be processed.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
and which means "We extract the new junction from Trinity (i.e. read [0-9]+M[0-9]+N[0-9]+M pattern in the CIGAR string output by SAM)", and how should i code
You can find unannotated exon-exon junctions from this kind of split reads using 'N' operation in CIGAR string. For example, a CIGAR string of '25M1000N25M' indicates that 25 bp of the read matches to a region (25M) and then there is a 1,000 bp skip (1000N) and then the last 25 bp of the read match the sequence (25M). It could imply the read mapped to the exon-exon junction.
so, how to perform by using 'N' operation in CIGAR string? which app can do it ?