Low alignment with Trinity but high alignment in SAM files
1
0
Entering edit mode
6.3 years ago
Moneeb Bajwa ▴ 10

Hello,

I have performed alignment with Trinity and got only around 2% alignment. My SAM files and trimmed reads seemed to be fine; the SAM files had high alignment rate based on what was returned to me with samtools. Is this possible or am I doing something wrong with Trinity?

I appreciate the help

alignment genome RNA-Seq next-gen Trinity • 3.2k views
ADD COMMENT
1
Entering edit mode

Could you please share the commands used for both alignment as well for getting stat from SAM.

ADD REPLY
0
Entering edit mode

I just used gsnap for the SAM files by aligning trimmed mouse reads to the mouse genome (GRCm38.primary_assembly.genome.fa from GENCODE). I used gmap to build the database first and built intron index with the gencode.vM18.primary_assembly.annotation.gff3 file. I used this for the SAM stat:

#Count the total number of reads in the alignment
total="$(samtools view -c $samFile)"
#Count the aligned reads in the alignment
mapped="$(samtools view -F4 -c $samFile)"

For Trinity I did this (they were single-end, strand-specific forward reads):

nice -n19 $trinityPath/util/align_and_estimate_abundance.pl \
--transcripts Trinity-GG.fasta --aln_method bowtie2 --est_method eXpress \
--trinity_mode --output_dir $outPath$sample --seqType fq --SS_lib_type F \
--single $Reads --thread_count 1

And for the Trinity-GG.fasta I did this:

Trinity --genome_guided_bam MergedBams.bam \
--genome_guided_max_intron 10000 \
--max_memory 10G --CPU 4
ADD REPLY
1
Entering edit mode

What is your library prep kit? Is it really stranded? Is it from Illumina? In general, stranded Illumina kits use dUTP method, with Trinity, you should use --SS_lib_type R for single-end reads, and --SS_lib_type RF for paired-end reads.

Are the reads you are trying to align the same used to assemble the transcriptome?

Are you working with Mus musculus? Do you really need to assemble a transcriptome?

ADD REPLY
0
Entering edit mode

Here are the reads' link which gives all the info needed: https://www.ncbi.nlm.nih.gov/bioproject/PRJDB5631. I need to do differential expression analysis so I need to assemble the transcriptome right? Also the merged BAM file was only like 10G while the SAM files in total were around 150G; is that normal?

ADD REPLY
0
Entering edit mode

Yea actually it is still not working after rerunning the sorting and merging; the BAM files are quite small still and I believe that is messing everything up

ADD REPLY
0
Entering edit mode

I see a similar issue here in this post: Sam To Bam - Loss Of Data Or Just Great Compression?. Should I just accept the Trinity results and go ahead with differential expression analysis? These also are cancer derived organoids if that makes a difference.

ADD REPLY
2
Entering edit mode
6.3 years ago
h.mon 35k

First of all I will answer the question you asked on a comment, but it is the most important question here:

I need to do differential expression analysis so I need to assemble the transcriptome right?

You do not need to assemble the transcriptome to perform gene / transcript differential expression analysis. There is a very, very good genome assembly for mouse. Also, there is a very, very good annotation for the mouse genome - and consequently, there is a very good transcriptome for the mouse. You do not need to assemble, it is better to map / count using the existing reference genome / transcrptome. In addition, the reads from the project you linked are single-end and only 36bp long, they will not provide a good transcriptome assembly.

You can perform gene-level differential expression by first mapping the reads to the genome with STAR to get gene expression counts (as I have told you before here), and then using DESeq2 or edgeR to perform the differential expression statistical analysis.

You can perform transcript-level differential expression quantifying transcript counts with Salmon or kallisto, and performing the statistical analysis with sleuth.

Now, for your original question:

I did map one of the files to the mouse genome with STAR, obtaining gene counts:

Sox17   9368    81      9287
Mrpl15  4473    30      4443
Lypla1  3122    20      3102
Tcea1   397     6       391

One can use these counts to infer strandedness of the reads. I know the protocol is stranded, so the first column of counts is automatically excluded, as it represents unstranded counts (check STAR manual). It is easy to see then that the true counts are in the fourth column, which the STAR manual describes as:

column 4:  counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

This means the library prep kit produces what we call "reverse" stranded reads. The Trinity manual says the correct parameter to indicate this library type is --SS_lib_type R, which is the opposite of what you are using in your call to the align_and_estimate_abundance.pl script. So maybe if you use align_and_estimate_abundance.pl --SS_lib_type R you will get good mapping rate.

But remember, you do not need to assemble the transcriptome.

ADD COMMENT
0
Entering edit mode

I thank you greatly! But for the reverse strandedness, how could that be when the link clearly says forward? There is a big green arrow saying so: https://www.ncbi.nlm.nih.gov/sra/DRX083319[accn].

ADD REPLY
0
Entering edit mode

I also did try --SS_lib_type R, but it did not change anything for Trinity. I will just go ahead with the STAR option you mentioned. But I still am not sure why the STAR method gives reverse when the link says forward clearly. Maybe they are talking about two different things?

ADD REPLY
0
Entering edit mode

I had some more questions regarding that which I just made into a separate post here: Connect gene names and GO terms after STAR for DESeq2

ADD REPLY

Login before adding your answer.

Traffic: 1857 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6