transcripts.fa should contain the sequences of the transcripts you wish to quantify, in FASTA format. Note, however, that Salmon expects your reads to be aligned to the transcriptome (that is, to the sequences in the transcripts.fa file), not to the genome. If you look in your BAM file, the contigs should be the names of transcripts, not the names of chromosomes. See this note from the Salmon documentation:
Note
Genomic vs. Transcriptomic alignments
Salmon expects that the alignment files provided are with respect to
the transcripts given in the corresponding fasta file. That is, Salmon
expects that the reads have been aligned directly to the transcriptome
(like RSEM, eXpress, etc.) rather than to the genome (as does, e.g.
Cufflinks). If you have reads that have already been aligned to the
genome, there are currently 3 options for converting them for use with
Salmon. First, you could convert the SAM/BAM file to a FAST{A/Q} file
and then use the lightweight-alignment-based mode of Salmon described
below. Second, given the converted FAST{A/Q} file, you could re-align
these converted reads directly to the transcripts with your favorite
aligner and run Salmon in alignment-based mode as described above.
Third, you could use a tool like sam-xlate to try and convert the
genome-coordinate BAM files directly into transcript coordinates. This
avoids the necessity of having to re-map the reads. However, we have
very limited experience with this tool so far.
Firstly, thanks for your response. I am still having problems understanding this. 😅 I now have BAM files from STAR, and I have the GTF and the genome sequence, primary assembly (GRCh38) file from GENCODE. Lastly, I have my samples' FASTQ files. As I understood from your reply, my sample's transcript file (because I want to quantify them :D)
You're going to need to do some things:
1.) Convert your .bam back to fastq. You'll want to sort your BAM by query name, not by coordinate, for this step. This gives you a subset of your original FASTQ files that only includes the reads that mapped to the genome.
2.) Convert your GTF file to a .fasta file. There are some examples around the web (gffread was incredibly simple). This will be your transcripts.fa file.
Thank you, but I didn't understand why I am converting my BAM file to FASTQ, since Salmon accepts the BAM file. (I am doing alignment-based quant.)
This is because you likely mapped your original fastq files to your genome reference and not your transcriptome reference. My understanding of the alignment-based quant in Salmon is that your .bam file must have been aligned to the same .fasta file you are giving it (in this case, your transcripts.fa).
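For reference, the two conversion steps could look roughly like the sketch below. It is not from the original posts: the function names and output file names are made up for illustration, paired-end reads are assumed, and samtools and gffread need to be installed.

```shell
# Sketch only: file names are placeholders and paired-end reads are assumed.

# Turn a genome-aligned BAM back into paired FASTQ files.
bam_to_fastq() {
    local bam=$1 prefix=$2
    # Sort by read name (not coordinate) so that mates sit next to each other.
    samtools sort -n -o "${prefix}.byname.bam" "$bam"
    # -1/-2 receive mate 1 and mate 2; -0 and -s discard other and singleton
    # reads; -n leaves the read names unchanged.
    samtools fastq -n \
        -1 "${prefix}_R1.fastq.gz" -2 "${prefix}_R2.fastq.gz" \
        -0 /dev/null -s /dev/null \
        "${prefix}.byname.bam"
}

# Extract spliced transcript sequences from the genome using the annotation.
gtf_to_transcripts() {
    local gtf=$1 genome=$2 out=$3
    gffread -w "$out" -g "$genome" "$gtf"
}

# Example calls:
#   bam_to_fastq Aligned.out.bam sample1
#   gtf_to_transcripts annotation.gtf genome.fa transcripts.fa
```

Note that samtools fastq also writes unmapped reads if the BAM contains them, so the "only mapped reads" subset depends on how the BAM was produced.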
Hi, I'm about to use Salmon in the alignment-based mode, and I have a couple of questions.
1.) I used minimap to create alignment.sam files. Then I used samtools to convert the .sam files to .bam files. Do I need to sort the .bam files by reference coordinates before giving them to Salmon with the following command?
./bin/salmon quant -t transcripts.fa -l <LIBTYPE> -a aln.bam -o salmon_quant
Or can I just give Salmon my .bam files without sorting?
2.) Does Salmon do any normalizing? In alignment-based mode, can I give Salmon all of my .bam files at once? I have 3 replicates of 6 samples, so 18 .bam files.
I will be doing differential expression analysis downstream using edgeR, stageR and DEXSeq.
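These two questions are not answered directly in the thread. As far as I know, Salmon's alignment mode wants all alignments of a read to be adjacent in the file, so the unsorted or name-grouped BAM from the aligner is the usual input rather than a coordinate-sorted one, and each sample is quantified in its own run. A rough sketch of the per-sample loop, where the function names and the salmon_quant/ output layout are assumptions of mine:

```shell
# Derive an output directory name from a BAM path (hypothetical naming scheme).
sample_name() {
    basename "$1" .bam
}

# Run salmon once per BAM; each run writes its own quant.sf.
quant_all() {
    local transcripts=$1 libtype=$2
    shift 2
    local bam
    for bam in "$@"; do
        salmon quant -t "$transcripts" -l "$libtype" \
            -a "$bam" -o "salmon_quant/$(sample_name "$bam")"
    done
}

# Example call (replace LIBTYPE with your library type):
#   quant_all transcripts.fa LIBTYPE alignments/*.bam
```

Each quant.sf reports TPM and estimated read counts for that sample only; normalization across the 18 samples would then be left to the downstream tools such as edgeR.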
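Usually, STAR is used to align your reads to the genome. But the alignment mode of Salmon doesn't take reads aligned to the genome; it takes reads aligned to the transcriptome. Consider the following situation:
If you just run STAR in the usual way, you index a fasta file of the genome (called something like genome.fa or hg38.fa, with chromosome sequences such as chr1), align against that, and get out a BAM file that tells you that read1 maps to base 20 of chr1.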
However, what salmon expects is for you to index a fasta file of the transcriptome sequence (called something like transcripts.fa or ensembl105.fa), that starts something like:
>transcript1
GGGACAGCTTACTCAGACTAC
align against that, and get a BAM file that says read1 maps to base 1 of transcript1. This is in line with what tools like RSEM expect.
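A quick way to see which situation you are in is to list the reference names in the BAM header. The helper below is only a sketch (sq_names is a made-up name); it reads SAM header text on stdin and prints the SN field of every @SQ line.

```shell
# Print the reference sequence names (SN fields) from SAM header text on stdin.
sq_names() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Example (needs samtools):
#   samtools view -H aln.bam | sq_names | head
```

If the names that come back look like chr1, chr2 and so on, the BAM is in genome coordinates; if they look like transcript IDs, it is in transcript coordinates.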
So, if you want to use salmon in alignment mode, you have two options, both of which involve repeating the alignment:
1.) Generate transcripts.fa from the GTF file and genome.fa using gffread, index it with STAR --runMode genomeGenerate, and align your reads against this index.
2.) STAR has a special mode in which you align to the genome but output BAMs in transcript coordinates. You activate this mode by passing --quantMode TranscriptomeSAM to STAR. You will still need transcripts.fa in order to run salmon.
Alternatively, if you don't have access to the original fastq files, you can turn your BAM alignments back into fastq using samtools with the right options, and then use that as input to salmon in fastq mode.
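The second option could be sketched as follows. This is an illustration under assumptions rather than a tested pipeline: the function names, directory layout and output prefix are made up, reads are assumed to be paired-end and gzipped, and transcripts.fa is assumed to have been built from the same GTF that STAR is given.

```shell
# Build a genome index that knows about the annotation.
build_star_index() {
    local genome=$1 gtf=$2 index_dir=$3
    mkdir -p "$index_dir"
    STAR --runMode genomeGenerate --genomeDir "$index_dir" \
        --genomeFastaFiles "$genome" --sjdbGTFfile "$gtf"
}

# Align to the genome, but also write alignments in transcript coordinates.
align_to_transcriptome() {
    local index_dir=$1 r1=$2 r2=$3 prefix=$4
    # Produces ${prefix}Aligned.toTranscriptome.out.bam next to the genome BAM.
    STAR --genomeDir "$index_dir" --readFilesIn "$r1" "$r2" \
        --readFilesCommand zcat --quantMode TranscriptomeSAM \
        --outSAMtype BAM Unsorted --outFileNamePrefix "$prefix"
}

# Quantify the transcript-coordinate BAM in salmon's alignment mode.
quantify() {
    local transcripts=$1 libtype=$2 prefix=$3
    salmon quant -t "$transcripts" -l "$libtype" \
        -a "${prefix}Aligned.toTranscriptome.out.bam" -o "${prefix}salmon_quant"
}

# Example calls (replace LIBTYPE with your library type):
#   build_star_index genome.fa annotation.gtf star_index
#   align_to_transcriptome star_index R1.fastq.gz R2.fastq.gz sample1_
#   quantify transcripts.fa LIBTYPE sample1_
```

For the first option the same alignment and quantification steps would apply, except that the index is built from transcripts.fa itself and --quantMode is not needed.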
I aligned again with --quantMode TranscriptomeSAM, so I now have toTranscriptome.out.bam files and the transcript.fa file. I tried to run salmon, but I am getting the error below:
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000463753.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309002.1|COL8A1-205|COL8A1|2964|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000483969.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309004.1|COL8A1-207|COL8A1|832|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000474648.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309005.1|COL8A1-206|COL8A1|841|processed_transcript| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000452013.5|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309003.2|COL8A1-203|COL8A1|659|protein_coding| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000261037.7|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000309001.1|COL8A1-201|COL8A1|5705|protein_coding| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000652472.1|ENSG00000144810.16|OTTHUMG00000148669.5|OTTHUMT00000504360.2|COL8A1-209|COL8A1|5515|protein_coding| appears in the reference but did not appear in the BAM
[2022-03-22 13:52:19.430] [jointLog] [warning] Transcript ENST00000273342.8|ENSG00000144810.16|OTTHUMG00000148669.5|-|COL8A1-202|COL8A1|3029|protein_coding| appears in the reference but did not appear in the BAM
(This keeps going in the .err file.)
This is a warning, not an error. It just says that you've got transcripts that get no reads. It's only a problem if every transcript is in the list and your output tables have zeros for everything. In that case it probably means that STAR is using different names for things than salmon is.
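To check whether the names really differ, you could compare the FASTA record names with the reference names in the BAM header. The sketch below uses made-up helper names. The names in the warnings above look like full pipe-delimited FASTA headers, so trim_pipe_headers shows one possible workaround, assuming the BAM only carries the part before the first "|" (I have not confirmed that for this dataset).

```shell
# Print FASTA record names: the text after ">" up to the first whitespace.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Keep only the part of each FASTA header before the first "|".
trim_pipe_headers() {
    sed '/^>/ s/|.*//' "$1"
}

# Count the names present in both lists (one name per line in each file).
shared_names() {
    sort -u "$1" > "$1.sorted"
    sort -u "$2" > "$2.sorted"
    comm -12 "$1.sorted" "$2.sorted" | wc -l | tr -d ' '
}

# Example (needs samtools; assumes SN is the first tag on each @SQ line):
#   fasta_names transcripts.fa > fasta.txt
#   samtools view -H aln.bam | awk -F'\t' '$1 == "@SQ" { print substr($2, 4) }' > bam.txt
#   shared_names fasta.txt bam.txt
```

A count of zero would point to a naming mismatch rather than to transcripts that simply received no reads.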
I was able to remove the warnings (which eventually ended in an error) by creating the transcriptome file from the genome and its annotations (GTF):
gffread -w output.transcriptome.fa -g genome.fa genome.gtf
I am trying to create a transcriptome file via gffread, but it looks like I can only pass in a GFF annotation file, not a GTF.
Seeing "gffread -w output.transcriptome.fa -g genome.fa genome.gtf" above, I tried the following:
"gffread -w GCF_000146045.2_R64_genomic_transcriptome_gtf.fa -g GCF_000146045.2_R64_genomic.fna GCF_000146045.2_R64_genomic.gtf"
but it resulted in the following error: "Error: no valid ID found for GFF record"
I looked for a specific flag that is needed for GTF, but it seems the only way to create a transcriptome.fa file using gffread is from a GFF file and genome.fa. Is this correct?
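gffread does read GTF as well as GFF, as the earlier command in this thread shows. One possible cause of this error, which I have not verified for this particular file, is that some NCBI GTF files carry records with an empty transcript_id "" attribute. The sketch below (helper names are made up) counts such records and writes a filtered copy to try with gffread.

```shell
# Count GTF records whose transcript_id attribute is empty.
count_empty_transcript_ids() {
    awk -F'\t' '!/^#/ && $9 ~ /transcript_id ""/ { n++ } END { print n + 0 }' "$1"
}

# Print the GTF without those records (comment lines are kept).
drop_empty_transcript_ids() {
    awk -F'\t' '/^#/ || $9 !~ /transcript_id ""/' "$1"
}

# Example (needs gffread):
#   count_empty_transcript_ids genome.gtf
#   drop_empty_transcript_ids genome.gtf > genome.filtered.gtf
#   gffread -w output.transcriptome.fa -g genome.fa genome.filtered.gtf
```

If the count is zero, the cause is something else, and using the GFF3 version of the same annotation may be the simpler route.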