Dear all,
I want to count the reads from RNA-seq using the Salmon tool. I was told that the first step is to create a FASTA file with the transcript sequences (from the reference genome FASTA plus the GTF annotation file), like this:
(salmon) [@ws7910 RNAseq]$ gffread -w transcripts.fa -g Homo_sapiens.GRCh38.cdna.all.fa Homo_sapiens.GRCh37.75.gtf
I get this result
No fasta index found for Homo_sapiens.GRCh38.cdna.all.fa.
Rebuilding, please wait..
Fasta index rebuilt.
Error creating file: annotation/transcript/transcripts.fa
It creates a file named Homo_sapiens.GRCh38.cdna.all.fa.fai
On the Salmon website it seems that this step is not necessary, so can you please tell me what I am doing wrong, and what the best way to start using Salmon is?
I'd prefer to do the counting with BAM files already aligned using TopHat2.
Thank you
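A note on the command above: Homo_sapiens.GRCh38.cdna.all.fa already contains transcript (cDNA) sequences, whereas gffread -g expects a genome FASTA together with a GTF from the same assembly, so the gffread step is probably unnecessary here. A minimal sketch of building the index directly from that file follows. The index directory name GRCh38_cdna_idx is a hypothetical choice, and the small run helper is only an illustration that prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
set -eu

# Transcriptome FASTA named in the post; it already holds cDNA sequences.
TRANSCRIPTS="Homo_sapiens.GRCh38.cdna.all.fa"
# Hypothetical name for the index directory that salmon will create.
INDEX_DIR="GRCh38_cdna_idx"

# Illustrative helper: print the command by default,
# execute it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# -t: transcript FASTA to index, -i: output index directory.
# -k is not given, so salmon falls back to its default k-mer size.
run salmon index -t "$TRANSCRIPTS" -i "$INDEX_DIR"
```

Running it with DRY_RUN=0 requires salmon to be on the PATH, for example inside the (salmon) environment shown in the prompt.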
Hi Rob, thank you for your answer. Why did you omit the k-mer size (-k) for the index generation?
thanks
Hi Morris,
The -k argument has a default value (31) that is used if -k is not provided. If you wish to use the default, you don't have to pass that option explicitly. If you want to use another value of k, then you can pass that to the index command.
As a rule of thumb, in my experience the only really crucial thing is that the k-mer length is no longer than the read length. The k-mer length only has a minor influence on the mapping rate; see "Salmon Quantification for RNA-seq Read Pairs with Different Lengths" to get a rough idea. Given a high-quality library without contamination, the most important factor is the read length, which eventually comes down to proper experimental design. When indexing GENCODE files, also remember to pass the --gencode option to salmon.
I created the salmon index as you described and it works. I should say up front that I'm new to this, and I'm lost on how to write the command to quantify the reads using gencode_v29_idx.
I have 4 BAM files to quantify and I don't know how to write the command. First, can I run multiple BAM files at once (two treated and two control, for instance), or do I have to run them one by one? I thought the structure of the command should be like this, but I don't know how to insert gencode_v29_idx
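On the multiple-samples question: each sample gets its own salmon quant run, so several samples are usually handled with a loop. Keep in mind that an index such as gencode_v29_idx is used in salmon's mapping-based mode, which takes FASTQ reads. The alignment-based mode (-a) instead expects BAM files aligned to the transcriptome, which genome alignments from TopHat2 are not. A minimal sketch, assuming paired-end FASTQ files laid out as reads/<sample>_1.fastq.gz and reads/<sample>_2.fastq.gz (the sample names, paths and thread count are all hypothetical), with an illustrative run helper that prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
set -eu

# Index directory mentioned in the post.
INDEX_DIR="gencode_v29_idx"
# Hypothetical sample names: two controls and two treated.
SAMPLES="control_1 control_2 treated_1 treated_2"

# Illustrative helper: print the command by default,
# execute it only when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for s in $SAMPLES; do
    # -i: index directory, -l A: let salmon infer the library type,
    # -1/-2: the two mate files of a paired-end sample,
    # -p: number of threads, -o: per-sample output directory.
    run salmon quant -i "$INDEX_DIR" -l A \
        -1 "reads/${s}_1.fastq.gz" -2 "reads/${s}_2.fastq.gz" \
        -p 4 -o "quants/${s}"
done
```

Each run writes its own output directory containing a quant.sf file, and those per-sample results can then be combined downstream for the treated versus control comparison.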
thank you Rob
It's probably more convenient to use FASTQ files, because then no additional biases are introduced.