Understanding salmon codes
1
1
Entering edit mode
5.8 years ago
Morris_Chair ▴ 370

Hello everyone, I want to count reads from RNAseq using salmon tool, probably because I'm not a bioinformatician (I'm trying hard to be..) but reading the guideline of salmon I'm not able to make it working. I was taught first to create an index which i did it by downloading the transcriptome and generate an index by this command line.

salmon index -t gencode.v29.transcripts.fa.gz -i gencode_v29_idx

Ideally I shoud quantify with gencode_v29_idx as salmon index but in the gencode_v29_idx folder there are banches of files which is not clear to me which file I shuould add to where.. On the manual it says to run this command line

salmon quant -i transcripts_index -l <LIBTYPE> -1 reads1.fq -2 reads2.fq -o˓→transcripts_quant

plus studying on the web I endeded up with this command line, I used the file indexing.log but I don't know if this is correct

(salmon) [@ws7910 RNAseq]$ salmon quant -i /gencode_v29_idx/indexing.log -l --libType A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1

It needs to be adjusted but I don't know what

Thank you for help

RNA-Seq • 6.4k views
ADD COMMENT
0
Entering edit mode

Not sure but I don't think that salmon can index gzipped files. Might be anyway best to first gunzip the file and only then index it.

Edited

ADD REPLY
2
Entering edit mode

Actually, salmon can accept gzipped FASTA files for reference index creation as well as gzipped FASTA/Q files for quantification. I believe that Morris was able to build the index, but is running into problems while quantifying.

ADD REPLY
0
Entering edit mode

OK, point taken. indeed it does.

ADD REPLY
3
Entering edit mode
5.8 years ago
Rob 6.9k

Hi Morris,

You should add the index directory as the parameter to -i, so, try:

(salmon) [@ws7910 RNAseq]$ salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings -o salm_treated1

I've also removed a spurious / before gencode_v29_idx and also -l and --libType are redundant, so I fixed that. Let me know if this works for you.

ADD COMMENT
0
Entering edit mode

Hi Rob, thank you, it worked :) What surprises me is the value of the mapping rate which I guess is very low.

[2019-03-11 23:07:14.345] [jointLog] [warning] Only 125906 fragments were mapped, but the number of burn-in fragments was set to 5000000.
The effective lengths have been computed using the observed mappings.
[2019-03-11 23:07:14.345] [jointLog] [info] **Mapping rate = 18.5059%**

grazie

ADD REPLY
1
Entering edit mode

That can have plenty of reason. Insufficient read length, gDNA, rRNA or bacterial/whatever contaminations come to mind. What is the read length? Try blasting some of the unassigned reads (check the documentation on how to get them, it is well-explained in there).

ADD REPLY
0
Entering edit mode

Hi ATpoint, The read length is =101

(salmon) [@ws7910 my_fastq]$ cat Treated_1_m1.fastq | head
@HWI-ST571:93:C00B8ACXX:1:1301:8908:31103/1
CTGGACTCCACACTCTCCTGGGTTTCACCTTTGTAGCAGGATCCCTGCAGACCAGGCCCATGACAAACACCGTCTCCAGCGGGCAGAGCAAAGGAAGGGCA

The flag to write out the name of unassigned reads is this --writeUnmappedNames

but now there is another problem ... When I run the salmon code like

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --       writeUnmappedNames -o salmon_unmappeed

I have this

**activate does not accept more than one argument:**
['salmon', 'quant', '-i', 'gencode_v29_idx', '-l', 'A', '-1', '/Treated_1_m1.fastq', '-2',.......

I couldn't find this specific problematic online to see how to fix it, do you guys know how to solve it?

thank you

ADD REPLY
1
Entering edit mode

Hi Morris,

It looks like you have a big space between the -- and the flag name that should follow immediately after. That is, your command line should be:

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --writeUnmappedNames -o salmon_unmappeed

not

salmon quant -i gencode_v29_idx -l A -1 Treated_1_m1.fastq -2 Treated_1_m2.fastq -p 4 --validateMappings --       writeUnmappedNames -o salmon_unmappeed

The spaces matter in how a program parses arguments (not just salmon, but all programs). The -- sequence by itself is a special separator and indicates the end of a series of arguments for the program.

ADD REPLY
1
Entering edit mode

Hi Rob, I found the problem, basically today I modified the environmental variable with alias salmon='source activate salmon' in order to access to salmon just typing in the terminal salmon, but there was a conflict with conda ..,this problem solved :)

ADD REPLY
0
Entering edit mode

Hi ATpoint, The sequences unmapped that I found where human (those one that I checked), .. I will try to change parameters and see if I can improve the percentage of mapped reads.

(salmon) [@ws7910 my_fastq]$ grep -A 3 "C00B8ACXX:1:2205:7234:25095/1" Treated_1_m1.fastq
@HWI-ST571:93:C00B8ACXX:1:2205:7234:25095/1
 CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG
 +
CCCFFFDFHHGGHJJJJIIIJIIJJJJIIIJBDD?BBBDBDDDDD@ACDD>@BBD@BDDD?CCDDDDD8;@DCDB?ACDDCBBDDDDDDDDDDDDDBCBBD
(salmon) [p.panelli@ws7910 my_fastq]$ grep -A 3 "C00B8ACXX:1:2201:7717:123882/1" Treated_1_m1.fastq
@HWI-ST571:93:C00B8ACXX:1:2201:7717:123882/1
AGATGCCTCATCCTGGGTTCACCCCAGGGACCATCGGGGTCTGCCTTTCCCAGCATTCTCTCTCCCTTGAGGATTCTCCATTCCATCATTTTGGTGGCGTT +CCCFFFFFHHHHHJJJJIJJJJIJJJJJJJJJJIIJIJJ@GHJJJJJJIJJHJJJIHHHHHHHFFFFFCEEEEDDCDDDDDEEFEEDDDDDEC=AB?ADBB

Thank you for your thought

ADD REPLY
1
Entering edit mode

Hi Morris,

Can I ask how you verified that these are "human"? I took the sequences you gave above, and searched against the gencode transcriptome using Bowtie2 as the aligner. Both of these reads came back as completely unmapped. I did get a hit with blast (https://www.ncbi.nlm.nih.gov/nucleotide/NG_009253.1?report=genbank&log$=nuclalign&blast_rank=1&RID=8G9GH598015), but I wonder if this is related to the presence or absence of this gene in the transcriptome against which you are aligning (I realize that I suggested gencode, and that's because I usually find it to have among the best annotations, so not sure why this would be missing).

Edit: With the help of some folks on Biostars chat, it seems that these reads are intronic, which explains why they are not mapping to the spliced transcriptome.

ADD REPLY
0
Entering edit mode

Hello Rob, I did a nucleotide blast with this this tool

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome,

selecting any species and I obtained this result (for the first one )

Human DNA sequence from clone RP11-169K16and I got this

The extract is RNA so it's weird to be intronic.

Thank you

ADD REPLY
1
Entering edit mode

If you take the sequence above (CTATCCACAGAGACACACCGGACTGAGGGGGGCCACCTGCCTCCTTCCACACGGGGAAGACCAGGGGGTGGCAACAGATGGGGCCCTGCCTCTGCCCTCAG)and blat it at UCSC you can see that the only hit is intronic. Black bar with name test is the sequence above.

blat

Perhaps this molecule was captured before splicing?

ADD REPLY
0
Entering edit mode

Yes, it might be that it was captured before splicing, I don't think that the vast majority of the unmapped reads are intronic also because when I used featureCounts I got 75% of aligned reads.

Thank you genomax

ADD REPLY
0
Entering edit mode

That's a very surprising difference if the annotations are really the same. Would you be able to share the reads from one of your samples? We'd be interested to take a look.

ADD REPLY
0
Entering edit mode

sure, I can share the files let me know how

ADD REPLY
0
Entering edit mode

How large are they? You could use a file-sharing website, or if it is easier, you should be able to upload the data to this gdrive link.

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 1621 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6