Hello,
I have a question about indexing with Salmon. I saw in the pipeline on Salmon's GitHub that you can use the cDNA sequence with no alterations to create the index.
curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz
salmon index -t athal.fa.gz -i athal_index
But Salmon's documentation says there are two ways to create indices:
- The first is to compute a set of decoy sequences by mapping the annotated transcripts you wish to index against a hard-masked version of the organism’s genome. This can be done with e.g. MashMap2, and we provide some simple scripts to greatly simplify this whole process. Specifically, you can use the generateDecoyTranscriptome.sh script, whose instructions you can find in this README
- The second is to use the entire genome of the organism as the decoy sequence. This can be done by concatenating the genome to the end of the transcriptome you want to index and populating the decoys.txt file with the chromosome names. Detailed instructions on how to prepare this type of decoy sequence is available here. This scheme provides a more comprehensive set of decoys, but, obviously, requires considerably more memory to build the index.
So, can I just use the cDNA file from Ensembl as shown above, or do I have to build the index the way the documentation describes?
Thank you!
Thank you! I've made the decoy before, but I'm trying to redo it since I believe I did it incorrectly the first time. The page Salmon references for creating the decoy suggests using GENCODE, though they don't have what I am looking for. I usually use Ensembl. I am working with Bos taurus, so I decided to try NCBI RefSeq. For some reason the README file won't load after downloading. I believe I would use the files "GCF_002263795.2_ARS-UCD1.3_genomic.fna.gz" and "GCF_002263795.2_ARS-UCD1.3_rna.fna.gz" for this, right? Or would I use "GCF_002263795.2_ARS-UCD1.3_rna_from_genomic.fna.gz"? Apologies for the simple question, I am not used to their annotation.
GENCODE is in reference to human data. GENCODE only covers human and mouse.
You can use the Arabidopsis genome from Ensembl.
Not sure why you are using an old release of Ensembl. The current release is 56 (but if you want to use release 28, then get the genome file from the same release). The following are links to the files for the current release (as of this writing).
cDNA: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
genome: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
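With a cDNA file and a genome file from the same release, the whole-genome decoy approach quoted from the Salmon documentation (concatenate the genome after the transcriptome and list the genome sequence names in decoys.txt) could be sketched roughly as below. The function names make_decoys, make_gentrome and build_decoy_index, and the output names decoys.txt and gentrome.fa.gz, are made up for illustration; only the salmon index options -t, -d and -i are Salmon's own. Treat it as a minimal sketch, not a checked pipeline.

```shell
# Write one decoy name per line: the first word of every FASTA header
# in the gzipped genome, with the leading ">" removed.
make_decoys() {
    local genome_gz="$1" decoys_txt="$2"
    gunzip -c "$genome_gz" | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > "$decoys_txt"
}

# Concatenate transcriptome first, genome second. Gzip files can be
# joined directly with cat; the result is still a valid gzip stream.
make_gentrome() {
    local cdna_gz="$1" genome_gz="$2" out_gz="$3"
    cat "$cdna_gz" "$genome_gz" > "$out_gz"
}

# Hypothetical wrapper: prepare both inputs, then build a decoy-aware index.
# Requires salmon on the PATH; output file names are assumptions.
build_decoy_index() {
    local cdna_gz="$1" genome_gz="$2" index_dir="$3"
    make_decoys "$genome_gz" decoys.txt
    make_gentrome "$cdna_gz" "$genome_gz" gentrome.fa.gz
    salmon index -t gentrome.fa.gz -d decoys.txt -i "$index_dir"
}
```

For the two Arabidopsis files linked above, this would be called as `build_decoy_index Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz athal_index` after downloading them. The same steps should apply to any organism as long as the transcriptome and genome come from the same source and release.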
Thank you, yes, I am using the newer version of Ensembl. The Arabidopsis code above is the example Salmon gave; I am using the Bos taurus files. I have read that one should not use toplevel and should instead combine all the primary chromosome files into one. Is toplevel okay for this, then?
Which organism are you actually working on? You have mentioned three so far: Arabidopsis, human and now cow.
As for using the toplevel file, it is equivalent to the primary file when the following condition is met:
I am working on cow, Bos taurus. The Arabidopsis code was an example from Salmon's GitHub, and the indexing example they gave used files from GENCODE, which is why I was hoping to use a different reference source. Thank you, this was very helpful!