I'm relatively new to this and have just downloaded the raw reads for the SRA samples in BioProject PRJNA369152 (I'm reanalyzing their data purely as a learning exercise).
I've only ever analyzed human and mouse datasets before. This one is from the tardigrade *Hypsibius dujardini*.
For the first time, I used salmon instead of an aligner such as STAR.
I wasn't sure what the reference transcriptome was. I went ahead and used the one labeled 3.1.5.cds:
# Download the H. dujardini reference transcriptome
echo "Downloading reference";
wget http://download.tardigrades.org/v1/sequence/Hypsibius_dujardini_nHd.3.1.5.cds_translationid.fa.gz -O h_dujardini_3.1.5_cds.fa.gz;
# Build an index on reference transcriptome
echo "Building index";
salmon index -t h_dujardini_3.1.5_cds.fa.gz -i dujardini_index;
I then followed the salmon instructions for quantification and got output that looks pretty close to what I'm used to:
| Name | Length | EffectiveLength | TPM | NumReads |
|-----------------|--------|-----------------|-----------|----------|
| BV898_00001.p01 | 273 | 113.366 | 0.237119 | 1.000 |
| BV898_00002.p01 | 591 | 423.906 | 0.063413 | 1.000 |
| BV898_00003.p01 | 234 | 80.218 | 0.670204 | 2.000 |
| BV898_00004.p01 | 1254 | 1086.842 | 15.015658 | 607.102 |
| BV898_00004.p02 | 1233 | 1065.842 | 5.344195 | 211.898 |
But I don't really know what those names are. Judging from the genome assembly, they look like locus IDs. Did I accidentally use the genome instead of the transcriptome as my reference?
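One check I can think of for the genome-versus-transcriptome question is to count the records in the reference FASTA and look at the first few headers: a genome assembly should contain relatively few, long scaffolds, whereas a CDS file should contain one short record per predicted coding sequence. A minimal sketch, assuming the file downloaded above (`fasta_summary` is just a helper name I made up):

```shell
# Hypothetical helper: summarize a gzipped FASTA file.
# Usage: fasta_summary h_dujardini_3.1.5_cds.fa.gz
fasta_summary() {
    # Number of records = number of header lines (lines starting with ">")
    printf 'records: %s\n' "$(gzip -cd "$1" | grep -c '^>')"
    # First three headers, to compare against the Name column from salmon
    gzip -cd "$1" | grep '^>' | head -n 3
}
```

If the headers match the `BV898_*` names in the salmon output, then the names at least come straight from this file rather than from something salmon generated.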
How should I convert the `BV898_*` IDs into more familiar gene symbols or HomoloGene IDs? Or did I use the wrong reference dataset?
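If the `.p01`/`.p02` suffixes turn out to be isoforms of the same locus (that is my assumption, I haven't confirmed it), then a first step before any ID conversion might be to collapse the table to one row per locus. A rough sketch, assuming a tab-separated `quant.sf` with the columns shown above; `collapse_to_locus` is a made-up helper name:

```shell
# Hypothetical helper: sum TPM and NumReads per locus.
# Assumes names look like <locus>.pNN, where .pNN marks an isoform.
# Usage: collapse_to_locus quant.sf
collapse_to_locus() {
    printf 'Locus\tTPM\tNumReads\n'
    awk -F'\t' '
        NR == 1 { next }                  # skip the header row
        {
            locus = $1
            sub(/\.p[0-9]+$/, "", locus)  # strip the assumed isoform suffix
            tpm[locus]   += $4            # column 4 = TPM
            reads[locus] += $5            # column 5 = NumReads
        }
        END {
            for (l in tpm) printf "%s\t%.6f\t%.3f\n", l, tpm[l], reads[l]
        }
    ' "$1" | sort                         # awk iteration order is unspecified
}
```

Called as `collapse_to_locus quant.sf`, this should merge the two `BV898_00004` rows above into a single row with TPM 20.359853 and NumReads 819.000.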
Thanks for your time! :)