Hi! I have run Salmon for RNA-sequencing of cord blood samples. In the file quant.genes.sf I saw that some ENSGids have duplicates and for some of them, the counts differ. For example: For sample x
Column 1 is the ENSGids:
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
Column 2 is the associated counts: 74 154 86 10 143 46 60 53 0 0 895 5 5 7 0 616
So, I wonder if I should add a command in my original code to avoid ENSGid duplicates. And also, for further analysis, I cannot keep all the duplicates. I need one ENSGids that represent the corresponding gene, I cannot have duplicates. But, since the duplicates have different counts, how I choose which " counts" (of the ENSGid that have duplicates) to keep? I hope my question is clear.
Best, Francesca
The code I used to get gene counts with Salmon:
#!/bin/bash
# Shortcuts for Salmon paths
SALMON=/.conda/envs/salmon_env/bin/salmon
SALMON_INDEX=/Private/Update_2024/Salmon/salmon_index_v46
RNA_SEQ_FOLDER=/Raw_Data_Archive/Sequencing/Rna_Seq/Raw_Content/fastq
ANNOTATION_FILE=/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf
SALMON_OUTPUT_DIR=/Private/Update_2024/Salmon/Counts
LOG_DIR=/Private/Update_2024/Salmon/Condor
# Iterate through each sample directory in RNA_SEQ_FOLDER
for SAMPLE_DIR in "$RNA_SEQ_FOLDER"/*/; do
# Extract the sample ID (folder name without the trailing slash)
SAMPLE_ID=$(basename "$SAMPLE_DIR")
# Find the fastq files for the current sample
R1_FILE=$(ls "$SAMPLE_DIR"/*_R1_001.fastq.gz 2>/dev/null)
R2_FILE=$(ls "$SAMPLE_DIR"/*_R2_001.fastq.gz 2>/dev/null)
# Skip if R1 or R2 file is missing
if [[ -z "$R1_FILE" || -z "$R2_FILE" ]]; then
echo "Skipping $SAMPLE_ID: R1 or R2 file missing."
continue
fi
# Construct Salmon command
SALMON_COMMAND="quant -i $SALMON_INDEX -l A -1 $R1_FILE -2 $R2_FILE -p $cores -g $ANNOTATION_FILE -o ${SALMON_OUTPUT_DIR}/salmon_${SAMPLE_ID} --seqBias --gcBias --validateMappings"
# Run the Salmon command with csubmit.sh
echo "Running Salmon quantification for sample $SAMPLE_ID..."
csubmit.sh -g -b "$SALMON" -a "$SALMON_COMMAND" -m "$memory" -c "$cores" -i salmon_${SAMPLE_ID} -p "$LOG_DIR"
# Check if the command was successful
if [ $? -eq 0 ]; then
echo "Salmon quantification for sample $SAMPLE_ID completed successfully."
else
echo "Salmon quantification failed for sample $SAMPLE_ID!" >&2
exit 1
fi
done
How did you make your
salmon
index? Using the transcript file for v.46 from GENCODE?HI! I downloaded the transcript sequence in FASTA format from https://www.gencodegenes.org/human/release_46.html I got the gencode.v46.transcripts.fa.gz file. I run gunzip to decompress the file. Then I run the comman in bash:
When I was running the script I encountered an error:
It seemed that the Salmon index was incompatible with the current version of Salmon I was using (v1.10.3). RapMap was used in older versions of Salmon for building indices, but recent versions use Selective Alignment instead. So I rebuilt the index, in bash, using:
I used the "new" salmon_index_46 for my analysis.
I also noticed that in the result file called quant.genes.sf for each sample, I have in the first column the ENSTids, in the second column the ENSGids and in the last column (column 11) the NumReads. The duplicates are in the ENSGids column since different transcripts (ENSTids) have associated the same gene (ENSGids). I would like to analyse my RNA-seq for gene counts, not for transcripts. It seems that the duplicates are caused by the transcript reads still present in my result file quant.genes.sf. The duplicates are in the ENSGids column since different transcripts (ENSTids) are associated with the same gene (ENSGids). For example
Is there a command I can use in my code to get only gene counts? And should I use another FASTA file for the human genome v46?
Thanks Francesca
Take a look at: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html
Thanks for the input. I checked the documentation and tested the command tximport for one sample. I have got one message: transcripts missing from tx2gene: 135208. My Salmon quant.sf file had 253181 obs. (rows) of 5 variables (columns). The tx2gene file had 118684 obs. of 2 variables. The result file 999_gene_counts.csv had 62640 elements (column 1 ENSGids and column 2 gene counts). I wonder if I should consider the missing transcripts since they were not found in my gft file. Should I investigate that? I attached my R code: