Hi! I have run Salmon for RNA-sequencing of cord blood samples. In the file quant.genes.sf I saw that some ENSGids have duplicates and for some of them, the counts differ. For example: For sample x
Column 1 is the ENSGids:
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
ENSG00000183878.16
Column 2 is the associated counts: 74 154 86 10 143 46 60 53 0 0 895 5 5 7 0 616
So, I wonder if I should add a command in my original code to avoid ENSGid duplicates. And also, for further analysis, I cannot keep all the duplicates. I need one ENSGids that represent the corresponding gene, I cannot have duplicates. But, since the duplicates have different counts, how I choose which " counts" (of the ENSGid that have duplicates) to keep? I hope my question is clear.
Best, Francesca
The code I used to get gene counts with Salmon:
#!/bin/bash
# Shortcuts for Salmon paths
SALMON=/.conda/envs/salmon_env/bin/salmon
SALMON_INDEX=/Private/Update_2024/Salmon/salmon_index_v46
RNA_SEQ_FOLDER=/Raw_Data_Archive/Sequencing/Rna_Seq/Raw_Content/fastq
ANNOTATION_FILE=/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf
SALMON_OUTPUT_DIR=/Private/Update_2024/Salmon/Counts
LOG_DIR=/Private/Update_2024/Salmon/Condor
# Iterate through each sample directory in RNA_SEQ_FOLDER
for SAMPLE_DIR in "$RNA_SEQ_FOLDER"/*/; do
# Extract the sample ID (folder name without the trailing slash)
SAMPLE_ID=$(basename "$SAMPLE_DIR")
# Find the fastq files for the current sample
R1_FILE=$(ls "$SAMPLE_DIR"/*_R1_001.fastq.gz 2>/dev/null)
R2_FILE=$(ls "$SAMPLE_DIR"/*_R2_001.fastq.gz 2>/dev/null)
# Skip if R1 or R2 file is missing
if [[ -z "$R1_FILE" || -z "$R2_FILE" ]]; then
echo "Skipping $SAMPLE_ID: R1 or R2 file missing."
continue
fi
# Construct Salmon command
SALMON_COMMAND="quant -i $SALMON_INDEX -l A -1 $R1_FILE -2 $R2_FILE -p $cores -g $ANNOTATION_FILE -o ${SALMON_OUTPUT_DIR}/salmon_${SAMPLE_ID} --seqBias --gcBias --validateMappings"
# Run the Salmon command with csubmit.sh
echo "Running Salmon quantification for sample $SAMPLE_ID..."
csubmit.sh -g -b "$SALMON" -a "$SALMON_COMMAND" -m "$memory" -c "$cores" -i salmon_${SAMPLE_ID} -p "$LOG_DIR"
# Check if the command was successful
if [ $? -eq 0 ]; then
echo "Salmon quantification for sample $SAMPLE_ID completed successfully."
else
echo "Salmon quantification failed for sample $SAMPLE_ID!" >&2
exit 1
fi
done
How did you make your
salmon
index? Using the transcript file for v.46 from GENCODE?