Duplicate ENSG IDs in the Salmon result file quant.genes.sf
15 hours ago

Hi! I ran Salmon on RNA-seq data from cord blood samples. In the file quant.genes.sf, some ENSG IDs appear more than once, and for some of them the counts differ between the duplicate rows. For example, for sample x:

Column 1 contains the ENSG IDs (the same ID on 16 rows):
ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16

Column 2 contains the associated counts, in the same row order:
74 154 86 10 143 46 60 53 0 0 895 5 5 7 0 616
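
To see how widespread the duplication is, something like the following could be run on each sample. This is only a rough sketch: the function name `list_duplicate_genes` is made up for illustration, and it assumes quant.genes.sf is tab-separated with one header line and the gene ID in the first column.

```shell
#!/bin/bash
# Hypothetical helper (not part of Salmon): print every gene ID that occurs
# on more than one row of a quant.genes.sf file, followed by its row count.
# Assumes a tab-separated file with a single header line and the ID in column 1.
list_duplicate_genes() {
    awk -F'\t' '
        NR > 1 { n[$1]++ }                      # skip the header, count rows per ID
        END {
            for (g in n) if (n[g] > 1) print g "\t" n[g]
        }
    ' "$1" | sort
}

# Example call (the path is a placeholder):
#   list_duplicate_genes salmon_SAMPLE/quant.genes.sf
```

An empty result would mean that every gene ID in that file is unique.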

So I wonder whether I should add an option to my original command to avoid duplicate ENSG IDs. For further analysis I need exactly one row per gene, so I cannot keep the duplicates. But since the duplicate rows have different counts, how do I choose which counts to keep for an ENSG ID that has duplicates? I hope my question is clear.

Best, Francesca

The code I used to get gene counts with Salmon:

#!/bin/bash

# Shortcuts for Salmon paths
SALMON=/.conda/envs/salmon_env/bin/salmon
SALMON_INDEX=/Private/Update_2024/Salmon/salmon_index_v46
RNA_SEQ_FOLDER=/Raw_Data_Archive/Sequencing/Rna_Seq/Raw_Content/fastq
ANNOTATION_FILE=/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf
SALMON_OUTPUT_DIR=/Private/Update_2024/Salmon/Counts
LOG_DIR=/Private/Update_2024/Salmon/Condor

# Iterate through each sample directory in RNA_SEQ_FOLDER
for SAMPLE_DIR in "$RNA_SEQ_FOLDER"/*/; do
    # Extract the sample ID (folder name without the trailing slash)
    SAMPLE_ID=$(basename "$SAMPLE_DIR")

    # Find the fastq files for the current sample
    R1_FILE=$(ls "$SAMPLE_DIR"/*_R1_001.fastq.gz 2>/dev/null)
    R2_FILE=$(ls "$SAMPLE_DIR"/*_R2_001.fastq.gz 2>/dev/null)

    # Skip if R1 or R2 file is missing
    if [[ -z "$R1_FILE" || -z "$R2_FILE" ]]; then
        echo "Skipping $SAMPLE_ID: R1 or R2 file missing."
        continue
    fi

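    # Construct the Salmon command.
    # Note: $cores (and $memory, used below) are not defined in this script
    # as shown, so they must be set before it runs.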
    SALMON_COMMAND="quant -i $SALMON_INDEX -l A -1 $R1_FILE -2 $R2_FILE -p $cores -g $ANNOTATION_FILE -o ${SALMON_OUTPUT_DIR}/salmon_${SAMPLE_ID} --seqBias --gcBias --validateMappings"

    # Run the Salmon command with csubmit.sh
    echo "Running Salmon quantification for sample $SAMPLE_ID..."
    csubmit.sh -g -b "$SALMON" -a "$SALMON_COMMAND" -m "$memory" -c "$cores" -i salmon_${SAMPLE_ID} -p "$LOG_DIR"

    # Check if the command was successful
    if [ $? -eq 0 ]; then
        echo "Salmon quantification for sample $SAMPLE_ID completed successfully."
    else
        echo "Salmon quantification failed for sample $SAMPLE_ID!" >&2
        exit 1
    fi
done

How did you build your Salmon index? Did you use the GENCODE v46 transcript file?
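
One way to check this from the existing output is to count how many transcript names in a sample's quant.sf have no matching `transcript_id` in the GTF passed to `-g`. GENCODE transcript FASTA headers are pipe-delimited (`ENST...|ENSG...|...`), and Salmon only trims them to the bare transcript ID if the index was built with `--gencode`, so untrimmed names would not match the annotation. The sketch below makes some assumptions: the function name `count_unmatched_transcripts` is made up, quant.sf is assumed to be tab-separated with a header line, and the GTF is assumed to carry `transcript_id "..."` in its ninth column.

```shell
#!/bin/bash
# Hypothetical helper (not part of Salmon): count transcript names in quant.sf
# that do not appear as a transcript_id in the given GTF.
# Usage: count_unmatched_transcripts <quant.sf> <annotation.gtf>
count_unmatched_transcripts() {
    local quant_sf=$1 gtf=$2
    awk -F'\t' '
        # First file (GTF): collect every transcript_id attribute value.
        NR == FNR {
            if (match($9, /transcript_id "[^"]+"/)) {
                # Skip the 15-character prefix and drop the closing quote.
                ids[substr($9, RSTART + 15, RLENGTH - 16)] = 1
            }
            next
        }
        # Second file (quant.sf): skip the header, count names absent from the GTF.
        FNR > 1 && !($1 in ids) { missing++ }
        END { print missing + 0 }
    ' "$gtf" "$quant_sf"
}

# Example call (the first path is a placeholder):
#   count_unmatched_transcripts salmon_SAMPLE/quant.sf "$ANNOTATION_FILE"
```

A result of 0 would mean every quantified transcript is present in the annotation. A large number would suggest that the index names and the GTF do not line up, which would be worth sorting out before relying on the gene-level file.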
