Duplicate ENSGids in the result file quant.genes.sf from Salmon
0
0
Entering edit mode
4 days ago

Hi! I have run Salmon for RNA-sequencing of cord blood samples. In the file quant.genes.sf I saw that some ENSGids have duplicates and for some of them, the counts differ. For example: For sample x

Column 1 is the ENSGids:
ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16 ENSG00000183878.16

Column 2 is the associated counts: 74 154 86 10 143 46 60 53 0 0 895 5 5 7 0 616

So, I wonder if I should add a command in my original code to avoid ENSGid duplicates. And also, for further analysis, I cannot keep all the duplicates. I need one ENSGids that represent the corresponding gene, I cannot have duplicates. But, since the duplicates have different counts, how I choose which " counts" (of the ENSGid that have duplicates) to keep? I hope my question is clear.

Best, Francesca

The code I used to get gene counts with Salmon:

 #!/bin/bash

# Shortcuts for Salmon paths
SALMON=/.conda/envs/salmon_env/bin/salmon
SALMON_INDEX=/Private/Update_2024/Salmon/salmon_index_v46
RNA_SEQ_FOLDER=/Raw_Data_Archive/Sequencing/Rna_Seq/Raw_Content/fastq
ANNOTATION_FILE=/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf
SALMON_OUTPUT_DIR=/Private/Update_2024/Salmon/Counts
LOG_DIR=/Private/Update_2024/Salmon/Condor

# Iterate through each sample directory in RNA_SEQ_FOLDER
for SAMPLE_DIR in "$RNA_SEQ_FOLDER"/*/; do
    # Extract the sample ID (folder name without the trailing slash)
    SAMPLE_ID=$(basename "$SAMPLE_DIR")

    # Find the fastq files for the current sample
    R1_FILE=$(ls "$SAMPLE_DIR"/*_R1_001.fastq.gz 2>/dev/null)
    R2_FILE=$(ls "$SAMPLE_DIR"/*_R2_001.fastq.gz 2>/dev/null)

    # Skip if R1 or R2 file is missing
    if [[ -z "$R1_FILE" || -z "$R2_FILE" ]]; then
        echo "Skipping $SAMPLE_ID: R1 or R2 file missing."
        continue
    fi

    # Construct Salmon command
    SALMON_COMMAND="quant -i $SALMON_INDEX -l A -1 $R1_FILE -2 $R2_FILE -p $cores -g $ANNOTATION_FILE -o ${SALMON_OUTPUT_DIR}/salmon_${SAMPLE_ID} --seqBias --gcBias --validateMappings"

    # Run the Salmon command with csubmit.sh
    echo "Running Salmon quantification for sample $SAMPLE_ID..."
    csubmit.sh -g -b "$SALMON" -a "$SALMON_COMMAND" -m "$memory" -c "$cores" -i salmon_${SAMPLE_ID} -p "$LOG_DIR"

    # Check if the command was successful
    if [ $? -eq 0 ]; then
        echo "Salmon quantification for sample $SAMPLE_ID completed successfully."
    else
        echo "Salmon quantification failed for sample $SAMPLE_ID!" >&2
        exit 1
    fi
done
quant.genes.sf Salmon • 347 views
ADD COMMENT
0
Entering edit mode

How did you make your salmon index? Using the transcript file for v.46 from GENCODE?

ADD REPLY
0
Entering edit mode

HI! I downloaded the transcript sequence in FASTA format from https://www.gencodegenes.org/human/release_46.html I got the gencode.v46.transcripts.fa.gz file. I run gunzip to decompress the file. Then I run the comman in bash:

salmon index -t /Private/Update_2024/Salmon/gencode.v46.transcripts.fa -I salmon_index_v46

When I was running the script I encountered an error:

“This version of salmon does not support indexing using the RapMap index,”

It seemed that the Salmon index was incompatible with the current version of Salmon I was using (v1.10.3). RapMap was used in older versions of Salmon for building indices, but recent versions use Selective Alignment instead. So I rebuilt the index, in bash, using:

salmon index -t /Private/Update_2024/Salmon/gencode.v46.transcripts.fa -i /Private/Update_2024/Salmon/salmon_index_v46

I used the "new" salmon_index_46 for my analysis.

I also noticed that in the result file called quant.genes.sf for each sample, I have in the first column the ENSTids, in the second column the ENSGids and in the last column (column 11) the NumReads. The duplicates are in the ENSGids column since different transcripts (ENSTids) have associated the same gene (ENSGids). I would like to analyse my RNA-seq for gene counts, not for transcripts. It seems that the duplicates are caused by the transcript reads still present in my result file quant.genes.sf. The duplicates are in the ENSGids column since different transcripts (ENSTids) are associated with the same gene (ENSGids). For example

ENSTENST00000470460.1|ENSG00000226941.9|

ENST00000250831.6|ENSG00000226941.9|

Is there a command I can use in my code to get only gene counts? And should I use another FASTA file for the human genome v46?

Thanks Francesca

ADD REPLY
1
Entering edit mode

Take a look at: https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html

Import and summarize transcript-level abundance estimates for transcript- and gene-level analysis with Bioconductor packages, such as edgeR, DESeq2, and limma-voom.

ADD REPLY
0
Entering edit mode

Thanks for the input. I checked the documentation and tested the command tximport for one sample. I have got one message: transcripts missing from tx2gene: 135208. My Salmon quant.sf file had 253181 obs. (rows) of 5 variables (columns). The tx2gene file had 118684 obs. of 2 variables. The result file 999_gene_counts.csv had 62640 elements (column 1 ENSGids and column 2 gene counts). I wonder if I should consider the missing transcripts since they were not found in my gft file. Should I investigate that? I attached my R code:


# Load necessary libraries
library(dplyr)
library(readr)
library(tximport)
library(rtracklayer)
library(GenomicFeatures)

# Path to Gencode GTF file
gtf_file <- "/PONA/Cohorts_Analysis/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf"

# Import GTF file using rtracklayer
gtf <- import(gtf_file)
head(gtf)

# Extract transcript-to-gene mapping (TXNAME -> GENEID)
tx2gene <- gtf %>%
  as.data.frame() %>%
  filter(type == "transcript") %>%  # Ensure we only have transcript entries
  select(TXNAME = transcript_id, GENEID = gene_id)  # Select only the relevant columns

# View the first few rows of tx2gene to ensure it's correct
head(tx2gene)

# Paths to Salmon quant files
quant_files <- list.files("/PONA/Cohorts_Analysis/Results/Results_EdegR_Salmon", pattern = "999_quant.sf", full.names = TRUE)
head(quant_files)

# Read a quant.sf file and clean the 'Name' column
quant_data <- read_tsv(quant_files[1], show_col_types = FALSE)
quant_data <- quant_data %>%
  mutate(Name = sub("\\|.*", "", Name))  # Remove everything after the first '|'

head(quant_data$Name)  # Should now match tx2gene IDs

# Proceed with tximport to import Salmon data and summarize counts at the gene level
txi <- tximport(quant_files, 
                type = "salmon", 
                tx2gene = tx2gene, 
                countsFromAbundance = "lengthScaledTPM", 
                ignoreAfterBar = TRUE)

# Check results (first few gene-level counts)
head(txi$counts)

# Output gene-level counts to CSV
gene_counts <- txi$counts
write.csv(gene_counts, file = "/PONA/Cohorts_Analysis/Results/Results_EdegR_Salmon/999_gene_counts.csv", row.names = TRUE)
dim(gene_counts)

# Print summary
print("Gene-level counts have been saved to 999_gene_counts.csv.")
ADD REPLY

Login before adding your answer.

Traffic: 2400 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6