Question

Low Salmon mapping rate even after stringent `fastp` trimming (overrepresented GGGGG still present, plus failing "Per tile sequence quality" and a "Per base sequence content" warning)

7 days ago

Quang ▴ 10

Hi there,

I have some paired-end bulk RNA-seq data. I ran fastp as shown below, followed by FastQC to check the results. Despite using --trim_poly_g in fastp, the FastQC report still gives a failed flag for an overrepresented "GGGGG" sequence with "No Hit" as its source.

In addition, the FastQC report shows a warning for "Per base sequence content" towards the beginning of the reads, and a failed flag for "Per tile sequence quality".

Running Salmon quant against the decoy-aware transcriptome then gave a very low mapping rate of 33%.

I have read several posts discussing reasons for a low mapping rate, but does anyone have suggestions for how to address it in this case?

Thank you for your help.

FASTP:

 fastp \
    --adapter_fasta /ceph/project/borrowlab/qnguyen/RAW_bulkRNASeq_TMNCTFHTREGCD8_QNN2024Aug19/X204SC24072759-Z01-F001_02/01.RawData/A_5/for_Trimming.fasta \
    --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    --qualified_quality_phred 5 \
    --unqualified_percent_limit 50 \
    --n_base_limit 15 \
    --overlap_len_require 30 \
    --overlap_diff_limit 1 \
    --overlap_diff_percent_limit 10 \
    --length_required 150 \
    --length_limit 150 \
    --trim_poly_g \
    -i "$file" \
    -I "$r2_file" \
    -o "$output_r1" \
    -O "$output_r2"

My for_Trimming.fasta

>ClontechSMART_1
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_2
GCTAATCATTGCAAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTT
>ClontechSMART_3
AGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_4
GCTAATCATTGCAAGCAGTGGTATCAACGCAGAGTACTGTTTTTTTTTTT
>ClontechSMART_5
AAGCAGTGGTATCAACGCAGAGTACTGTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_6
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT
>Illumina_TruSeq_Adapter_Read_1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina_TruSeq_Adapter_Read_2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Salmon index

# Define file paths
genome="GRCh38.primary_assembly.genome.fa.gz"  # Genome file
transcriptome="gencode.v47.transcripts.fa.gz"  # GENCODE transcriptome file
output_index="gencode.v47_human_index_wDecoy"  # Output directory for the Salmon index
decoy_list="decoys.txt"                       # Decoy list file
combined_fasta="combined_transcriptome_and_genome.fa"  # Combined FASTA file

# Step 1: Create the decoy list
echo "Creating decoy list from genome file..."
grep "^>" <(gunzip -c "$genome") | cut -d " " -f 1 > "$decoy_list"
sed -i.bak -e 's/>//g' "$decoy_list"

# Step 2: Combine the transcriptome and genome FASTA files
echo "Combining transcriptome and genome FASTA files..."
zcat "$transcriptome" "$genome" > "$combined_fasta"

# Step 3: Build the Salmon decoy-aware index
echo "Building decoy-aware Salmon index..."
salmon index \
  -t "$combined_fasta" \
  -d "$decoy_list" \
  -i "$output_index" \
  -p 12 \
  --gencode

Salmon quant

salmon quant \
    -i "$index_path" \
    -l A \
    -1 "$r1_file" \
    -2 "$r2_file" \
    -o "$sample_output" \
    --writeMappings="${sample_output}/${sample_name}.salmon.sam" \
    --gcBias \
    --validateMappings
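
For anyone comparing numbers across samples: a minimal sketch for pulling the mapping statistics out of a finished Salmon run. It assumes the default output layout, where Salmon writes `aux_info/meta_info.json` inside the `-o` directory with fields such as `percent_mapped` and `num_decoy_fragments`. The function name `salmon_mapping_summary` is hypothetical, and plain `sed` is used so that no JSON tool is required.

```shell
#!/usr/bin/env bash
# Print a few mapping statistics from one Salmon output directory.
# Assumption: the directory contains aux_info/meta_info.json (default layout).
salmon_mapping_summary() {
    local meta="$1/aux_info/meta_info.json"
    [ -f "$meta" ] || { echo "missing: $meta" >&2; return 1; }
    local key val
    for key in num_processed num_mapped num_decoy_fragments percent_mapped; do
        # Extract the numeric value that follows "key": in the JSON text
        val=$(sed -n "s/.*\"$key\": *\([0-9.eE+-]*\).*/\1/p" "$meta" | head -n 1)
        printf '%s\t%s\n' "$key" "${val:-NA}"
    done
}

# Usage: salmon_mapping_summary "$sample_output"
```

If `num_decoy_fragments` is large relative to `num_processed`, many reads are matching the genome decoys better than any transcript, which would point towards intronic or intergenic signal rather than a trimming problem.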

I also referred to: https://github.com/OpenGene/fastp/issues/589

score 0 · Answer 1 · 2024-12-15

6 days ago

ATpoint 86k

The polyG sequence is in only 0.5% of all reads, so it cannot be the reason. What sort of library is this, and what is the experiment? It might simply be poor quality.
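
A rough way to check a figure like this directly on a trimmed FASTQ file is sketched below. It counts reads whose 3' end is a run of G's of at least a given length. The function name `polyg_tail_fraction` is hypothetical, and the default threshold of 10 is an assumption chosen to match what I understand to be fastp's default `--poly_g_min_len`.

```shell
#!/usr/bin/env bash
# Report the share of reads that end in a run of at least min_g G's.
# Works on plain or gzip-compressed FASTQ with 4 lines per record.
polyg_tail_fraction() {
    local fastq="$1" min_g="${2:-10}"
    # gzip -cdf decompresses .gz input and passes plain text through unchanged
    gzip -cdf "$fastq" | awk -v n="$min_g" '
        NR % 4 == 2 {                       # sequence line of each record
            total++
            # match() finds the trailing G run; RLENGTH is its length
            if (match($0, /G+$/) && RLENGTH >= n) hits++
        }
        END {
            if (total == 0) { print "no reads"; exit 1 }
            printf "%d/%d reads (%.2f%%) end in >=%d G\n", hits, total, 100 * hits / total, n
        }'
}

# Usage: polyg_tail_fraction "$output_r1" 10
```

This only looks at read tails, so it will not match the FastQC overrepresented-sequence percentage exactly, but a value well under 1% would support the point that polyG is not what drives the low mapping rate.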