Hi there,
I have some paired-end bulk RNA-seq data. I ran fastp as shown below, followed by FastQC to check the results. Despite using --trim_poly_g in fastp, I still get a failed flag for "GGGGG" with a "No Hit" source in the FastQC report.
In addition, the FastQC report shows a warning for "Per base sequence content" towards the beginning of the reads, and a failed flag for "Per tile sequence quality".
The subsequent Salmon quant run against the decoy-aware transcriptome gave a very low mapping rate of 33%.
I have read several posts discussing the causes of low mapping rates, but does anyone have suggestions for how to address it in this case?
Thank you for your help.
FASTP:
fastp \
    --adapter_fasta /ceph/project/borrowlab/qnguyen/RAW_bulkRNASeq_TMNCTFHTREGCD8_QNN2024Aug19/X204SC24072759-Z01-F001_02/01.RawData/A_5/for_Trimming.fasta \
    --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    --qualified_quality_phred 5 \
    --unqualified_percent_limit 50 \
    --n_base_limit 15 \
    --overlap_len_require 30 \
    --overlap_diff_limit 1 \
    --overlap_diff_percent_limit 10 \
    --length_required 150 \
    --length_limit 150 \
    --trim_poly_g \
    -i "$file" \
    -I "$r2_file" \
    -o "$output_r1" \
    -O "$output_r2"
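In case it helps with diagnosis, here is a minimal sketch of how the poly-G tails remaining in the trimmed output could be counted. It assumes standard 4-line FASTQ records (plain or gzipped). `count_polyg_tails` is a hypothetical helper name of my own, not part of fastp or FastQC, and the default run length of 10 is an assumption chosen to mirror fastp's default --poly_g_min_len.

```shell
# Hypothetical helper: count reads whose 3' end is a run of at least
# MIN_RUN consecutive G bases. Prints "<tailed_reads><TAB><total_reads>".
# Usage: count_polyg_tails reads.fastq[.gz] [min_run]
count_polyg_tails() {
    local fastq="$1"
    local min_run="${2:-10}"   # assumption: same threshold as fastp's default

    # Decompress on the fly if the file name ends in .gz
    if [ "${fastq##*.}" = "gz" ]; then
        gunzip -c "$fastq"
    else
        cat "$fastq"
    fi | awk -v n="$min_run" '
        # The sequence is the 2nd line of every 4-line FASTQ record
        NR % 4 == 2 {
            total++
            # match() sets RLENGTH to the length of the trailing G run
            if (match($0, /G+$/) && RLENGTH >= n) tailed++
        }
        END { printf "%d\t%d\n", tailed + 0, total + 0 }
    '
}
```

Running `count_polyg_tails "$output_r1"` and `count_polyg_tails "$output_r2"` on the fastp output would show whether long G tails really survive trimming, or whether FastQC is mostly reacting to a small number of reads.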
My for_Trimming.fasta:
>ClontechSMART_1
AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_2
GCTAATCATTGCAAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTT
>ClontechSMART_3
AGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_4
GCTAATCATTGCAAGCAGTGGTATCAACGCAGAGTACTGTTTTTTTTTTT
>ClontechSMART_5
AAGCAGTGGTATCAACGCAGAGTACTGTTTTTTTTTTTTTTTTTTTTTTT
>ClontechSMART_6
GCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTT
>Illumina_TruSeq_Adapter_Read_1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Illumina_TruSeq_Adapter_Read_2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Salmon index:
# Define file paths
genome="GRCh38.primary_assembly.genome.fa.gz"          # Genome file
transcriptome="gencode.v47.transcripts.fa.gz"          # GENCODE transcriptome file
output_index="gencode.v47_human_index_wDecoy"          # Output directory for the Salmon index
decoy_list="decoys.txt"                                # Decoy list file
combined_fasta="combined_transcriptome_and_genome.fa"  # Combined FASTA file

# Step 1: Create the decoy list
echo "Creating decoy list from genome file..."
grep "^>" <(gunzip -c "$genome") | cut -d " " -f 1 > "$decoy_list"
sed -i.bak -e 's/>//g' "$decoy_list"

# Step 2: Combine the transcriptome and genome FASTA files
echo "Combining transcriptome and genome FASTA files..."
zcat "$transcriptome" "$genome" > "$combined_fasta"

# Step 3: Build the Salmon decoy-aware index
echo "Building decoy-aware Salmon index..."
salmon index \
    -t "$combined_fasta" \
    -d "$decoy_list" \
    -i "$output_index" \
    -p 12 \
    --gencode
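One thing that could be checked on the index side is that every name in the decoy list actually appears as a sequence header in the combined FASTA. The following is only a sketch under that assumption: it expects an uncompressed FASTA like the one built in Step 2, and `missing_decoys` is a hypothetical helper name, not a Salmon command.

```shell
# Hypothetical helper: print decoy names that have no matching header
# (first whitespace-delimited token after ">") in the given FASTA file.
# Prints nothing when every decoy name is present.
# Usage: missing_decoys decoys.txt combined.fa
missing_decoys() {
    local decoys="$1"
    local fasta="$2"
    local headers
    headers=$(mktemp)

    # Collect the sequence names exactly as they were cut for the decoy list
    grep '^>' "$fasta" | sed -e 's/^>//' -e 's/ .*//' | sort -u > "$headers"

    # comm -23 keeps lines found only in the (sorted) decoy list
    sort -u "$decoys" | comm -23 - "$headers"

    rm -f "$headers"
}
```

With the variables defined above, `missing_decoys "$decoy_list" "$combined_fasta"` should print nothing if the decoy list and the combined FASTA are consistent.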
Salmon quant:
salmon quant \
    -i "$index_path" \
    -l A \
    -1 "$r1_file" \
    -2 "$r2_file" \
    -o "$sample_output" \
    --writeMappings="${sample_output}/${sample_name}.salmon.sam" \
    --gcBias \
    --validateMappings
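To see where the unmapped fragments go, the counters Salmon records for the run could be pulled out of the quant directory. This is a minimal sketch, assuming Salmon wrote aux_info/meta_info.json under the output directory and that it uses the key names listed in the loop (these may differ between Salmon versions). `salmon_mapping_summary` is a hypothetical helper name, not a Salmon subcommand.

```shell
# Hypothetical helper: print selected counters from a Salmon quant
# directory as "<key><TAB><value>" lines; "NA" if a key is absent.
# Usage: salmon_mapping_summary /path/to/sample_output
salmon_mapping_summary() {
    local meta="$1/aux_info/meta_info.json"
    local key value

    if [ ! -f "$meta" ]; then
        echo "missing: $meta" >&2
        return 1
    fi

    # Key names are an assumption based on recent Salmon releases
    for key in num_processed num_mapped num_decoy_fragments \
               num_dovetail_fragments num_fragments_filtered_vm \
               percent_mapped; do
        # Pull the numeric value that follows "key": on its line
        value=$(sed -n "s/.*\"$key\": *\([0-9.eE+-]*\).*/\1/p" "$meta" | head -n 1)
        printf '%s\t%s\n' "$key" "${value:-NA}"
    done
}
```

Calling `salmon_mapping_summary "$sample_output"` would show, for example, whether many fragments were assigned to decoys or discarded during selective alignment, which may help narrow down the cause of the low rate.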
Referenced also: https://github.com/OpenGene/fastp/issues/589