Hi!
I am doing RNA-sequencing analysis. I did STAR alignment (version 2.7.1a) and then Feature Counts (version 1.6.4). I tried the two option, -s 0 (unstranded) and -s 2 (reversely stranded.
The code attached below is the code I used with he -s 0 option. I run it also with option-s 2.
For most of my samples I have got: successfully assigned alignments: 36 % (-s 0) and 35 % (-s 2)
For a few of my samples:
- successfully assigned alignments: 1.6 % (-s 0) and no percentage (-s 2)
- successfully assigned alignments: 0.6 % (-s 0) and no percentage (-s 2)
- successfully assigned alignments: 3.2 % (-s 0) and no percentage (-s 2)
It seems that for these few samples with low assigned alignments there are not assigned alignments when I run feature counts code with the reversely stranded option.
I checked the alignments results in log.final.out:
for the samples with assigned alignment around 35 % Uniquely mapped reads % ~ 70-75 % % of reads mapped to multiple loci ~ 25-30%
for the samples with assigned alignment around 1 % Uniquely mapped reads % ~ 15-20 % % of reads mapped to multiple loci ~ 75 - 80 %
Aligned.sortedByCoord.out.bam
for the samples with assigned alignment around 35 % size around 5 000 000 000 bytes
for the samples with assigned alignment around 1 % size around 40 000 000 000 bytes
Questions:
- Is having around 35-36% assigned alignments normal?
- Why do few samples have so low assigned alignments in -s 0 option and not at all assigned alignments in s- 2?
- Why do the difference in Uniquely mapped reads and % of reads mapped to multiple loci have opposite values (%) in samples with 35% assigned alignments compared to samples with 1% assigned alignments?
I hope I have explained my issue clearly enough to make it understandable. Best regards, Francesca
#!/bin/bash
# MAKE NEW DIRECTORY FOR EACH SAMPLE IN ALIGNMENTS FOLDER
mkdir -p /ludc/Active_Projects/PONA/Private/Update_2024/Alignments/1334_S11/
# RUN STAR FOR EACH SAMPLE
/ludc/Tools/Software/STAR/2.7.1a/bin/STAR --genomeDir /ludc/Active_Projects/PONA/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14 \
--sjdbGTFfile /ludc/Active_Projects/PONA/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf \
--readFilesIn \
/ludc/Raw_Data_Archive/Sequencing/Rna_Seq/FoetalforNCD/Raw_Content/fastq/1334/1334_S11_R1_001.fastq.gz \
/ludc/Raw_Data_Archive/Sequencing/Rna_Seq/FoetalforNCD/Raw_Content/fastq/1334/1334_S11_R2_001.fastq.gz \
--outFilterMismatchNmax 20 --outFilterType BySJout --runThreadN 12 --alignSJDBoverhangMin 1 --sjdbOverhang 99 --outFilterMismatchNoverLmax 0.04 --outSAMtype BAM SortedByCoordinate --alignIntronMin 20 --alignIntronMax 1000000 --outReadsUnmapped Fastx --readFilesCommand zcat \
--outSAMstrandField intronMotif \
--outFileNamePrefix /ludc/Active_Projects/PONA/Private/Update_2024/Alignments/1334_S11/
# RUN FEATURECOUNTS FOR EACH SAMPLE
echo FEATURECOUNTS IS WORKING WITH SAMPLE 1334_S11
/ludc/Tools/Software/Subread/1.6.4/bin/featureCounts -a /ludc/Active_Projects/PONA/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf \
/ludc/Active_Projects/PONA/Private/Update_2024/Alignments/1334_S11/Aligned.sortedByCoord.out.bam -p -T 12 -s 0 -C -O \
-o /ludc/Active_Projects/PONA/Private/Update_2024/Counts/featureCounts_Gene/1334_S11.gene_counts.txt
# RUN FEATURECOUNTS FOR EACH SAMPLE ON EXON LEVEL
echo FEATURECOUNTS IS WORKING WITH SAMPLE 1334_S11
/ludc/Tools/Software/Subread/1.6.4/bin/featureCounts -a /ludc/Active_Projects/PONA/Private/Update_2024/Reference_genome/Release_46_GRCh38.p14/gencode.v46.primary_assembly.basic.annotation.gtf \
/ludc/Active_Projects/PONA/Private/Update_2024/Alignments/1334_S11/Aligned.sortedByCoord.out.bam -p -f -T 12 -s 0 -C -O \
-o /ludc/Active_Projects/PONA/Private/Update_2024/Counts/featureCounts_Exon/1334_S11.exon_counts.txt
echo ANALYSIS FOR SAMPLE 1334_S11 IS FINISHED
At least from my experience, given these are (I assume) human samples and the GRCh38 reference genome, I would expect a much higher % of uniquely mapped reads (from the STAR mapping output). I would suggest running something like
FastQC
on the FASTQ files to check if there might some oddities in the raw reads.The
-s 0/1/2
depend on how your library was prepped. If it was stranded then you need to choose either-s 1
(forward) or-s 2
reverse. Since you also haveR1
andR2
i.e. paired-end, it might also be good to include-p
and--countreadpairs
in the featureCoutns command.From some other posts (e.g. STAR output - high multi-mapping, low alignment (human)) the high % of multiple-mapped loci could also be due to rRNA sequences, which could also be investigated.