Question

featureCounts results: low rate of 'Successfully assigned alignments'

0

Entering edit mode

12 months ago

Hamtaro ▴ 50

Hello everybody.

I'm a newbie in RNA-seq Analysis, and I have this situation that I don't really understand.

While working with featureCounts for RNA-seq read quantification, I came across an intriguing issue. The rate of successfully assigned alignments turned out to be unexpectedly low, totalling just 15463270 (7.6%). This was rather surprising, especially considering that HISAT2 yielded a high mapping rate of 97.11% overall alignment rate. Upon scrutinizing the summary file.

Any insights or suggestions regarding this situation would be greatly appreciated.

The Human Reference genome (hg38) and annotation file was download from Ensembl and the index was created using hisat2-build.

This is the output from HISAT2 from one example. Most of the samples have similar:

hisat2 -p 24 -x ~/Documents/Reference_genomes/H.sapiens_Ensembl/idx/genome -1 ./Data/fastq/P115_L3_1.fq.gz -2 ./Data/fastq/P115_L3_2.fq.gz | samtools sort > ./Results/bam/P115_1mes_EKRN230023385-1A_HFC7FDSX7_L3.bam

And Results

37521707 reads; of these:
  37521707 (100.00%) were paired; of these:
    1699858 (4.53%) aligned concordantly 0 times
    12794982 (34.10%) aligned concordantly exactly 1 time
    23026867 (61.37%) aligned concordantly >1 times
    ----
    1699858 pairs aligned concordantly 0 times; of these:
      62360 (3.67%) aligned discordantly 1 time
    ----
    1637498 pairs aligned 0 times concordantly or discordantly; of these:
      3274996 mates make up the pairs; of these:
        2169082 (66.23%) aligned 0 times
        623965 (19.05%) aligned exactly 1 time
        481949 (14.72%) aligned >1 times
97.11% overall alignment rate
[bam_sort_core] merging from 16 files and 1 in-memory blocks...

The code I run for FeatureCounts is the following one:

cat ids.txt | parallel -j 1 echo "./Results/bam/{}.bam" | \                                                                
                xargs featureCounts -p -a ~/Documents/Reference_genomes/H.sapiens_Ensembl/Homo_sapiens.GRCh38.110.gtf -o ./Results/counts.txt

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/     v2.0.6
//========================== featureCounts setting ===========================\ || || || Input files : 164 BAM files
|| ||
|| ||
P115_L1.bam || ||
P115_L2.bam || ||
P115_L3.bam || ||
P115_L4.bam || ||

|| || Output file : counts.txt
|| || Summary : counts.txt.summary
|| || Paired-end : yes
|| || Count read pairs : no
|| || Annotation : Homo_sapiens.GRCh38.110.gtf (GTF)
|| || Dir for temp files : ./Results
|| ||
|| || Threads : 23
|| || Level : meta-feature level
|| || Multimapping reads : not counted
|| || Multi-overlapping reads : not counted
|| || Min overlapping bases : 1
|| ||
|| \============================================================================//

//================================= Running ==================================\ || || || Load annotation file Homo_sapiens.GRCh38.110.gtf ...
|| || Features : 1649677
|| || Meta-features : 62754
|| || Chromosomes/contigs : 47
|| ||
|| || Process BAM file P115_L1.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 204560676
|| || Successfully assigned alignments : 15463270 (7.6%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L2.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 204560676
|| || Successfully assigned alignments : 15463270 (7.6%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L3.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 217887388
|| || Successfully assigned alignments : 18352057 (8.4%)
|| || Running time : 0.08 minutes
|| ||
|| || Process BAM file P115_L4.bam... || || Paired-end reads are included.
|| || The reads are assigned on the single-end mode.
|| || Total alignments : 217887388
|| || Successfully assigned alignments : 18352057 (8.4%)
|| || Running time : 0.08 minutes
||

Thank you in advance.

I think I found the possible problem. After carry out the fastQC, I have some FAIL in Sequence Duplication Levels and Overrepresented sequences.

I run a blast with the sequences and most of them belongs to globin.

Do you think that is because they didn't carry out depletion hgbRNA depletion?

fastQC

FeatureCounts HISAT2 RNA-seq • 1.6k views

ADD COMMENT • link updated 11 months ago by Yang • 0 • written 12 months ago by Hamtaro ▴ 50

score 1 · Answer 1 · 2023-11-28

1

Entering edit mode

12 months ago

kvn95ss ▴ 20

I could see the following clues from -

hisat2 - 23026867 (61.37%) aligned concordantly >1 times

And from featureCounts - Multimapping reads : not counted

I think this is a decent lead - many of the reads are multimapping, and by default only unique reads are included by featureCounts. Try changing to include multimapping reads, but you would have to identify why you have soo many multimapping reads.

ADD COMMENT • link 12 months ago by kvn95ss ▴ 20

0

Entering edit mode

Thank you for your reply. I how can I identify why I have soo many multimapping reads?

The samples comes from blood from patients under different treatment. Could this be the cause?

ADD REPLY • link 12 months ago by Hamtaro ▴ 50

0

Entering edit mode

why I have soo many multimapping reads?

That is a characteristic of the libraries. Not much you can do at this point in process. Just having blood from patients with different treatments should not cause this.

ADD REPLY • link 12 months ago by GenoMax 147k

0

Entering edit mode

I added the fastQC. What do you think of the possible cause I have proposed?

ADD REPLY • link 12 months ago by Hamtaro ▴ 50

0

Entering edit mode

Almost all your samples have the over represented sequence, you can blast the top 10 recurring sequences and see what it could be.

ADD REPLY • link 12 months ago by kvn95ss ▴ 20

0

Entering edit mode

I run a blast with the sequences and most of them belongs to globin.

Globin reads shown in your FastQC report are not going to add up to more than 7-8% of the data.

Since globins can account for 30-80% of blood RNA the depletion has certainly worked albeit not 100%.

ADD REPLY • link 12 months ago by GenoMax 147k

0

Entering edit mode

What I do not really know if they have performed the depletion... These are samples that were sent to NovoGene for sequencing and I have received them now for analysis....