Question

featureCounts: Low percentage of assigned fragments

6

Entering edit mode

8.3 years ago

aggregatibacter ▴ 190

Hi,

I have Illumina paired-end RNA-Seq data (prepared with the TruSeq stranded kit) for human tissue biopsies.

After QC and alignment to the ENSEMBL genome and gtf (GRCh38 rel 84 from ensembl.org) using STAR (alignment perc. of 75% - 90%), I use featureCounts (in R) to count genes.

I get very different rates of successfully assigned fragments, ranging from 20% to about 60%, with the majority about 45%. Is this normal? Or could something be wrong with (some of) my libraries? Do I need to account for the different assignment proportions?

Many thanks!

RNA-Seq rna-seq • 22k views

ADD COMMENT • link 8.3 years ago by aggregatibacter ▴ 190

1

Entering edit mode

Take a look in a genome browser, try to figure out if the reads fall along non coding regions, introns, or perhaps you can track peaks in unexpected regions.

ADD REPLY • link 8.3 years ago by Asaf 10k

0

Entering edit mode

Are you using the correct "stranded" option for TruSeq which should be "reverse stranded"?

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

Yes, I use strandSpecific=-1. With forward stranded, I got less than 5% in all samples.

ADD REPLY • link 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

Did you use -s 2 with featureCounts?

ADD REPLY • link 8.3 years ago by GenoMax 147k

0

Entering edit mode

Yes, or -1 (which seems to be the same thing).

ADD REPLY • link 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

You may have a lot of multi-mapped reads. Those would not be assigned a gene. What is you alignment rate for unique alignments only?

ADD REPLY • link 8.3 years ago by igor 13k

0

Entering edit mode

I have 75% - 90% of uniquely mapped reads. % of reads mapped to multiple loci ranges from 15% to about 5% - could this be the reason for the observed low assignment rate? Would you suggest to (i) accept this and go on, or (ii) try to count the multiple loci reads, too (which is possible in featurecounts, if I am not mistaken).

ADD REPLY • link 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

That number of uniquely mapped reads is fine, nothing to worry about. Most likely obvious, but you did align to hg38 and count for the same genome? Could you share the featureCounts command/code you used?

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Sure:

counts <- featureCounts(files=files, annot.ext="/data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf", isPairedEnd=TRUE, isGTFAnnotationFile=TRUE, strandSpecific=2, nthreads=12, useMetaFeatures=TRUE)

I used the gft I gave to STAR when generating its genome -

STAR --runThreadN 12 --runMode genomeGenerate --genomeDir /data/STARgenomes/ensembl/ --genomeFastaFiles /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /data/STARgenomes/ensembl/Homo_sapiens.GRCh38.84.gtf

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

I also get this warning, but that should be rather normal because of the multi-mapping? I don´t have STAR sort my SAMs - is that ok?

 WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 26620478                                              ||
||    Successfully assigned fragments : 12283289 (46.1%)                      ||
||    Running time : 0.69 minutes

ADD REPLY • link updated 8.3 years ago by GenoMax 147k • written 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

Seems nothing to worry about. Could you tell me, for example for this sample, how many reads you had (before mapping!). It is possible that some reads map each many many times, and as such inflate the 26620478, and lower the percentage. In fact, you want to compare the 'Successfully assigned fragments' to the 'uniquely mapped' reads.

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

I feel you may be right, it may all be full of multi-mapping reads.

Different from the defaults, I use several options the STAR manual presents as ENCODE standard options (and I therefore found them possibly useful)

--outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 20 --alignIntronMax 10000 --alignMatesGapMax 1000000

Possibly, those will add to further inflation?

See below for the logfile for the sample I pasted before (and many thanks for you efforts, btw!)

                             Started job on |   Jul 20 06:14:44
                         Started mapping on |   Jul 20 06:14:45
                                Finished on |   Jul 20 06:19:14
   Mapping speed, Million of reads per hour |   264.22

                      Number of input reads |   19743421
                  Average input read length |   246
                                UNIQUE READS:
               Uniquely mapped reads number |   16959018
                    Uniquely mapped reads % |   85.90%
                      Average mapped length |   245.42
                   Number of splices: Total |   11930188
        Number of splices: Annotated (sjdb) |   11861337
                   Number of splices: GT/AG |   11858385
                   Number of splices: GC/AG |   61520
                   Number of splices: AT/AC |   6388
           Number of splices: Non-canonical |   3895
                  Mismatch rate per base, % |   0.21%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.73
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.55
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1908781
         % of reads mapped to multiple loci |   9.67%
    Number of reads mapped to too many loci |   9447
         % of reads mapped to too many loci |   0.05%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   4.32%
                 % of reads unmapped: other |   0.06%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

ADD REPLY • link 8.3 years ago by aggregatibacter ▴ 190

0

Entering edit mode

Based on this, 12283289 were assigned out of 16959018 uniquely mapping reads. That's already better than the fraction you had earlier :)

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Many thanks for your insightful comments. I did a little reading on multi-mapping reads. From what I gather, I would rather avoid to count all of them in an RNA Seq diff exp analysis, would you agree?

ADD REPLY • link 8.3 years ago by aggregatibacter ▴ 190

1

Entering edit mode

I guess most software by default ignores multimapping reads so you don't have to worry about then. But you could argue that ignoring those reads is a loss of information. You cannot just combine the reads with the uniquely mapped, but I find the method described in this paper an interesting approach.

ADD REPLY • link 8.3 years ago by WouterDeCoster 47k

0

Entering edit mode

You could choose to map them to a random best site (that way you would not lose that information). ambig=random is what BBMap uses for this purpose.

ADD REPLY • link 8.3 years ago by GenoMax 147k