Mapping percent difference between hg38 and hg19
8.9 years ago

Hey,

Is it normal to see a drop of ~10% in mapping rate with hg38 compared to hg19?

I mapped the same set of samples with the same tool and under the same conditions. Alignment against hg19 gives an average in the 80s (%), while alignment against hg38 drops to an average in the 70s (%).

Why would that tend to happen?

rna-seq alignment • 3.9k views

What was the exact command? It could explain the difference if you are only looking at unambiguously mapping reads.


I used STAR. The only difference is that when building the index for hg38 I included the annotation GTF file in the command; I didn't do that for hg19. The alignment command was the same for both!

Would that have an effect?


I also took both the genome FASTA (genome.fa) and the annotation file from Ensembl for hg38, but from UCSC for hg19.


I don't know STAR. What was the exact alignment command you used? How does it report unambiguously mapping reads?


The command I used was:

STAR --genomeDir /home/hg38 --sjdbGTFfile /home/hg38.gtf --runThreadN 10 --outSAMstrandField intronMotif --readFilesIn /home/fastq_1  /home/fastq_2 --outFileNamePrefix sample1Star
# same for hg19

This is the summary for a sample mapped against hg38:

                      Number of input reads |    27873030
                  Average input read length |    202
                                UNIQUE READS:
               Uniquely mapped reads number |    21309030
                    Uniquely mapped reads % |    76.45%
                      Average mapped length |    200.36
                   Number of splices: Total |    9873696
        Number of splices: Annotated (sjdb) |    9782721
                   Number of splices: GT/AG |    9786065
                   Number of splices: GC/AG |    73676
                   Number of splices: AT/AC |    8042
           Number of splices: Non-canonical |    5913
                  Mismatch rate per base, % |    0.24%
                     Deletion rate per base |    0.01%
                    Deletion average length |    1.48
                    Insertion rate per base |    0.01%
                   Insertion average length |    1.49
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |    5205343
         % of reads mapped to multiple loci |    18.68%
    Number of reads mapped to too many loci |    33763
         % of reads mapped to too many loci |    0.12%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |    0.00%
             % of reads unmapped: too short |    4.75%
                 % of reads unmapped: other |    0.01%
                              CHIMERIC READS:
                   Number of chimeric reads |    0
                        % of chimeric reads |    0.00%

This is the summary for the same sample mapped against hg19:

                      Number of input reads |    27873030
                  Average input read length |    202
                                UNIQUE READS:
               Uniquely mapped reads number |    24359828
                    Uniquely mapped reads % |    87.40%
                      Average mapped length |    198.69
                   Number of splices: Total |    9840758
        Number of splices: Annotated (sjdb) |    9656134
                   Number of splices: GT/AG |    9744383
                   Number of splices: GC/AG |    72088
                   Number of splices: AT/AC |    7342
           Number of splices: Non-canonical |    16945
                  Mismatch rate per base, % |    0.50%
                     Deletion rate per base |    0.01%
                    Deletion average length |    2.04
                    Insertion rate per base |    0.02%
                   Insertion average length |    1.61
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |    657341
         % of reads mapped to multiple loci |    2.36%
    Number of reads mapped to too many loci |    3645
         % of reads mapped to too many loci |    0.01%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |    0.00%
             % of reads unmapped: too short |    10.23%
                 % of reads unmapped: other |    0.01%
                              CHIMERIC READS:
                   Number of chimeric reads |    0
                        % of chimeric reads |    0.00%
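
Adding up the fields in the two summaries above helps to see where the drop comes from. Unique plus multi-mapped reads come to 95.13% for hg38 versus 89.76% for hg19, so the lower number for hg38 is a shift from unique to multi-mapping reads (2.36% to 18.68%) rather than a loss of mapped reads. Below is a minimal Python sketch of that arithmetic; the helper names are hypothetical and only a few fields are copied from the summaries above (in practice the text would come from STAR's final log file).

```python
def parse_star_summary(text):
    """Turn 'name | value' lines from a STAR summary into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        name, value = (part.strip() for part in line.split("|", 1))
        stats[name] = value
    return stats


def percent(stats, name):
    """Read a percentage field such as '76.45%' as a float."""
    return float(stats[name].rstrip("%"))


def mapping_breakdown(text):
    """Summarise unique, multi-mapped and 'too short' percentages."""
    stats = parse_star_summary(text)
    unique = percent(stats, "Uniquely mapped reads %")
    multi = percent(stats, "% of reads mapped to multiple loci")
    return {
        "unique": unique,
        "multi": multi,
        "unique_plus_multi": round(unique + multi, 2),
        "unmapped_too_short": percent(stats, "% of reads unmapped: too short"),
    }


# Values copied from the two summaries posted above.
HG38_SUMMARY = """
                                UNIQUE READS:
                    Uniquely mapped reads % |    76.45%
         % of reads mapped to multiple loci |    18.68%
             % of reads unmapped: too short |    4.75%
"""

HG19_SUMMARY = """
                                UNIQUE READS:
                    Uniquely mapped reads % |    87.40%
         % of reads mapped to multiple loci |    2.36%
             % of reads unmapped: too short |    10.23%
"""

for label, summary in (("hg38", HG38_SUMMARY), ("hg19", HG19_SUMMARY)):
    print(label, mapping_breakdown(summary))
```

Why the multi-mapping fraction is so much higher with the hg38 files is not answered by these numbers alone; they only show which category changed.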

And what does the manual of STAR say about mapping of unambiguous reads? What does the manual of STAR say about the use of a GTF file in reference to mapping? You have read the manual, right?


Nothing about mapping of unambiguous reads!!

The use of a GTF file in reference to mapping is highly recommended!!


It also says something about the use of a GTF file affecting alignments. Also, unambiguous reads are discussed in the manual (e.g. under multimappers). Not my job to read the manual. If you go through it and compare your reference genomes, unmapped reads, where they map in the other reference, etc., I'm sure you'll figure out what's happening. Good luck!
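
One way to start comparing the two reference genomes is to list the sequences each FASTA file contains and see which names appear in only one of them. The following is a rough Python sketch, not a definitive check: the function names are hypothetical, the toy records stand in for the real genome files, and stripping a leading "chr" is only an assumed way to reconcile UCSC-style and Ensembl-style names (it does not handle cases such as chrM versus MT).

```python
import io


def fasta_sequence_lengths(handle):
    """Return {sequence name: length} for a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the identifier, not the free-text description.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def normalize(name):
    """Assumed convention: drop a leading 'chr' so 'chr1' and '1' match."""
    return name[3:] if name.startswith("chr") else name


def compare_references(lengths_a, lengths_b):
    """Return the sorted names found only in A and only in B."""
    names_a = {normalize(n) for n in lengths_a}
    names_b = {normalize(n) for n in lengths_b}
    return sorted(names_a - names_b), sorted(names_b - names_a)


# Toy stand-ins for the two genome files (hypothetical content).
ucsc_like = io.StringIO(">chr1\nACGTACGT\n>chr2\nACGT\n")
ensembl_like = io.StringIO(">1 dna:chromosome\nACGTACGT\n>2\nACGT\n>EXTRA_CONTIG\nACGTAC\n")

only_a, only_b = compare_references(
    fasta_sequence_lengths(ucsc_like),
    fasta_sequence_lengths(ensembl_like),
)
print("only in first:", only_a)
print("only in second:", only_b)
```

With real files, replace the `io.StringIO` objects with `open(path)` handles. Extra sequences in one reference would be a place to look when reads that map uniquely in one build become multi-mappers in the other.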


The STAR paper in Current Protocols in Bioinformatics says: "The gene annotations allow STAR to identify and correctly map spliced alignments across known splice junctions. While it is possible to run the mapping jobs without annotations, it is not recommended. When gene annotations are not available, use the 2-pass mapping"

You could map against hg19 without the annotations and see if the percentage drops accordingly, but that would be an academic exercise.
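
If someone did want to try that comparison, the two runs would differ only in whether `--sjdbGTFfile` is passed. The sketch below builds both command lines in Python using only the flags from the command posted earlier in the thread; the hg19 paths, the output prefixes, and the helper name are hypothetical, and nothing is executed here.

```python
import shlex


def star_align_command(genome_dir, reads, prefix, gtf=None, threads=10):
    """Build a STAR alignment command, optionally with a GTF annotation."""
    cmd = ["STAR", "--genomeDir", genome_dir]
    if gtf is not None:
        cmd += ["--sjdbGTFfile", gtf]
    cmd += [
        "--runThreadN", str(threads),
        "--outSAMstrandField", "intronMotif",
        "--readFilesIn", *reads,
        "--outFileNamePrefix", prefix,
    ]
    return cmd


reads = ["/home/fastq_1", "/home/fastq_2"]

# Hypothetical hg19 paths and prefixes, mirroring the hg38 command in the thread.
with_gtf = star_align_command("/home/hg19", reads, "sample1Star_gtf", gtf="/home/hg19.gtf")
without_gtf = star_align_command("/home/hg19", reads, "sample1Star_nogtf")

print(shlex.join(with_gtf))
print(shlex.join(without_gtf))
```

Each list could be handed to `subprocess.run` on a machine where STAR is installed, and the two resulting summaries compared field by field.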
