Question

Mapping reads to combined mouse and human genome

5.1 years ago

fifty_fifty ▴ 90

I have single-cell RNA-seq reads from a patient-derived xenograft (PDX) tumor, and I want to estimate what fraction of cells contain mouse reads. This is the output I got when I aligned my reads to the human genome:

                             Started job on |   Jun 26 11:25:38
                         Started mapping on |   Jun 26 11:27:02
                                Finished on |   Jun 26 13:02:31
   Mapping speed, Million of reads per hour |   137.19

                      Number of input reads |   218324074
                  Average input read length |   119
                                UNIQUE READS:
               Uniquely mapped reads number |   137430056
                    Uniquely mapped reads % |   62.95%
                      Average mapped length |   92.13
                   Number of splices: Total |   6342135
        Number of splices: Annotated (sjdb) |   5771925
                   Number of splices: GT/AG |   5988664
                   Number of splices: GC/AG |   76096
                   Number of splices: AT/AC |   4963
           Number of splices: Non-canonical |   272412
                  Mismatch rate per base, % |   0.30%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.19
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   15005928
         % of reads mapped to multiple loci |   6.87%
    Number of reads mapped to too many loci |   311785
         % of reads mapped to too many loci |   0.14%
                              UNMAPPED READS:

These are the results of the alignment to the mouse reference:

                             Started job on |   Jun 26 08:29:55
                         Started mapping on |   Jun 26 08:31:18
                                Finished on |   Jun 26 11:14:04
   Mapping speed, Million of reads per hour |   80.48

                      Number of input reads |   218324074
                  Average input read length |   119
                                UNIQUE READS:
               Uniquely mapped reads number |   17341041
                    Uniquely mapped reads % |   7.94%
                      Average mapped length |   92.97
                   Number of splices: Total |   1190336
        Number of splices: Annotated (sjdb) |   1140791
                   Number of splices: GT/AG |   1154502
                   Number of splices: GC/AG |   8153
                   Number of splices: AT/AC |   1012
           Number of splices: Non-canonical |   26669
                  Mismatch rate per base, % |   0.93%
                     Deletion rate per base |   0.02%
                    Deletion average length |   1.46
                    Insertion rate per base |   0.03%
                   Insertion average length |   1.16
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   2503443
         % of reads mapped to multiple loci |   1.15%
    Number of reads mapped to too many loci |   56181
         % of reads mapped to too many loci |   0.03%
                              UNMAPPED READS:

However, I see many recommendations to map the reads to a combined human and mouse reference genome instead. Can somebody explain the difference between mapping separately to each reference genome and mapping to a combined one?

Can I say from the results above that only 8% of reads mapped to the mouse genome?
Can I use just the BAM file from the alignment to the human genome for my further analyses?

I am a newbie to bioinformatics, so I would really appreciate any recommendations or links on what to read to understand the concepts of alignment and the questions I asked above.

thank you!

RNA-Seq alignment STAR rna-seq

Updated 5.1 years ago by GenoMax 152k • written 5.1 years ago by fifty_fifty ▴ 90

score 2 · Accepted Answer · 2020-07-01

Can somebody explain the difference between mapping separately to each genome ref and to combined one?

The problem with aligning data independently to two separate genomes is that reads from sequences that are similar between the two species will map to a genome even though they may not have come from it. Aligning the data to both genomes at the same time will show whether a read multi-maps across (and within) the genomes. This lets you decide how to handle such reads. It can go three ways:

You can be strict and drop the read, since you can't uniquely assign it to one genome.
You can assign it to both genomes and allow it to multi-map.
You can randomly choose one location among all those where the read potentially maps equally well.
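
To make the three choices concrete, here is a minimal Python sketch. It is not how any particular aligner implements this; the function name, the policy labels ("toss", "all", "random") and the input format (a list of genome labels, one per equally good alignment of a read) are assumptions made up for illustration.

```python
import random
from typing import List, Optional


def assign_read(best_hits: List[str], policy: str,
                rng: Optional[random.Random] = None) -> List[str]:
    """Decide which genome(s) a read is assigned to.

    best_hits: hypothetical input, one genome label per equally good
               alignment, e.g. ["human", "human", "mouse"].
    policy:    "toss"   -> drop reads that hit more than one genome
               "all"    -> assign the read to every genome it hits
               "random" -> pick one alignment at random, keep its genome
    """
    genomes = sorted(set(best_hits))
    if len(genomes) <= 1:
        # Unmapped (empty list) or unambiguous at the genome level.
        return genomes
    if policy == "toss":
        return []
    if policy == "all":
        return genomes
    if policy == "random":
        rng = rng or random.Random()
        return [rng.choice(best_hits)]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    ambiguous = ["human", "mouse"]
    for policy in ("toss", "all", "random"):
        print(policy, assign_read(ambiguous, policy))
```

Note that a read hitting several locations within one genome is still unambiguous at the species level here, which is usually what matters when separating graft from host reads.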

The BBMap suite has a tool called bbsplit.sh that does this kind of binning/mapping to multiple genomes in one step and lets you make the decisions noted above. You can find a thread here. The options you should take a look at are ambiguous= and ambiguous2=.

As you were advised in your prior question by @swbarnes2, if this is 10x data you could align to the combined human/mouse reference that 10x provides and use their cellranger software (which in turn uses STAR). This will take care of cell barcodes, UMIs, etc.
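
Once reads are aligned to a combined reference, the per-cell question you asked (which cells carry mouse reads) comes down to counting, for each cell barcode, how many uniquely mapped reads land on each species. A rough Python sketch of that counting step follows. It assumes you have already extracted (barcode, contig name) pairs from the BAM file, and that contig names carry a species prefix. The prefixes, the 0.9 threshold and the function names are placeholders for illustration, not anything cellranger defines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed contig-name prefixes in the combined reference (placeholders).
HUMAN_PREFIX = "GRCh38_"
MOUSE_PREFIX = "mm10_"


def mouse_fraction_per_cell(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, float]:
    """records: (cell_barcode, contig_name) for uniquely mapped reads."""
    counts = defaultdict(lambda: {"human": 0, "mouse": 0})
    for barcode, contig in records:
        if contig.startswith(HUMAN_PREFIX):
            counts[barcode]["human"] += 1
        elif contig.startswith(MOUSE_PREFIX):
            counts[barcode]["mouse"] += 1
        # Reads on contigs with neither prefix are ignored.
    return {
        bc: c["mouse"] / (c["human"] + c["mouse"])
        for bc, c in counts.items()
    }


def call_species(mouse_fraction: float, threshold: float = 0.9) -> str:
    """Label a cell as mouse, human, or mixed (threshold is arbitrary)."""
    if mouse_fraction >= threshold:
        return "mouse"
    if mouse_fraction <= 1.0 - threshold:
        return "human"
    return "mixed"


if __name__ == "__main__":
    toy = [
        ("AAAC", "GRCh38_chr1"), ("AAAC", "GRCh38_chr2"),
        ("AAAC", "mm10___chr3"),
        ("TTTG", "mm10___chr1"), ("TTTG", "mm10___chrX"),
        ("GGGA", "GRCh38_chrM"),
    ]
    for bc, frac in mouse_fraction_per_cell(toy).items():
        print(bc, round(frac, 2), call_species(frac))
```

With real data you would probably count UMIs rather than raw reads and require a minimum number of reads per barcode before calling a cell.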