Mapping reads to combined mouse and human genome
4.4 years ago
fifty_fifty ▴ 70

I have single-cell RNA-seq reads from a patient-derived xenograft tumor, and I want to estimate what fraction of cells contain mouse reads. This is the output I got when I aligned my reads to the human genome:

                             Started job on |   Jun 26 11:25:38
                         Started mapping on |   Jun 26 11:27:02
                                Finished on |   Jun 26 13:02:31
   Mapping speed, Million of reads per hour |   137.19

                      Number of input reads |   218324074
                  Average input read length |   119
                                UNIQUE READS:
               Uniquely mapped reads number |   137430056
                    Uniquely mapped reads % |   62.95%
                      Average mapped length |   92.13
                   Number of splices: Total |   6342135
        Number of splices: Annotated (sjdb) |   5771925
                   Number of splices: GT/AG |   5988664
                   Number of splices: GC/AG |   76096
                   Number of splices: AT/AC |   4963
           Number of splices: Non-canonical |   272412
                  Mismatch rate per base, % |   0.30%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.19
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   15005928
         % of reads mapped to multiple loci |   6.87%
    Number of reads mapped to too many loci |   311785
         % of reads mapped to too many loci |   0.14%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches |   0
     % of reads unmapped: too many mismatches |   0.00%
          Number of reads unmapped: too short |   65565156
               % of reads unmapped: too short |   30.03%
              Number of reads unmapped: other |   11149
                   % of reads unmapped: other |   0.01%
                                 CHIMERIC READS:
                     Number of chimeric reads |   0
                          % of chimeric reads |   0.00%

These are the results of the alignment to the mouse reference:

                             Started job on |   Jun 26 08:29:55
                         Started mapping on |   Jun 26 08:31:18
                                Finished on |   Jun 26 11:14:04
   Mapping speed, Million of reads per hour |   80.48

                      Number of input reads |   218324074
                  Average input read length |   119
                                UNIQUE READS:
               Uniquely mapped reads number |   17341041
                    Uniquely mapped reads % |   7.94%
                      Average mapped length |   92.97
                   Number of splices: Total |   1190336
        Number of splices: Annotated (sjdb) |   1140791
                   Number of splices: GT/AG |   1154502
                   Number of splices: GC/AG |   8153
                   Number of splices: AT/AC |   1012
           Number of splices: Non-canonical |   26669
                  Mismatch rate per base, % |   0.93%
                     Deletion rate per base |   0.02%
                    Deletion average length |   1.46
                    Insertion rate per base |   0.03%
                   Insertion average length |   1.16
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   2503443
         % of reads mapped to multiple loci |   1.15%
    Number of reads mapped to too many loci |   56181
         % of reads mapped to too many loci |   0.03%
                              UNMAPPED READS:

Number of reads unmapped: too many mismatches |   0
     % of reads unmapped: too many mismatches |   0.00%
          Number of reads unmapped: too short |   198421739
               % of reads unmapped: too short |   90.88%
              Number of reads unmapped: other |   1670
                   % of reads unmapped: other |   0.00%
                                 CHIMERIC READS:
                     Number of chimeric reads |   0
                          % of chimeric reads |   0.00%

However, I see many recommendations to map the reads to a combined human and mouse reference genome instead. Can somebody explain the difference between mapping separately to each reference genome and mapping to a combined one?

  1. Can I conclude from the results above that only 8% of the reads map to the mouse genome?
  2. Can I use just the BAM file from the alignment to the human genome for my further analyses?

I am new to bioinformatics, so I would really appreciate any recommendations or links on what to read to understand the concepts of alignment and the questions I asked above.

Thank you!

RNA-Seq alignment STAR rna-seq • 1.8k views
4.4 years ago
GenoMax 147k

Can somebody explain the difference between mapping separately to each reference genome and mapping to a combined one?

The problem with aligning data independently to two separate genomes is that reads from sequences that are similar between the two will map to a genome even though they may not have come from it. Aligning the data to both genomes at the same time will show whether a read multi-maps across (and within) the genomes. This lets you decide how to handle such reads. It can go three ways.

  1. You can be strict and drop the read, since you can't uniquely assign it to one genome.
  2. You can assign it to both genomes and allow it to multi-map.
  3. You can randomly choose one location among all those where the read maps equally well.
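
To make the three choices concrete, here is a minimal Python sketch. It is only an illustration and not how any particular aligner implements this: the input format (a dict from read name to the genome labels of its equally good alignments) and the names `assign_read` and `tally` are made up for the example.

```python
import random
from collections import Counter


def assign_read(hits, policy="drop", rng=None):
    """Return the genome labels a read is assigned to.

    hits   -- genome labels of the read's equally good alignments,
              e.g. ["human"] or ["human", "mouse"] (hypothetical format)
    policy -- "drop":   discard reads that hit more than one genome
              "all":    keep the read for every genome it hits
              "random": pick one of the equally good locations at random
    """
    genomes = sorted(set(hits))
    if len(genomes) <= 1:
        # Unambiguous between genomes (or unmapped, giving an empty list).
        # A read that multi-maps only within one genome ends up here too.
        return genomes
    if policy == "drop":
        return []
    if policy == "all":
        return genomes
    if policy == "random":
        rng = rng or random
        return [rng.choice(hits)]
    raise ValueError(f"unknown policy: {policy!r}")


def tally(reads, policy):
    """Count reads per genome under the given policy."""
    counts = Counter()
    for hits in reads.values():
        for genome in assign_read(hits, policy):
            counts[genome] += 1
    return counts


# Toy data, invented for illustration only.
reads = {
    "r1": ["human"],
    "r2": ["mouse"],
    "r3": ["human", "mouse"],  # ambiguous between the two genomes
    "r4": ["human", "human"],  # multi-maps within human only
    "r5": [],                  # unmapped
}

for policy in ("drop", "all"):
    print(policy, dict(tally(reads, policy)))
```

With this toy input, "drop" counts 2 human and 1 mouse read, while "all" counts 3 human and 2 mouse, because the ambiguous read r3 is counted for both genomes.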

The BBMap suite has a tool called bbsplit.sh that does this kind of binning/mapping to multiple genomes in one step and lets you make the decisions noted above. You can find a thread here. The options to look at are ambiguous= and ambiguous2=.
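
As a rough sketch of what a call might look like, the Python helper below only assembles a bbsplit.sh command line; it does not run anything. The file names and the helper `bbsplit_command` are hypothetical, and the accepted values for ambiguous= and ambiguous2= are listed as I recall them from the tool's usage text, so check them against your installed version before relying on this.

```python
def bbsplit_command(reads, refs, ambiguous="best", ambiguous2="best",
                    basename="out_%.fq", unmapped="unmapped.fq"):
    """Build an argument list for bbsplit.sh (hypothetical helper).

    ambiguous  -- how to treat reads with several top-scoring sites
                  within a reference
    ambiguous2 -- how to treat reads that map equally well to more
                  than one reference
    The value sets below are an assumption taken from memory of the
    bbsplit.sh usage text; verify them locally.
    """
    allowed_ambiguous = {"best", "toss", "random", "all"}
    allowed_ambiguous2 = {"best", "toss", "all", "split"}
    if ambiguous not in allowed_ambiguous:
        raise ValueError(f"ambiguous must be one of {sorted(allowed_ambiguous)}")
    if ambiguous2 not in allowed_ambiguous2:
        raise ValueError(f"ambiguous2 must be one of {sorted(allowed_ambiguous2)}")
    return [
        "bbsplit.sh",
        f"in={reads}",
        f"ref={','.join(refs)}",      # comma-separated reference FASTA files
        f"basename={basename}",       # % is replaced by the reference name
        f"outu={unmapped}",           # reads that match no reference
        f"ambiguous={ambiguous}",
        f"ambiguous2={ambiguous2}",
    ]


# Strict variant: throw away reads that cannot be assigned to one genome.
cmd = bbsplit_command("reads.fq", ["human.fa", "mouse.fa"], ambiguous2="toss")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.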

As @swbarnes2 advised in your prior question, if this is 10x data you could align to the combined human/mouse reference that 10x provides, using their cellranger software (which in turn uses STAR). This will also take care of cell barcodes, UMIs, etc.


Thank you! I used cellranger with the combined mouse+human reference as you advised. The results say that ~9% of reads map to mm10 and ~90% map to hg19. Do you know how to filter out the reads that map to mouse?
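
One possible way to approach the filtering, sketched in Python under stated assumptions: it works per cell barcode rather than per read, and it assumes you have already summed UMI counts per barcode for each genome (for example by grouping genes on a genome prefix in the combined-reference count matrix, which is an assumption about how the reference is named). The barcodes, the counts, the 0.9 cutoff, and the function `classify_barcodes` are all made up for illustration.

```python
def classify_barcodes(counts, min_fraction=0.9):
    """Label each barcode by the genome that dominates its counts.

    counts       -- dict: barcode -> (human_umis, mouse_umis); hypothetical input
    min_fraction -- fraction of a barcode's UMIs that must come from one
                    genome to call it; 0.9 is an arbitrary choice
    """
    calls = {}
    for barcode, (human, mouse) in counts.items():
        total = human + mouse
        if total == 0:
            calls[barcode] = "empty"
        elif human / total >= min_fraction:
            calls[barcode] = "human"
        elif mouse / total >= min_fraction:
            calls[barcode] = "mouse"
        else:
            calls[barcode] = "mixed"  # possible doublet or ambient RNA
    return calls


# Invented example counts: (human UMIs, mouse UMIs) per barcode.
counts = {
    "AAAC": (950, 50),
    "AAAG": (20, 480),
    "AAAT": (300, 200),
    "AACC": (0, 0),
}

calls = classify_barcodes(counts)
human_cells = sorted(b for b, call in calls.items() if call == "human")
mouse_rate = sum(call == "mouse" for call in calls.values()) / len(calls)
print(calls)
print("keep:", human_cells, "mouse cell rate:", mouse_rate)
```

Keeping only the barcodes called "human" removes mouse cells from downstream analysis, and the share of barcodes called "mouse" gives the rate of mouse cells.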
