Question

Creating reference genome for mapping and then selecting

3.7 years ago

Dataminer ★ 2.8k

Dear community,

I will be analyzing PDX RNA-seq data, and from what I could gather I need a couple of things before I start:

A combined reference genome of human and mouse (hg38 and mm10).

How can I generate this from the hg38.fa and mm10.fa files?

Alignment against the combined reference genome using STAR.

What special options do I need to use so that only the reads that map exclusively to hg38 are selected and a gene count can be generated?

Could any of you help me?

Thank you in advance.

mm10 hg38 STAR RNA-seq • 3.3k views

updated 3.7 years ago by Istvan Albert 102k • written 3.7 years ago by Dataminer ★ 2.8k

Crossposted (bad practice to not even indicate it): https://bioinformatics.stackexchange.com/questions/15712/combine-genome-generation-and-alignment-for-pdx-rna-seq

Reply • 3.7 years ago by ATpoint 85k

The post has been removed.

Reply • 3.7 years ago by Dataminer ★ 2.8k

3.7 years ago

GenoMax 148k

You should use tools that can bin the reads by aligning to multiple genomes at the same time.

bbsplit.sh from the BBMap suite (see "BBSplit syntax for generating builds for the reference genome and how to call different builds") and XenofilteR are a couple of examples.

bbsplit.sh lets you handle reads that multi-map (within and across genomes) intelligently via the ambiguous2= option.

ambiguous2=<best>   Set behavior only for reads that map ambiguously to multiple different references.
                    Normal 'ambiguous=' controls behavior on all ambiguous reads;
                    Ambiguous2 excludes reads that map ambiguously within a single reference.
                       best   (use the first best site)
                       toss   (consider unmapped)
                       all   (write a copy to the output for each reference to which it maps)
                       split   (write a copy to the AMBIGUOUS_ output for each reference to which it maps)
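As a rough sketch of how such a call might look for one paired-end sample: the wrapper name `bbsplit_pdx`, the file names, and the choice of `ambiguous2=toss` are assumptions for illustration, not a recommendation from the BBMap documentation. With `toss`, reads that map ambiguously to both genomes are treated as unmapped, as described in the help text above. The function only prints the command unless `RUN=1` is set, so it can be inspected before running.

```shell
# Hypothetical wrapper around bbsplit.sh for one paired-end PDX sample.
# $1 = R1 FASTQ, $2 = R2 FASTQ, $3 = output prefix.
# Prints the command by default; set RUN=1 to execute it (needs BBMap on PATH).
bbsplit_pdx() {
    r1="$1"; r2="$2"; prefix="$3"
    # '%' in basename is replaced by the name of each reference.
    set -- bbsplit.sh \
        in1="$r1" in2="$r2" \
        ref=hg38.fa,mm10.fa \
        basename="${prefix}_%.fq.gz" \
        ambiguous2=toss
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Example (dry run):
#   bbsplit_pdx sample_R1.fq.gz sample_R2.fq.gz sample
```

The human-binned reads can then be aligned to hg38 alone and counted as in any ordinary RNA-seq run.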

score 2 · Accepted Answer · 2021-04-10

3.7 years ago

Istvan Albert 102k

The process is quite straightforward: simply concatenate the reference files, then index the resulting file.

You may need to rename the chromosomes (if the naming is the same for both organisms, e.g. chr1, then name the human chromosomes chr1_hg and so on).
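A minimal sketch of the rename-and-concatenate step, assuming UCSC-style files in which both genomes use names such as chr1. The helper names (`suffix_fasta`, `suffix_gtf`, `combine_genomes`), the `_mm` suffix for mouse, and all file names are made up for illustration. If an annotation is used for gene counting, it needs the same renaming, otherwise its sequence names will no longer match the FASTA.

```shell
# Append a suffix to every sequence ID in a FASTA file; only header lines change.
# $1 = FASTA file, $2 = suffix (e.g. _hg)
suffix_fasta() {
    sed -E "/^>/ s/^>([^[:space:]]+)/>\1$2/" "$1"
}

# Append the same suffix to the sequence-name column (column 1) of a GTF file.
# Comment lines starting with '#' are passed through unchanged.
suffix_gtf() {
    awk -F'\t' -v OFS='\t' -v s="$2" '/^#/ { print; next } { $1 = $1 s; print }' "$1"
}

# Write one combined FASTA: human contigs get _hg, mouse contigs get _mm.
# $1 = human FASTA, $2 = mouse FASTA, $3 = output FASTA
combine_genomes() {
    { suffix_fasta "$1" "_hg"; suffix_fasta "$2" "_mm"; } > "$3"
}

# Example usage (file names are placeholders):
#   combine_genomes hg38.fa mm10.fa combined.fa
#   { suffix_gtf hg38.gtf _hg; suffix_gtf mm10.gtf _mm; } > combined.gtf
#   STAR --runMode genomeGenerate --genomeDir combined_index \
#        --genomeFastaFiles combined.fa --sjdbGTFfile combined.gtf --runThreadN 8
```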

Once you have performed the alignments, you can easily select the uniquely mapped alignments by filtering on the chromosome (and flags).

samtools view -b -q 0 data.bam chr1_hg chr2_hg .... > filtered.bam

This resulting bam file can then be used in any downstream analysis.
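Rather than typing every chromosome name by hand, the list can be read from the BAM header. The sketch below assumes the human contigs carry the `_hg` suffix suggested above; the function names and the default threshold are illustrative choices. Note that `-q 0` keeps every alignment regardless of mapping quality. STAR by default assigns MAPQ 255 to uniquely mapped reads, so a threshold of 255 is one way to keep only those; other aligners use different scales.

```shell
# Print the names of all contigs ending in _hg, given a SAM header on stdin.
human_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/ && $i ~ /_hg$/) print substr($i, 4)
    }'
}

# Keep only alignments on human contigs with MAPQ >= threshold.
# $1 = input BAM, $2 = output BAM, $3 = minimum MAPQ (default 255, STAR's unique value).
# The input must be coordinate-sorted and indexed (samtools index), because
# samtools view needs an index to fetch regions.
filter_human_bam() {
    min_mapq="${3:-255}"
    # Unquoted on purpose: each contig name must become a separate region argument.
    samtools view -b -q "$min_mapq" "$1" $(samtools view -H "$1" | human_contigs) > "$2"
}

# Example usage (file names are placeholders):
#   samtools index data.bam
#   filter_human_bam data.bam filtered.bam
```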

Thank you for the tips. One more question: can something similar be done with Salmon? For example, merge the genome files, index them, and then run the Salmon mapper. But then again, how would I get human-specific counts? Many thanks in advance.

Reply • 3.7 years ago by Dataminer ★ 2.8k

You would get the counts as you would in any other case. How do you know something maps to chromosome 1 of the human genome? Because it is called that way.

There is nothing special about mixing additional chromosomes into your genome. All the aligner does is use the information (data) you give it; it "does not care" that you have two chromosome 1s, one from human and one from mouse.

The only thing that matters is that you are able to tell them apart by name. If you call them both chr1, you won't be able to tell which chr1 belongs to which organism.
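The same select-by-name idea can be sketched for Salmon output. This assumes Salmon was run against a combined human and mouse transcriptome with Ensembl-style transcript IDs, where human names start with ENST and mouse names with ENSMUST; the function names and file names are hypothetical. Salmon's quant.sf is tab-separated with the columns Name, Length, EffectiveLength, TPM and NumReads.

```shell
# Keep the header plus rows whose transcript name starts with a given prefix.
# $1 = quant.sf, $2 = name prefix (default ENST, i.e. human Ensembl transcripts)
filter_quant() {
    awk -F'\t' -v p="${2:-ENST}" 'NR == 1 || index($1, p) == 1' "$1"
}

# Sum the NumReads column (column 5) of a quant.sf-style table read from stdin.
sum_reads() {
    awk -F'\t' 'NR > 1 { n += $5 } END { print n + 0 }'
}

# Example usage (paths are placeholders):
#   filter_quant salmon_out/quant.sf > quant.human.sf
#   filter_quant salmon_out/quant.sf | sum_reads
```

The TPM column in the filtered table was normalized over both species, so it would need rescaling before use; the NumReads column can be used as is.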

Reply • 3.7 years ago by Istvan Albert 102k