Looking for a thorough annotation for non-primary assembly units in GRCm38
0
0
Entering edit mode
7.0 years ago
erwan.scaon ▴ 950

Hello everyone,

We are working with the latest GRC reference genome release for Mus musculus, GRCm38.p6. It contains primary assembly units (chr1, chr2, ...) as well as unlocalized, unplaced scaffolds, alternate loci scaffolds and patches.

While checking for my region of interest (IgH locus), I noticed that there are 3 IgH alternate loci scaffolds.
They are described in the assembly_regions.txt :

Igh 12 113067904 114458697 alt-scaffold GL456068.1 NT_096355.1 BALB/c

Igh 12 113067904 114458697 alt-scaffold GL456078.1 NT_166322.1 129S7/SvEvBrd-Hprt-b-m2

Igh 12 113067904 114458697 alt-scaffold GL456017.2 NT_114985.3 129S1/SvImJ

I am glad they exists, since our IgH locus is 129S1/SvImJ, I can expect better mapping.
We are mapping reads on this reference genome, and counting at the gene level with featureCounts on the BAMs.

Annotation used within featureCounts is GENCODE_M15 "ALL". This GTF/GFF3 is described as :

It contains the comprehensive gene annotation on the reference chromosomes, scaffolds, assembly patches and alternate loci (haplotypes)

When looking into the GENCODE_M15 annotation, I find no record for IgH alt scaffolds (GL456068, GL456078, GL456017).
Ps : I am aware that this annotation represent a lot of hard work and I am not here to complain about it.

Ps : featureCounts need a file for the "chrAliases" option, to "map" GRCm38.p6 sequences names with GENCODE name :

chr1,CM000994.2
...
chrM,AY172335.1

To get a clear view of the discrepancy between GRCm38.p6 & GENCODE_M15, we can create a list of all sequences names contained in GRCm38.p6 (based on assembly_report "GenBank-Accn", removing accession version and applying the same aliases we used with featureCounts, yielding a file list.txt with 239 entries) and check it against the gencode annotation :

 awk -F "\t" '!/^#/ {print $1}' gencode.vM15.chr_patch_hapl_scaff.annotation.gtf > dummy.gtf
 while read line; do printf $line"\t"; grep -ci $line dummy.gtf; done <list.txt >out.tsv;

Result :

chr1    767011
chr7    149746
chr2    149608
...
GL456068    0
GL456078    0
GL456017    0
...
KZ289093    0
KZ289094    0
KZ289095    0

We have 86 sequence with annotations in GENCODE and 153 without (including our IgH alt scaffolds).

Now, the problem is that featureCounts is not going to count reads mapping to these IgH alt scaffolds, since we have no annotations for them. Thus, given my IgH locus (129S1/SvImJ), I can expect a lot of reads to map on GL456017.2 but never count them with featureCounts.

What to do ? :
1) Find a way to get annotations for GL456017.2 ? (I did check vs annotations sources here : https://www.ensembl.org/info/data/ftp/index.html & there : http://vega.archive.ensembl.org/Mus_musculus/Info/Index, but got nothing new for GL456017.2).
2) Remove IgH alt scaffolds from GRCm38.p6 & map against this "new reference" ? (need to check how many mapped reads we are going to lose here).

Ps : I am aware that GENCODE_M15 corresponds to GRCm38.p5, but the IgH alt scaffolds that we are focusing on were already in the GRCm38.p5, so the question is valid.

GRCm38 mus musculus GTF GFF GENCODE • 2.5k views
ADD COMMENT
0
Entering edit mode

You could make your own entries (for the purpose of counting). I assume GL456017.2 has a separate sequence entry in your reference.

ADD REPLY
0
Entering edit mode

You mean creating our own annotation (GTF/GFF) for all the things contained inside the 1390793 nts of GL456017.2 ?
I am not sure if this is doable or not yet.

Edit : Yes, GL456017.2 is a separated entry (sequence).

ADD REPLY
0
Entering edit mode

Since that sequence is present since 2012 and has not been annotated so far you don't have much choice. Are you able to steal annotation from somewhere else (Mouse: Igh 12 113067904 114458697 alt-scaffold GL456017.2 NT_114985.3 129S1/SvImJ) that is chr12:113067904-114458697?

Do you have smaller sections of this sequence you are interested in or the whole region? You could add one entry for the entire sequence to see how many reads are mapped there and work to refine the GTF entry you make up by adding smaller sections. Sorry, easy for me to recommend, but lots of work for you.

ADD REPLY
0
Entering edit mode

To get a glimpse of what we are potentially losing atm with featureCounts given the lack of annotations for IgH alt scaffolds (GL456068, GL456078 & GL456017), I ran samtools (on 20 BAMs from 20 samples) on the GRCm38.p6 region (chr12:113067904-114458697) which have the 3 IgH alt scaffolds :

for f in *.bam; do printf $f"\t"; samtools view -c $f CM001005.2:113067904-114458697; done;

Number of alignments within the primary assembly unit (C57BL/6J strain) : 4,034,926

for f in *.bam; do echo $f; samtools view -c $f GL456017.2 GL456078.1 GL456068.1; done;

Number of alignments within the 3 alt scaffolds : 5,673,089

Ps : 95% of the alignements within the 3 alt scaffolds are on GL456017.2, which is expected given that it is the 129S1/SvImJ strain, the same as our mouses.

That's pretty scary.

Regarding your suggestion, I don't have the biological knowledge to "manually" annotate these alt scaffolds. If the main differences between C57BL/6J & 129S1/SvImJ on the IgH locus are SNPs, it could be fine, just transposing annotation retaining the same positions. But if thing are shifted, it is much harder.

ADD REPLY
0
Entering edit mode

I never said the solution was going to be simple :) You may need to get help from a mouse biologist if you must have finer detail counts for these alt scaffolds.

How are you handling multi-mapping in your data/aligner? Especially since you have multiple sequences that are essentially similar? That is going to affect your counts. Remember to have featureCounts count multi-mappers (which it does not do by default).

ADD REPLY
0
Entering edit mode

The first thing I see myself trying here is removing the 3 IgH alt scaffolds (GL456068, GL456078 & GL456017) from the reference genome.
We suppose that the C57BL/6J primary assembly unit is close enough to the alt scaffolds, allowing reads originating from a 129S1/SvImJ IgH locus to map on the C57BL/6J locus.

ADD REPLY

Login before adding your answer.

Traffic: 2597 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6