Create a custom reference for Macaca fascicularis using Cell Ranger ARC for multiomics
4 weeks ago

Hi everyone. I ran a multiome experiment (single-nucleus RNA sequencing plus single-nucleus ATAC sequencing) on Macaca fascicularis. I need to create a custom reference to map the FASTQ files with Cell Ranger ARC. Following the Cell Ranger ARC instructions, I downloaded the FASTA (primary_assembly) and GTF files for Macaca_fascicularis_6.0 (GCA_011100615.1) from Ensembl. I concatenated all the primary_assembly FASTA files into one file and paired it with the GTF file (Macaca_fascicularis.Macaca_fascicularis_6.0.113.chr.gtf.gz). However, when I finished mapping and started the analysis in Python 3.10, I got an error [screenshot of the error not available]. It seems there are too many duplicated genes.
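Since the error message itself is not shown, one way to test the "duplicated genes" hypothesis is to look for gene names shared by more than one gene ID in the GTF. The following is a minimal sketch, not a fix for the unknown error: the function name and the toy records are hypothetical, and it assumes Ensembl-style `gene_id` and `gene_name` attributes in column 9.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')

def duplicated_gene_names(gtf_lines):
    """Return {gene_name: [gene_ids]} for names shared by several gene IDs."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only inspect gene records
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            ids_by_name[gene_name.group(1)].add(gene_id.group(1))
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

# Hypothetical toy records, not taken from the real annotation.
toy_gtf = [
    '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "ABC";',
    '2\tensembl\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_name "ABC";',
    '3\tensembl\tgene\t500\t600\t.\t+\t.\tgene_id "G3"; gene_name "XYZ";',
]
print(duplicated_gene_names(toy_gtf))  # {'ABC': ['G1', 'G2']}
```

For the real file, the lines could be read with `gzip.open(path, "rt")`. If the duplicates turn out to be gene symbols only, calling `adata.var_names_make_unique()` on the AnnData object may be enough to get past the error, though that depends on what the error actually was.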

I also tried mapping against the entire genome (the toplevel FASTA file). This time there was no error during the analysis in Python 3.10. However, during data exploration I noticed that adata.var lacks one gene that is important for my cells. I then tried mapping against the human genome: more genes were detected, including the important gene that was missing with the Macaca_fascicularis toplevel FASTA file. So I think the mapping against Macaca_fascicularis is not yet optimal.

I would like to ask anyone who has experience building custom reference genomes:

  1. Why do I get an error when I use the primary_assembly FASTA files? Cell Ranger ARC recommends using these files rather than the toplevel file.
  2. How can I maximize the mapping in my case? I cannot use the human reference genome.
-arc. Cellranger multi-omics. Macaca fascicularis.custom-references. • 660 views

"I downloaded the Fasta (primary_assembly) and GTF file of Macaca_fascicularis_6.0 (GCA_011100615.1) in Ensembl."

That does not appear right, since the accession number is from NCBI. Assuming you do want the Ensembl versions of the genome and annotation, note this statement from the Ensembl FTP README:

"If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent."

You should not need to concatenate anything: https://ftp.ensembl.org/pub/release-113/fasta/macaca_fascicularis/dna/Macaca_fascicularis.Macaca_fascicularis_6.0.dna.toplevel.fa.gz

Equivalent GTF file should be this one: https://ftp.ensembl.org/pub/release-113/gtf/macaca_fascicularis/Macaca_fascicularis.Macaca_fascicularis_6.0.113.gtf.gz
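Whichever FASTA/GTF pair is used, the sequence names in column 1 of the GTF need to exist in the FASTA. A rough consistency check could look like the sketch below; the function names and toy records are hypothetical, and it assumes the contig name is the first whitespace-delimited token of each FASTA header.

```python
def fasta_contigs(fasta_lines):
    """Contig names: first whitespace-delimited token of each '>' header."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }

def gtf_contigs(gtf_lines):
    """Sequence names found in column 1 of non-comment GTF lines."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    }

# Hypothetical toy records standing in for the real (uncompressed) files.
toy_fasta = [">1 dna:chromosome", "ACGT", ">MT dna:chromosome", "ACGT"]
toy_gtf = [
    "#!genome-build toy",
    '1\tensembl\tgene\t1\t4\t.\t+\t.\tgene_id "G1";',
    'X\tensembl\tgene\t1\t4\t.\t+\t.\tgene_id "G2";',
]

# Contigs annotated in the GTF but absent from the FASTA.
missing = sorted(gtf_contigs(toy_gtf) - fasta_contigs(toy_fasta))
print(missing)  # ['X']
```

An empty list means every annotated contig has a sequence. The reverse difference (FASTA contigs with no annotation) is usually harmless, for example unplaced scaffolds that carry no genes.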


Thank you so much for your reply. I understand.

So I did use the toplevel FASTA and the GTF file that you mentioned, and, as you said, it is fine to use them rather than concatenating the primary_assembly files.

However, even with the toplevel FASTA file, the mapping was not as good as when I tried the human reference. Do you have any suggestions?


"even with the toplevel fasta files, I could not obtain the optimal mapping as when I tried with human references"

What does this mean? Do you have human data or Macaca data?


I tried with the Macaca data, but the number of genes detected is lower than with the human reference (I tried human just for exploration). One of the well-known markers for my target cell type was absent from the dataset mapped to Macaca, but it was in the gene list of the dataset mapped to the human reference.
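Before concluding that the marker is missing, it may be worth checking how (or whether) it is annotated in the macaque GTF. Non-human annotations often list a gene only under an ID, without the familiar symbol, in which case it may be listed in adata.var under that ID instead. Below is a minimal sketch of such a lookup; the function name, gene names, and toy records are hypothetical.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR = re.compile(r'(\S+) "([^"]*)"')

def find_gene(gtf_lines, query):
    """Return (contig, gene_id, gene_name) for gene records matching query."""
    wanted = query.lower()
    hits = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        gene_name = attrs.get("gene_name")  # None when the symbol is absent
        if wanted in {(gene_id or "").lower(), (gene_name or "").lower()}:
            hits.append((fields[0], gene_id, gene_name))
    return hits

# Hypothetical toy records: one gene has a symbol, the other only an ID.
toy_gtf = [
    '5\tensembl\tgene\t10\t90\t.\t+\t.\tgene_id "TOYG0001"; gene_biotype "protein_coding";',
    '5\tensembl\tgene\t100\t900\t.\t-\t.\tgene_id "TOYG0002"; gene_name "MARKER1";',
]
print(find_gene(toy_gtf, "marker1"))   # [('5', 'TOYG0002', 'MARKER1')]
print(find_gene(toy_gtf, "TOYG0001"))  # [('5', 'TOYG0001', None)]
```

If the symbol returns no hits, the ortholog's ID can be looked up in Ensembl or NCBI Gene and searched for instead.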


https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/inputs/mkref These are the Cell Ranger ARC recommendations for the FASTA and GTF files used to build the reference: "FASTA and GTF files can be downloaded from sites like ENSEMBL and UCSC. The downloaded files are typically compressed. They must be uncompressed in order to process them in subsequent steps. As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended:

For the genome sequence, include all major chromosomes, unplaced and unlocalized scaffolds, but do not include patches and alternative haplotypes.

In Ensembl, the recommended genome file to download is annotated as "primary assembly." In NCBI, it is "no alternative - analysis set." For the GTF file, genes must be annotated with feature type 'exon' in column 3."

Do you know what "no alternative - analysis set" means and how to find it?

I want to try this Macaca reference from NCBI (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_037993035.1/), but the format of the GTF file does not match the format that Cell Ranger ARC requires. I do not know how to find the "no alternative - analysis set."


You can find the NCBI reference GTF in this directory: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/993/035/GCF_037993035.1_T2T-MFA8v1.0/GCF_037993035.1_T2T-MFA8v1.0_genomic.gtf.gz

This file does have exon as a feature type in column 3.

NC_088375.1     Gnomon  gene    47223   53403   .       -       .       gene_id "LOC135971452"; transcript_id ""; db_xref "GeneID:135971452"; description "uncharacterized LOC135971452>
NC_088375.1     Gnomon  transcript      47223   53403   .       -       .       gene_id "LOC135971452"; transcript_id "XR_010587769.1"; db_xref "GeneID:135971452"; experiment "COORDIN>
NC_088375.1     Gnomon  exon    50151   53403   .       -       .       gene_id "LOC135971452"; transcript_id "XR_010587769.1"; db_xref "GeneID:135971452"; experiment "COORDINATES: po>
NC_088375.1     Gnomon  exon    47223   47626   .       -       .       gene_id "LOC135971452"; transcript_id "XR_010587769.1"; db_xref "GeneID:135971452"; experiment "COORDINATES: po>
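The same check can be done programmatically by tallying the feature types in column 3. This is a small sketch: the function name is hypothetical, and the toy lines are abbreviated from the excerpt above (attributes shortened, fields tab-separated as in a real GTF).

```python
from collections import Counter

def feature_type_counts(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

# Abbreviated versions of the records shown above.
toy_gtf = [
    'NC_088375.1\tGnomon\tgene\t47223\t53403\t.\t-\t.\tgene_id "LOC135971452";',
    'NC_088375.1\tGnomon\ttranscript\t47223\t53403\t.\t-\t.\tgene_id "LOC135971452";',
    'NC_088375.1\tGnomon\texon\t50151\t53403\t.\t-\t.\tgene_id "LOC135971452";',
    'NC_088375.1\tGnomon\texon\t47223\t47626\t.\t-\t.\tgene_id "LOC135971452";',
]
counts = feature_type_counts(toy_gtf)
print(counts["exon"])  # 2
```

A count of zero for exon on the real file would mean the annotation does not meet the requirement quoted earlier in the thread.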

You will need to get the corresponding genome here: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/993/035/GCF_037993035.1_T2T-MFA8v1.0/GCF_037993035.1_T2T-MFA8v1.0_genomic.fna.gz
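With the genome and GTF downloaded and uncompressed, the reference is built from a small config file passed to `cellranger-arc mkref --config=<file>`. The sketch below only generates the config text. The organism and genome labels and all paths are placeholders, and the field names follow my understanding of the mkref documentation linked earlier in the thread, so check them against that page before use.

```python
def mkref_config(organism, genome, fasta, gtf):
    """Build the text of a cellranger-arc mkref config file."""
    return (
        "{\n"
        f'    organism: "{organism}"\n'
        f'    genome: ["{genome}"]\n'
        f'    input_fasta: ["{fasta}"]\n'
        f'    input_gtf: ["{gtf}"]\n'
        "}\n"
    )

# Placeholder label and paths; point these at the uncompressed files.
config_text = mkref_config(
    organism="Macaca_fascicularis",
    genome="MFA8v1",
    fasta="/path/to/GCF_037993035.1_T2T-MFA8v1.0_genomic.fna",
    gtf="/path/to/GCF_037993035.1_T2T-MFA8v1.0_genomic.gtf",
)
print(config_text)
# Save this text to a file (for example mkref.config) and run:
#   cellranger-arc mkref --config=mkref.config
```

To my understanding the config also accepts an optional `non_nuclear_contigs` list, which would need the mitochondrial contig name used by this assembly rather than a default such as chrM. That name is not given in the thread, so it is left out here.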

"I tried with Macaca data but the number of genes mapped is lower than with Human"

That may indicate some issue with your data rather than the reference. Perhaps your data has human contamination in it.


Thank you so much!!! I solved the problem.


Can you post/explain the solution? It would be useful for someone who may face a similar issue in future.

If the files I linked above worked then I can move that comment to an answer.
