Basic Guidance on Using 10x Genomics Data for De Novo Assembly
Asked 5 months ago by aleponce • 0

Hi everyone,

I'm a wet lab biologist transitioning to bioinformatics. I've just started working in a lab and one of my first tasks is to handle some unused whole-genome data from a few years ago, generated with 10x Genomics technology. The goal is to perform de novo assembly and eventually annotation.

From my research, the main option for assembly seems to be the 10x Genomics Supernova software. I have a couple of questions:

  1. Are there other programs or open-source tools that can work easily with 10x Genomics data for de novo assembly?
  2. The lab's PCs have at most 128 GB of RAM, but the Supernova documentation recommends at least 256 GB. I'm considering running the assembly on an AWS EC2 instance and then doing the annotation on my local workstation (see the sketch below). Does this approach make sense, or is there a better way to handle it? Any advice or insights would be greatly appreciated; this is my first time doing de novo assembly, working with 10x data, or using AWS.
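
For reference, a hedged sketch of that EC2 route, assuming the AWS CLI is already configured; the AMI ID, key pair, disk size, and instance type are placeholders (r5.16xlarge offers 64 vCPUs and 512 GiB of RAM, which leaves headroom for a 2-3 Gb genome):

# Hypothetical launch of a memory-optimised instance for the Supernova run.
# Replace the AMI, key pair, and volume size with values from your own account.
aws ec2 run-instances \
    --image-id ami-xxxxxxxxxxxxxxxxx \
    --instance-type r5.16xlarge \
    --key-name my-keypair \
    --count 1 \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=2000,VolumeType=gp3}'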

Thanks for the help!

10x Assembly • 1.3k views
Answer by GenoMax:

10x Genomics stopped supporting the "linked read" technology about 4 years ago, so you are going to have to work with Supernova and hope that it still runs on current operating system versions (libraries etc. can be an issue). You can try your luck with your lab computer first; depending on the genome you are working with, 128 GB may be enough. I think the 256 GB recommendation is for a human-sized genome.
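
If you do try it on the 128 GB machine first, Supernova accepts the usual --localcores/--localmem caps; a minimal sketch, with the run ID, FASTQ path, and limits as placeholders (--localmem is the memory budget the pipeline schedules against, not a hard ceiling):

# Hypothetical trial run capped to the lab workstation's resources.
# --fastqs points at the mkfastq/bcl2fastq output directory.
supernova run \
    --id=asm_trial \
    --fastqs=/path/to/10x_fastqs \
    --localcores=16 \
    --localmem=115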

I don't recall whether this data can be used on its own to generate a complete de novo genome assembly; linked reads were mainly meant to improve local regions/phasing of genomes. Hope your data is of good quality.

Reply by aleponce:

Thank you for the answer!

I'm currently in the process of estimating the genome size, but I expect it to be in the 2-3 Gb range. It's good to know that Supernova is the (only) way to go for de novo assembly with 10x Genomics data.

I did find a paper where they used 10x Genomics data and Supernova for de novo assembly, followed by MAKER and other tools for annotation, so at least I know it's possible.

Thanks again!

Answer by sansan96 ▴ 130, 5 months ago:

Hello,

Certainly, as GenoMax says, this technology has been discontinued, and Supernova is the only program for performing the genome assembly.

The computational requirements depend on the size of the genome to be analyzed. I recently analyzed a couple of 10X data sets: for an estimated 600 Mb genome, Supernova needed 250 GB of RAM, while for a 4.5 Gb genome it needed about 650 GB of RAM. The smaller genome was processed in less than 48 hours, while the large one took about three weeks on a server with 20 cores.

So the computational resources you need will depend on the size of the genome. If you want to estimate your genome size, you can process your 10X reads with longranger basic and then run Jellyfish (sketched below).
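
A minimal sketch of that estimate, assuming longranger basic has already written its barcoded.fastq.gz output (the paths, k-mer size, and thread counts are placeholders):

# Hypothetical k-mer counting on the barcode-trimmed reads from longranger basic.
# -C counts canonical k-mers; -s is the initial hash size, adjust it to your data.
jellyfish count -C -m 21 -s 10G -t 16 \
    -o reads.jf \
    <(zcat longranger_run/outs/barcoded.fastq.gz)

# Histogram of k-mer frequencies; this is the input GenomeScope expects for a
# genome size / heterozygosity estimate.
jellyfish histo -t 16 reads.jf > reads.histo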

I hope this can help you a little.

Reply by aleponce:

Thank you, Enrique. It's great to have some reference points.

The genome will probably be in the 2-3 Gb range, so my current setup definitely won't be enough. I'm surprised about the 3 weeks! The Supernova website has some benchmarks, and nothing there went above 3 days or so. What could make it take so long? I'm trying to get a rough idea of how much this would cost on AWS, or whether we should invest in a workstation for the lab.

I was just working on the genome size estimation today, but I'm a bit lost. I'm checking the FASTQ files, but I don't know how to tell whether they were generated with bcl2fastq or with supernova/longranger mkfastq.

Thanks!

Reply by sansan96:

Hello,

I would put it down to the nature of the genome: diploid and highly heterozygous, and also somewhat large. According to the Supernova developers, genomes larger than about 3 Gb are not officially supported (experimental only).

Regarding the estimation of your genome size: I understand your reads are already demultiplexed, so you should be able to run longranger basic without problems. The workflow I followed was to obtain the reads with mkfastq, estimate genome size with longranger basic plus Jellyfish, and then assemble with supernova.

I ran something like this:

longranger basic \
 --id=longranger_plant \
 --fastqs=../10X_chromium_reads/ \
 --localcores=20 \
 --localmem=50 \
 1>longranger-basic.stdout 2>longranger-basic.stderr

The longranger basic output should include a barcoded.fastq.gz file.

You can check this:

https://protocols.faircloth-lab.org/en/latest/protocols-computer/assembly/assembly-scaffolding-with-arks-and-links.html

https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/mkfastq

Reply by aleponce:

That worked, thank you so much! After running longranger basic and then Jellyfish, I used GenomeScope to check the histogram. It seems my genome is about 2.5 Gb, with a heterozygosity of 0.5%.

How does that compare with the one that took 3 weeks for you?

I ask because the maximum run time I can get on my university cluster is 6 days.
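
If the university cluster runs SLURM (an assumption; adapt to whatever scheduler it actually uses), a submission along these lines keeps the job inside the 6-day cap; the partition name, resources, and paths are placeholders:

#!/bin/bash
#SBATCH --job-name=supernova_asm
#SBATCH --partition=bigmem            # hypothetical high-memory partition
#SBATCH --cpus-per-task=32
#SBATCH --mem=512G
#SBATCH --time=6-00:00:00             # the 6-day cluster limit

supernova run \
    --id=asm_sample1 \
    --fastqs=/path/to/10x_fastqs \
    --localcores="$SLURM_CPUS_PER_TASK" \
    --localmem=500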

Reply by sansan96:

That was down to the characteristics of my genome (size, diploid, high heterozygosity). I suggest you run your assembly and see what happens.

Reply by sansan96:

If you have questions, you can ask me and maybe I can help you.

Reply by aleponce:

Thank you for the offer. It seems I was worrying in vain; it finished after 3 days using 32 cores and 512 GB of RAM. This is the Supernova report:

SUMMARY
--------------------------------------------------------------------------------
- software release = 2.1.1(6bb16452a)
- likely sequencers = HiSeq X
- assembly checksum = 7,234,861,339,072,241,140
--------------------------------------------------------------------------------
INPUT
-  757.50 M  = READS                 = number of reads; ideal 800M-1200M for human
-  139.50  b = MEAN READ LEN         = mean read length after trimming; ideal 140
-   38.71  x = RAW COV               = raw coverage; ideal ~56
-   27.61  x = EFFECTIVE COV         = effective read coverage; ideal ~42 for raw 56x
-   66.27  % = READ TWO Q30          = fraction of Q30 bases in read 2; ideal 75-85
-  365.00  b = MEDIAN INSERT         = median insert size; ideal 350-400
-   84.33  % = PROPER PAIRS          = fraction of proper read pairs; ideal >= 75
-    1.00    = BARCODE FRACTION      = fraction of barcodes used; between 0 and 1
-    2.95 Gb = EST GENOME SIZE       = estimated genome size
-    8.41  % = REPETITIVE FRAC       = genome repetitivity index
-    0.07  % = HIGH AT FRACTION      = high AT index
-   40.29  % = ASSEMBLY GC CONTENT   = GC content of assembly
-    2.85  % = DINUCLEOTIDE FRACTION = dinucleotide content
-   20.32 Kb = MOLECULE LEN          = weighted mean molecule size; ideal 50-100
-   16.09    = P10                   = molecule count extending 10 kb on both sides
-  626.00  b = HETDIST               = mean distance between heterozygous SNPs
-    7.29  % = UNBAR                 = fraction of reads that are not barcoded
-  588.00    = BARCODE N50           = N50 reads per barcode
-   14.47  % = DUPS                  = fraction of reads that are duplicates
-   47.04  % = PHASED                = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    8.23 K  = LONG SCAFFOLDS        = number of scaffolds >= 10 kb
-   12.18 Kb = EDGE N50              = N50 edge size
-   58.55 Kb = CONTIG N50            = N50 contig size
-  324.75 Kb = PHASEBLOCK N50        = N50 phase block size
-    1.89 Mb = SCAFFOLD N50          = N50 scaffold size
-    4.09  % = MISSING 10KB          = % of base assembly missing from scaffolds >= 10 kb
-    2.27 Gb = ASSEMBLY SIZE         = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------
ALARMS
- The length-weighted mean molecule length is 20319.36 bases. The molecule
length estimation was successful, however, ideally we would expect a larger
value. Standard methods starting from blood can yield 100 kb or larger DNA, but
it can be difficult to obtain long DNA from other sample types. Short molecules
may reduce the scaffold and phase block N50 length, and could result in
misassemblies. We have observed assembly quality to improve with longer DNA.

I can see several issues (low coverage, the molecule length alert, etc.). Are these fixable? I have another 3 samples of the same organism to run. Could "merging" the information from the 4 samples into some sort of consensus genome compensate for these issues and give a good-quality reference genome?
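
For the downstream steps (BUSCO, annotation), the assembly FASTA is typically exported with supernova mkoutput; a minimal sketch, with the assembly directory and output prefix as placeholders:

# Hypothetical export of the finished assembly to FASTA.
# --style=pseudohap writes a single pseudo-haplotype per scaffold;
# --style=megabubbles keeps both haplotypes where they were phased.
supernova mkoutput \
    --style=pseudohap \
    --asmdir=asm_sample1/outs/assembly \
    --outprefix=sample1_pseudohap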

Reply by sansan96:

Hello, maybe that can work; I have not actually tried it, but it might help you. What I can tell you is that we had very poor results with that type of data: low N50 and very low coverage, as well as high fragmentation. BUSCO values were also poor.

Reply by aleponce:

It seems I got decent BUSCO results:

Input_file   Complete (%)   Single (%)   Duplicated (%)   Fragmented (%)   Missing (%)   Scaffold N50 (bp)   Contig N50 (bp)   Gaps (%)
Sample_1     94.8           93.0         1.8              2.6              2.6           1,852,691           56,533            0.88
Sample_2     94.0           92.1         1.9              3.0              3.0           1,461,438           54,680            0.94
Sample_3     96.4           95.2         1.2              1.8              1.8           4,622,648           59,631            1.17
Sample_4     94.3           92.5         1.8              2.9              2.8           1,233,328           56,451            0.96
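
For reference, a hedged sketch of the kind of genome-mode BUSCO run that would produce numbers like these; the lineage dataset is only an example (pick the odb10 set closest to your organism) and the FASTA name is a placeholder:

# Hypothetical BUSCO run on one assembly; supernova mkoutput typically writes
# gzipped FASTA, so decompress first (or point BUSCO at an uncompressed copy).
gunzip -k sample1_pseudohap.fasta.gz
busco -i sample1_pseudohap.fasta -l embryophyta_odb10 -m genome -c 16 -o busco_sample1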

I've seen some studies where they use RAILS/Cobbler to improve the assembly, but they don't go into much detail. It seems most of the papers I find using 10x data combine it with other data types (Hi-C, long-read data, etc.). Has anybody had experience using just the 10x data and done something to improve the assembly beyond the output that Supernova produces?
