N50 reduces after scaffolding
8.5 years ago
JstRoRR ▴ 60

Hi,

I have a weird situation. I am trying to scaffold a genome assembly using a few Illumina mate-pair libraries. The original assembly has a scaffold N50 of 303 kb, but after scaffolding with SSPACE the N50 drops to about 160 kb. I don't understand why this is happening. What confuses me is that for every mate-pair library the scaffolder reports good numbers under "Satisfied in distance/logic within a given contig pair (pre-scaffold)". Here is the whole scaffolding run log:

READING READS LIB20870:
------------------------------------------------------------
Total inserted pairs = 10654771
------------------------------------------------------------
READING READS LIB20871:
------------------------------------------------------------
 Total inserted pairs = 13697194
------------------------------------------------------------
READING READS LIB20872:
------------------------------------------------------------
Total inserted pairs = 12879817
 ------------------------------------------------------------
READING READS LIB20873:
------------------------------------------------------------
Total inserted pairs = 15300189
------------------------------------------------------------

READING READS LIB20874:
------------------------------------------------------------
Total inserted pairs = 14841054
------------------------------------------------------------

 LIBRARY LIB20870 STATS:
 ################################################################################

 MAPPING READS TO CONTIGS:
 ------------------------------------------------------------
    Number of single reads found on contigs = 9753025
    Number of read-pairs used for pairing contigs / total pairs = 3549531 / 3566951
 ------------------------------------------------------------

  READ PAIRS STATS:
    Assembled pairs: 3549531 (7099062 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 4480 +/-896): 10
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 10729
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 64997
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 3400632
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 73163

  Total satisfied: 3400642        unsatisfied: 148889


    Estimated insert size statistics (based on 10 pairs):
            Mean insert size = 4542
            Median insert size = 4495

    REPEATS:
    Number of repeated edges = 290
    ------------------------------------------------------------

   ################################################################################

   LIBRARY LIB20871 STATS:
  ################################################################################

  MAPPING READS TO CONTIGS:
  ------------------------------------------------------------
    Number of single reads found on contigs = 12290589
    Number of read-pairs used for pairing contigs / total pairs = 4586970 / 4601094
  ------------------------------------------------------------

  READ PAIRS STATS:
    Assembled pairs: 4586970 (9173940 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 11311 +/-2262.2): 1542
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 27522
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 283593
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 3007045
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1267268
            ---
    Total satisfied: 3008587        unsatisfied: 1578383


    Estimated insert size statistics (based on 1542 pairs):
            Mean insert size = 11334
            Median insert size = 12330

    REPEATS:
    Number of repeated edges = 1014
    ------------------------------------------------------------

    ################################################################################

    LIBRARY LIB20872 STATS:
    ################################################################################

   MAPPING READS TO CONTIGS:
   ------------------------------------------------------------
    Number of single reads found on contigs = 10180550
    Number of read-pairs used for pairing contigs / total pairs = 3716099 / 3727919
    ------------------------------------------------------------

   READ PAIRS STATS:
    Assembled pairs: 3716099 (7432198 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10278 +/-2055.6): 7460
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 33861
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 361430
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 2322109
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 991239
            ---
    Total satisfied: 2329569        unsatisfied: 1386530


    Estimated insert size statistics (based on 7460 pairs):
            Mean insert size = 10576
            Median insert size = 10798

   REPEATS:
    Number of repeated edges = 1051
    ------------------------------------------------------------

    ################################################################################

    LIBRARY LIB20873 STATS:
    ################################################################################

    MAPPING READS TO CONTIGS:
    ------------------------------------------------------------
    Number of single reads found on contigs = 10697666
    Number of read-pairs used for pairing contigs / total pairs = 3877539 / 3888155
    ------------------------------------------------------------

    READ PAIRS STATS:
    Assembled pairs: 3877539 (7755078 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 9012 +/-1802.4): 9902
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 37008
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 498990
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 2340096
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 991543
            ---
    Total satisfied: 2349998        unsatisfied: 1527541


    Estimated insert size statistics (based on 9902 pairs):
            Mean insert size = 9690
            Median insert size = 9848

    REPEATS:
    Number of repeated edges = 1331
    ------------------------------------------------------------

   ################################################################################


   LIBRARY LIB20874 STATS:
   ################################################################################

   MAPPING READS TO CONTIGS:
   ------------------------------------------------------------
    Number of single reads found on contigs = 9151596
    Number of read-pairs used for pairing contigs / total pairs = 3228267 / 3237339
   ------------------------------------------------------------

   READ PAIRS STATS:
    Assembled pairs: 3228267 (6456534 sequences)
            Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 7179 +/-1435.8): 2610
            Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 37925
            Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 470137
            ---
            Satisfied in distance/logic within a given contig pair (pre-scaffold): 1645785
            Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1071810
            ---
    Total satisfied: 1648395        unsatisfied: 1579872


    Estimated insert size statistics (based on 2610 pairs):
            Mean insert size = 7843
            Median insert size = 8037


    REPEATS:
    Number of repeated edges = 1382
   ------------------------------------------------------------

   ################################################################################

  SUMMARY:
  ------------------------------------------------------------
    Inserted contig file;
            Total number of contigs = 29194
            Sum (bp) = 235939786
                    Total number of N's = 96400
                    Sum (bp) no N's = 235843386
            GC Content = 39.77%
            Max contig size = 4482245
            Min contig size = 1000
            Average contig size = 8081
            N25 = 783140
            N50 = 303978
            N75 = 11149

    After scaffolding LIB20870:
            Total number of scaffolds = 28752
            Sum (bp) = 237056737
                    Total number of N's = 1213351
                    Sum (bp) no N's = 235843386
            GC Content = 39.77%
            Max scaffold size = 4482245
            Min scaffold size = 1000
            Average scaffold size = 8244
            N25 = 782168
            N50 = 301219
            N75 = 11151

    After scaffolding LIB20871:
            Total number of scaffolds = 26913
            Sum (bp) = 249614734
                    Total number of N's = 13771348
                    Sum (bp) no N's = 235843386
            GC Content = 39.77%
            Max scaffold size = 4482245
            Min scaffold size = 1000
            Average scaffold size = 9274
            N25 = 747146
            N50 = 263694
            N75 = 14044

    After scaffolding LIB20872:
            Total number of scaffolds = 25035
            Sum (bp) = 260210927
                    Total number of N's = 24367541
                    Sum (bp) no N's = 235843386
            GC Content = 39.77%
            Max scaffold size = 4482245
            Min scaffold size = 1000
            Average scaffold size = 10393
            N25 = 727794
            N50 = 218291
            N75 = 15269

    After scaffolding LIB20873:
            Total number of scaffolds = 22574
            Sum (bp) = 273433297
                    Total number of N's = 37590309
                    Sum (bp) no N's = 235842988
            GC Content = 39.77%
            Max scaffold size = 4482245
            Min scaffold size = 1000
            Average scaffold size = 12112
            N25 = 668175
            N50 = 181977
            N75 = 20104

    After scaffolding LIB20874:
            Total number of scaffolds = 20622
            Sum (bp) = 281665096
                    Total number of N's = 45822307
                    Sum (bp) no N's = 235842789
            GC Content = 39.77%
            Max scaffold size = 4482245
            Min scaffold size = 1000
            Average scaffold size = 13658
            N25 = 650995
            N50 = 160160
            N75 = 22056

  ------------------------------------------------------------

Does anyone have a better understanding? Many thanks

SSPACE Scaffolding Discovar Genome Assembly • 3.2k views

It seems to have increased your genome size by roughly 46 Mb (from 235.9 Mb to 281.7 Mb), almost all of which is just N's. Because the total size is now larger, the N50 changes. Are you sure you estimated the library insert sizes correctly? Try redundans; it is a pipeline that uses SSPACE and configures it automatically for you.
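To make the size inflation concrete, here is a small Python sketch that tabulates the figures from the SUMMARY section of the log above. The numbers are copied from the log; the function name `n_report` and the table layout are only illustrative choices.

```python
# Figures copied from the SUMMARY section of the SSPACE log above:
# (stage, total bp, number of N's, N50)
summary = [
    ("input contigs", 235939786, 96400, 303978),
    ("LIB20870", 237056737, 1213351, 301219),
    ("LIB20871", 249614734, 13771348, 263694),
    ("LIB20872", 260210927, 24367541, 218291),
    ("LIB20873", 273433297, 37590309, 181977),
    ("LIB20874", 281665096, 45822307, 160160),
]


def n_report(rows):
    """Return growth over the input assembly and the N fraction for each stage."""
    base_total = rows[0][1]
    report = []
    for stage, total, n_count, n50 in rows:
        report.append({
            "stage": stage,
            "growth_bp": total - base_total,   # bp added relative to the input contigs
            "n_fraction": n_count / total,     # share of the assembly that is N's
            "n50": n50,
        })
    return report


for row in n_report(summary):
    print(f"{row['stage']:<14} +{row['growth_bp']:>11,} bp   "
          f"N's {row['n_fraction']:6.2%}   N50 {row['n50']:,}")
```

The "Sum (bp) no N's" line in the log stays essentially constant across stages, so the growth in total size is almost entirely gap sequence.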


Thanks apelin20. I will try what you suggested.

8.5 years ago
krsahlin ▴ 60

This is most likely happening because SSPACE scaffolds together many of the smaller contigs. Furthermore, (most of) the scaffolds created are smaller than the original N50, i.e. less than 303 kbp, and most likely consist of only a pair, or a few, contigs. The N's added to these smaller scaffolds inflate the total sequence size, which decreases the N50.

Example: Say that you have 11 contigs of size 7, 5, 2, 1, 1, 1, 1, 1, 1, 1, 1. Total contig assembly size: 22. N50: 5. Now, say that after scaffolding, all the contigs of size 1 have been scaffolded into pairs with a gap of size 1, i.e., contig_size_1--gap_size_1--contig_size_1, so that each of these scaffolds has size 3. We now have assembly stats of 7 scaffolds with lengths 7, 5, 3, 3, 3, 3, 2. Total assembly size: 26. N50: 3. So the assembly size has increased and the N50 has decreased.
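The toy example can be checked with a few lines of Python. This is a minimal sketch using the common definition of N50 (the length of the sequence at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total); the function name `n50` is just an illustrative choice.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running sum first reaches half the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")


contigs = [7, 5, 2] + [1] * 8      # before scaffolding
scaffolds = [7, 5, 2] + [3] * 4    # size-1 contigs joined in pairs with a gap of 1

print(sum(contigs), n50(contigs))      # 22 5
print(sum(scaffolds), n50(scaffolds))  # 26 3
```

No sequence got shorter, yet the N50 dropped, because the half-of-total threshold moved from 11 to 13 while the two largest sequences stayed the same.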

This is studied in https://bioinformatics.oxfordjournals.org/content/early/2016/03/09/bioinformatics.btw064.full (our paper).

Whether the gaps introduced are correct for this assembly is difficult to tell. One way to get an intuition is to plot the gap sizes for the scaffolds (after each step, or all together), e.g. with the plot_gaps script (https://github.com/ksahlin/genomics_tools/blob/master/bin/plot_gaps). If all gaps for a given library are close to that library's insert size (e.g., gaps close to 3 kbp for the 3 kbp library, and so on), this can suggest that the read pairs used to link contigs are placed only at the ends of the contigs, which is an indicator of unreliable links. I do want to emphasize that this is only an indicator: it might be that a really abundant repeat of this size is creating lots of "holes" of that size in the scaffolds, although that is probably less likely. Another way to evaluate the gaps is to run other scaffolders and see whether the gap profile looks similar.
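For a quick look at the gap-size distribution without plotting, something like the following Python sketch could be used. It is not the plot_gaps script linked above; it simply assumes that gaps appear as runs of N (or n) in the scaffold FASTA, and the names `read_fasta`, `gap_lengths`, `bin_gaps` and the file name `scaffolds.fasta` are all hypothetical.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_lengths(lines, min_gap=1):
    """Return the length of every run of N's that is at least min_gap long."""
    gap = re.compile(r"[Nn]+")
    lengths = []
    for _, seq in read_fasta(lines):
        for match in gap.finditer(seq):
            size = match.end() - match.start()
            if size >= min_gap:
                lengths.append(size)
    return lengths


def bin_gaps(lengths, bin_size=1000):
    """Count gaps per bin; each key is the lower edge of a bin_size-wide bin."""
    return Counter((size // bin_size) * bin_size for size in lengths)


# Tiny in-memory demo; for a real run, pass an open file instead, e.g.
#   with open("scaffolds.fasta") as fh:   # hypothetical file name
#       gaps = gap_lengths(fh)
demo = ">s1\nACGTNNNNAC\nGTNN\nNNACGT\n>s2\nACGT\n"
gaps = gap_lengths(demo.splitlines())
print(gaps)                        # [4, 4]
print(sorted(bin_gaps(gaps).items()))
```

Running this on the scaffolds written after each library and comparing the most populated bins with that library's insert size gives the same kind of indication described above: a sharp peak near the insert size would be a warning sign, not proof, of unreliable links.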

A very interesting scaffolder, recently published, is OPERA-LG, the update of OPERA (http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y), which can place repetitive sequence in multiple locations, possibly removing some of the N's. May I also be so bold as to suggest our own scaffolder, BESST, which has a recently published update (link given above). It usually puts far fewer N's into scaffolds and does not seem to inflate assembly size to the same extent as SSPACE.
