Hi,
I have a weird situation. I am trying to scaffold a genome assembly using a few Illumina mate-pair libraries. The scaffold N50 of the original assembly is 303 kb, but after scaffolding with SSPACE the N50 drops to about 160 kb. I don't understand why this is happening. What confuses me is that for every mate-pair library the scaffolder reports a large number of pairs under "Satisfied in distance/logic within a given contig pair (pre-scaffold)". Here is the whole scaffolding run log:
READING READS LIB20870:
------------------------------------------------------------
Total inserted pairs = 10654771
------------------------------------------------------------
READING READS LIB20871:
------------------------------------------------------------
Total inserted pairs = 13697194
------------------------------------------------------------
READING READS LIB20872:
------------------------------------------------------------
Total inserted pairs = 12879817
------------------------------------------------------------
READING READS LIB20873:
------------------------------------------------------------
Total inserted pairs = 15300189
------------------------------------------------------------
READING READS LIB20874:
------------------------------------------------------------
Total inserted pairs = 14841054
------------------------------------------------------------
LIBRARY LIB20870 STATS:
################################################################################
MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 9753025
Number of read-pairs used for pairing contigs / total pairs = 3549531 / 3566951
------------------------------------------------------------
READ PAIRS STATS:
Assembled pairs: 3549531 (7099062 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 4480 +/-896): 10
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 10729
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 64997
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 3400632
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 73163
Total satisfied: 3400642 unsatisfied: 148889
Estimated insert size statistics (based on 10 pairs):
Mean insert size = 4542
Median insert size = 4495
REPEATS:
Number of repeated edges = 290
------------------------------------------------------------
################################################################################
LIBRARY LIB20871 STATS:
################################################################################
MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 12290589
Number of read-pairs used for pairing contigs / total pairs = 4586970 / 4601094
------------------------------------------------------------
READ PAIRS STATS:
Assembled pairs: 4586970 (9173940 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 11311 +/-2262.2): 1542
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 27522
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 283593
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 3007045
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1267268
---
Total satisfied: 3008587 unsatisfied: 1578383
Estimated insert size statistics (based on 1542 pairs):
Mean insert size = 11334
Median insert size = 12330
REPEATS:
Number of repeated edges = 1014
------------------------------------------------------------
################################################################################
LIBRARY LIB20872 STATS:
################################################################################
MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 10180550
Number of read-pairs used for pairing contigs / total pairs = 3716099 / 3727919
------------------------------------------------------------
READ PAIRS STATS:
Assembled pairs: 3716099 (7432198 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 10278 +/-2055.6): 7460
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 33861
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 361430
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 2322109
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 991239
---
Total satisfied: 2329569 unsatisfied: 1386530
Estimated insert size statistics (based on 7460 pairs):
Mean insert size = 10576
Median insert size = 10798
REPEATS:
Number of repeated edges = 1051
------------------------------------------------------------
################################################################################
LIBRARY LIB20873 STATS:
################################################################################
MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 10697666
Number of read-pairs used for pairing contigs / total pairs = 3877539 / 3888155
------------------------------------------------------------
READ PAIRS STATS:
Assembled pairs: 3877539 (7755078 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 9012 +/-1802.4): 9902
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 37008
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 498990
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 2340096
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 991543
---
Total satisfied: 2349998 unsatisfied: 1527541
Estimated insert size statistics (based on 9902 pairs):
Mean insert size = 9690
Median insert size = 9848
REPEATS:
Number of repeated edges = 1331
------------------------------------------------------------
################################################################################
LIBRARY LIB20874 STATS:
################################################################################
MAPPING READS TO CONTIGS:
------------------------------------------------------------
Number of single reads found on contigs = 9151596
Number of read-pairs used for pairing contigs / total pairs = 3228267 / 3237339
------------------------------------------------------------
READ PAIRS STATS:
Assembled pairs: 3228267 (6456534 sequences)
Satisfied in distance/logic within contigs (i.e. -> <-, distance on target: 7179 +/-1435.8): 2610
Unsatisfied in distance within contigs (i.e. distance out-of-bounds): 37925
Unsatisfied pairing logic within contigs (i.e. illogical pairing ->->, <-<- or <-->): 470137
---
Satisfied in distance/logic within a given contig pair (pre-scaffold): 1645785
Unsatisfied in distance within a given contig pair (i.e. calculated distances out-of-bounds): 1071810
---
Total satisfied: 1648395 unsatisfied: 1579872
Estimated insert size statistics (based on 2610 pairs):
Mean insert size = 7843
Median insert size = 8037
REPEATS:
Number of repeated edges = 1382
------------------------------------------------------------
################################################################################
SUMMARY:
------------------------------------------------------------
Inserted contig file;
Total number of contigs = 29194
Sum (bp) = 235939786
Total number of N's = 96400
Sum (bp) no N's = 235843386
GC Content = 39.77%
Max contig size = 4482245
Min contig size = 1000
Average contig size = 8081
N25 = 783140
N50 = 303978
N75 = 11149
After scaffolding LIB20870:
Total number of scaffolds = 28752
Sum (bp) = 237056737
Total number of N's = 1213351
Sum (bp) no N's = 235843386
GC Content = 39.77%
Max scaffold size = 4482245
Min scaffold size = 1000
Average scaffold size = 8244
N25 = 782168
N50 = 301219
N75 = 11151
After scaffolding LIB20871:
Total number of scaffolds = 26913
Sum (bp) = 249614734
Total number of N's = 13771348
Sum (bp) no N's = 235843386
GC Content = 39.77%
Max scaffold size = 4482245
Min scaffold size = 1000
Average scaffold size = 9274
N25 = 747146
N50 = 263694
N75 = 14044
After scaffolding LIB20872:
Total number of scaffolds = 25035
Sum (bp) = 260210927
Total number of N's = 24367541
Sum (bp) no N's = 235843386
GC Content = 39.77%
Max scaffold size = 4482245
Min scaffold size = 1000
Average scaffold size = 10393
N25 = 727794
N50 = 218291
N75 = 15269
After scaffolding LIB20873:
Total number of scaffolds = 22574
Sum (bp) = 273433297
Total number of N's = 37590309
Sum (bp) no N's = 235842988
GC Content = 39.77%
Max scaffold size = 4482245
Min scaffold size = 1000
Average scaffold size = 12112
N25 = 668175
N50 = 181977
N75 = 20104
After scaffolding LIB20874:
Total number of scaffolds = 20622
Sum (bp) = 281665096
Total number of N's = 45822307
Sum (bp) no N's = 235842789
GC Content = 39.77%
Max scaffold size = 4482245
Min scaffold size = 1000
Average scaffold size = 13658
N25 = 650995
N50 = 160160
N75 = 22056
------------------------------------------------------------
Does anyone have a better understanding of what is going on? Many thanks.
It seems to have increased your genome size by about 46 Mb (from roughly 236 Mb to 282 Mb), nearly all of which is just runs of N. Because the total assembly size is now larger, the N50 changes. Are you sure you estimated the library insert sizes correctly? Try redundans: it is a pipeline that uses SSPACE and configures it automatically for you.
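To make the N50 point concrete, here is a minimal Python sketch. The `n50` helper and the toy contig lengths are my own illustration, not SSPACE code and not your data; only the two totals at the end are taken from your log. It shows that joining small contigs across large gaps can lower the N50 even though no sequence gets shorter, because the half-of-total cutoff moves further down the sorted length list.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no sequences given")


# Hypothetical toy assembly: two large contigs and eight small ones.
contigs = [100, 60] + [10] * 8

# Join the small contigs pairwise, padding each join with a 30 bp gap of N's.
gap = 30
scaffolds = [100, 60] + [10 + gap + 10] * 4

print("before:", n50(contigs), "total:", sum(contigs))
print("after: ", n50(scaffolds), "total:", sum(scaffolds))

# Share of N's in the final assembly, using the totals reported in the log
# ("Total number of N's" and "Sum (bp)" after LIB20874).
gap_fraction = 45822307 / 281665096
print(f"N's in final scaffolds: {gap_fraction:.1%}")
```

In the toy case the total grows from 240 to 360 while the N50 falls from 60 to 50. If the gaps in the real run are overestimated because the insert sizes are off, the same effect would apply, which is why it is worth checking the library sizes first.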
Thanks apelin20. I will try what you suggested.