Hi all,
I have a DISCOVAR de novo assembly that was made with 2x250 bp paired end reads with a fragment size of ~800. Since I also have a couple 2x150 bp mate pair libraries (insert sizes 1.5 kb and 4 kb), I now want to scaffold the DDN assembly. To do this I am trying SSPACE, but I also want to try the ABySS pipeline (everything after the ABYSS/ABYSS-P command). Everything works up until the first mapping step:
AdjList -v -k200 -m50 --dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.dot
abyss-filtergraph -v --dot -k200 -g SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.dot SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-1.path
MergeContigs -v -k200 -g SXPX_DDN_BROKEN-2.dot -o SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-1.fa SXPX_DDN_BROKEN-2.dot1 SXPX_DDN_BROKEN-1.path
PopBubbles -v --dot -j20 -k200 -p0.9 -g SXPX_DDN_BROKEN-3.dot SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot >SXPX_DDN_BROKEN-2.path
MergeContigs -v -k200 -o SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-2.fa SXPX_DDN_BROKEN-2.dot SXPX_DDN_BROKEN-2.path
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
SXPX_DDN_BROKEN-2.path SXPX_DDN_BROKEN-1.fa >SXPX_DDN_BROKEN-indel.fa
ln -sf SXPX_DDN_BROKEN-3.fa SXPX_DDN_BROKEN-unitigs.fa
abyss-map -v -j20 -l200 ../02_Separated_Pairs/Library.SXPX.L.fq.gz ../02_Separated_Pairs/Library.SXPX.R.fq.gz SXPX_DDN_BROKEN-3.fa \
|abyss-fixmate -v -l200 -h pe1-3.hist \
|sort -snk3 -k4 \
|DistanceEst -v -j20 -k200 -l200 -s500 -n10 -o pe1-3.dist pe1-3.hist
Here is the corresponding output:
Reading `SXPX_DDN_BROKEN-1.fa'...
Finding overlaps of exactly k-1 bp...
V=353130 E=130474 E/V=0.369
Degree: █▂
01234
0: 70% 1: 24% 2-4: 6.2% 5+: 0% max: 4
Finding overlaps of fewer than k-1 bp...
V=353130 E=168079956 E/V=476
Degree: ▅█▁
01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Loading graph from file: SXPX_DDN_BROKEN-1.dot
Graph stats before:
V=353130 E=168079956 E/V=476
Degree: ▅█▁
01234
0: 20% 1: 30% 2-4: 14% 5+: 37% max: 8203
Removing shim contigs from the graph...
Pass 1: Checking 9670 contigs.
Pass 2: Checking 17 contigs.
Shim removal stats:
Removed: 4498 Too Complex: 55937 Tails: 55781 Too Long: 59860 Self Adjacent: 488 Parallel Edges: 72
Reading `SXPX_DDN_BROKEN-1.fa'...
Edge removal stats:
Removed: 0
Graph stats after:
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot1'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-1.fa'...
Read 172066 sequences. Using 2.18 GB of memory.
Reading `SXPX_DDN_BROKEN-1.path'...
Read 0 paths. Using 2.18 GB of memory.
Writing `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
V=344132 E=168070667 E/V=488
Degree: ▆█▂_
01234
0: 20% 1: 28% 2-4: 14% 5+: 38% max: 8203
Reading `SXPX_DDN_BROKEN-2.fa'...
Bubbles: 534 Popped: 388 Scaffolds: 0 Complex: 85 Too long: 0 Too many: 45 Dissimilar: 16
V=294910 E=168020663 E/V=570
Degree: █▆▃__
01234
0: 23% 1: 17% 2-4: 16% 5+: 44% max: 8203
Reading `SXPX_DDN_BROKEN-2.dot'...
Read 344132 vertices. Using 1.86 GB of memory.
Reading `SXPX_DDN_BROKEN-2.fa'...
Read 172066 sequences. Using 2.08 GB of memory. Reading `SXPX_DDN_BROKEN-2.path'...
Read 20613 paths. Using 2.09 GB of memory.
The minimum coverage of single-end contigs is inf.
The minimum coverage of merged contigs is inf.
n n:200 L50 min N80 N50 N20 E-size max sum name
147455 147455 12083 300 751 3925 10323 34989 1317114 207.5e6 SXPX_DDN_BROKEN-3.fa
Reading from standard input...
Reading `SXPX_DDN_BROKEN-3.fa'...
Using 18.6 MB of memory and 126 B/sequence.
Reading `SXPX_DDN_BROKEN-3.fa'...
Building the suffix array...
Building the Burrows-Wheeler transform...
Building the character occurrence table...
Read 212 MB in 147455 contigs.
Using 1.88 GB of memory and 8.87 B/bp.
Read 15 alignments. Hash load: 5 / 8 = 0.625 using 975 kB.
Read 131 alignments. Hash load: 9 / 16 = 0.5625 using 975 kB.
Read 1000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
Read 2000000 alignments. Hash load: 4 / 16 = 0.25 using 975 kB.
[............]
Read 499000000 alignments. Hash load: 4 / 32 = 0.125 using 1.22 MB.
Mapped 302150492 of 499073402 reads (60.5%)
Mapped 299438060 of 499073402 reads uniquely (60%)
Read 499073402 alignments
Mateless 0
Unaligned 56417991 22.6%
Singleton 84086928 33.7%
FR 92547076 37.1%
RF 7349 0.00295%
FF 23043 0.00923%
Different 16454314 6.59%
Total 249536701
FR Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096 ignored: 14680
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Mate orientation FR: 92547076 (100%) RF: 7349 (0.00794%)
The library pe1-3.hist is oriented forward-reverse (FR).
Stats mean: 740.3 median: 796 sd: 149.4 n: 92539745 min: 227 max: 1096
_____________▁▁▁▁▁▁▁▂▂▂▂▂▃▃▄▅▇███▇▆▅▃▂▁_
Minimum and maximum distance are set to -199 and 1096 bp.
error: input must be sorted: saw `flattened_line_137693_1' before `flattened_line_86968_1'
Everything after this point does not work properly because the file "pe1-3.dist" is empty. Clearly, I think the input for DistanceEst is not properly sorted by the "sort" command, which likely has to do with the difference between the fasta sequence names in the DDN output (e.g. >flattened_line_0_1
) versus ABYSS output (e.g. >0 100 315
).
Can I modify the sort command some how to work properly with the DDN fasta names? Or should I rename all the fasta sequences in the DDN output to something more similar to the ABYSS format? I'm not sure how exactly the "sort" command is working, so I'm not sure how the input needs to be formatted. Any suggestions are appreciated.
Best,
Dan
I suppose it's worth mentioning that the version of ABySS that I am using to do this was compiled with
--enable-maxk=256
so using k=200 isn't a problem.