Assembly with mate-pairs issue
8.6 years ago

hi,

I'm conducting an assembly of a genome, and during the testing phase (different inputs, k-mer sizes and such) I made an observation that worries me a little. Here is the situation: I have 2 assembly results using the same paired-end and single-end read input data set, but in one of them I added mate-pair data to get (better) scaffolding. According to the progress log, both runs finished without any issues. When I compare the stats, however, the run without the mate-pair data gives noticeably better scaffold stats (compare the Test-scaffolds.fa rows below).

run1 (without mate pair):

n       n:500   L50     min     N80     N50     N20     E-size  max    sum     name
75.35e6 328297  142964  500     525     580     698     632     3857    198.7e6 Test-unitigs.fa
75.33e6 330804  143107  500     525     582     709     649     9353    202.1e6 Test-contigs.fa
75.33e6 330732  141848  500     525     583     711     657     19614   202.5e6 Test-scaffolds.fa

run2 (with mate-pair info)

n       n:500   L50     min     N80     N50     N20     E-size  max     sum     name
75.35e6 328297  142964  500     525     580     698     632     3857    198.7e6 Test-unitigs.fa
75.33e6 330804  143107  500     525     582     709     649     9353    202.1e6 Test-contigs.fa
75.33e6 330804  143107  500     525     582     709     649     9353    202.1e6 Test-scaffolds.fa

From what I can see, it looks like the second run did not do any scaffolding at all: its scaffold stats are identical to its contig stats. The mate-pairs I'm using are derived from 454 data. I understand that the 454 mate-pairs may not add much information (although, judging by the alignment results for those libraries, they should have added some), but what I do not understand is why I don't get at least the result I got from using the paired-end data alone.
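
One quick way to check this is to compare the contig and scaffold rows of the stats table with the file name column dropped. A minimal sketch, assuming the stats are whitespace-separated text like the tables above; the helper name same_stats is made up for illustration, and identical summary stats only suggest (they do not prove) that scaffolding changed nothing:

```shell
# Hypothetical helper: succeeds if two stats rows are identical
# apart from the last column (the file name).
same_stats() {
    row_a=$(printf '%s\n' "$1" | awk '{ $NF = ""; print }')
    row_b=$(printf '%s\n' "$2" | awk '{ $NF = ""; print }')
    [ "$row_a" = "$row_b" ]
}

# Rows copied from run2 above.
contigs='75.33e6 330804  143107  500     525     582     709     649     9353    202.1e6 Test-contigs.fa'
scaffolds='75.33e6 330804  143107  500     525     582     709     649     9353    202.1e6 Test-scaffolds.fa'

if same_stats "$contigs" "$scaffolds"; then
    echo "scaffold stats match contig stats: scaffolding appears to have changed nothing"
else
    echo "scaffold stats differ from contig stats"
fi
```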

Is it possible that ABySS, when mate-pair data is provided, does not use the paired-end data for scaffolding at all (and thus only uses the mate-pair info)?

Assembly abyss mate-pairs • 2.9k views

Is it possible that ABySS, when mate-pair data is provided, does not use the paired-end data for scaffolding at all (and thus only uses the mate-pair info)?

Yes, that is possible. Please report the exact command line that you used for both assemblies.


Here is the one for the run with the MP data:

abyss-pe np=50 -C k$1 name=ppinTk$1 k=$1 \
lib='D1MNAACXX_2 D1MNAACXX_3 D1MNAACXX_6 D1MNAACXX_7 D1MNAACXX_8' \
mp='C1MLEACXX_3 C1MLEACXX_4' \
D1MNAACXX_2="$inDIR/D1MNAACXX_2_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_2_0_2.cleanTrim.fq.gz" \
D1MNAACXX_3="$inDIR/D1MNAACXX_3_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_3_0_2.cleanTrim.fq.gz" \
D1MNAACXX_6="$inDIR/D1MNAACXX_6_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_6_0_2.cleanTrim.fq.gz" \
D1MNAACXX_7="$inDIR/D1MNAACXX_7_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_7_0_2.cleanTrim.fq.gz" \
D1MNAACXX_8="$inDIR/D1MNAACXX_8_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_8_0_2.cleanTrim.fq.gz" \
se="$inDIR/D1MNAACXX_2_0.singl.merged.fq.gz $inDIR/D1MNAACXX_3_0.singl.merged.fq.gz $inDIR/D1MNAACXX_6_0.singl.merged.fq.gz $inDIR/D1MNAACXX_7_0.singl.merged.fq.gz $inDIR/D1MNAACXX_8_0.singl.merged.fq.gz" \
C1MLEACXX_3="$inDIR/MP_illumina/C1MLEACXX_3_0_1.clean.fq.gz $inDIR/MP_illumina/C1MLEACXX_3_0_2.clean.fq.gz" \
C1MLEACXX_4="$inDIR/MP_illumina/C1MLEACXX_4_0_1.clean.fq.gz $inDIR/MP_illumina/C1MLEACXX_4_0_2.clean.fq.gz"

and here is the one without MP data:

abyss-pe np=50 -C k$1 name=ppinTk$1 k=$1 \
lib='D1MNAACXX_2 D1MNAACXX_3 D1MNAACXX_6 D1MNAACXX_7 D1MNAACXX_8' \
D1MNAACXX_2="$inDIR/D1MNAACXX_2_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_2_0_2.cleanTrim.fq.gz" \
D1MNAACXX_3="$inDIR/D1MNAACXX_3_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_3_0_2.cleanTrim.fq.gz" \
D1MNAACXX_6="$inDIR/D1MNAACXX_6_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_6_0_2.cleanTrim.fq.gz" \
D1MNAACXX_7="$inDIR/D1MNAACXX_7_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_7_0_2.cleanTrim.fq.gz" \
D1MNAACXX_8="$inDIR/D1MNAACXX_8_0_1.cleanTrim.fq.gz $inDIR/D1MNAACXX_8_0_2.cleanTrim.fq.gz" \
se="$inDIR/D1MNAACXX_2_0.singl.merged.fq.gz $inDIR/D1MNAACXX_3_0.singl.merged.fq.gz $inDIR/D1MNAACXX_6_0.singl.merged.fq.gz $inDIR/D1MNAACXX_7_0.singl.merged.fq.gz $inDIR/D1MNAACXX_8_0.singl.merged.fq.gz"

I mainly noticed the difference because, for the run with the MP data, I don't see any alignments of the PE files against the intermediate assembly, only abyss-map output for the MP files.

6.9 years ago
Shaun Jackman ▴ 420

The default value of mp is pe, and the default value of pe is lib. You want to use
mp='D1MNAACXX_2 D1MNAACXX_3 D1MNAACXX_6 D1MNAACXX_7 D1MNAACXX_8 C1MLEACXX_3 C1MLEACXX_4'.
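
The fallback described here can be mimicked with shell parameter expansion. This is only a sketch of the stated defaults (mp falls back to pe, pe falls back to lib), not ABySS's own code; resolve_libs is a hypothetical helper and the library lists are shortened for readability:

```shell
# Hypothetical helper mimicking the defaults described above:
# pe falls back to lib, and mp falls back to pe.
# Arguments: lib, pe, mp (pass '' for a parameter that is not set).
resolve_libs() {
    lib=$1
    pe=${2:-$lib}
    mp=${3:-$pe}
    printf 'lib=%s\npe=%s\nmp=%s\n' "$lib" "$pe" "$mp"
}

# Without mp, the paired-end libraries end up in mp as well.
resolve_libs 'D1MNAACXX_2 D1MNAACXX_3' '' ''

# With mp set explicitly, mp holds only the libraries listed there,
# so the paired-end libraries are no longer part of it.
resolve_libs 'D1MNAACXX_2 D1MNAACXX_3' '' 'C1MLEACXX_3 C1MLEACXX_4'
```

This matches the observation in the question: once mp was given, only the MP files were aligned for scaffolding.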

Do you have five different paired-end libraries, or five lanes of one library? If it's the latter, you ought to put all the lanes in a single library. Ditto for your 454 data.

lib='pe1'
mp='pe1 mp1'
pe1='ALL_ILLUMINA_FQ_GZ_FILES'
mp1='ALL_454_FQ_GZ_FILES'
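
Assuming the five Illumina lanes do come from one library, the pe1 file list could be built with a loop rather than typed out. A sketch only: the file name pattern is taken from the commands earlier in the thread, inDIR is set to a placeholder path, and lane_files is a made-up helper. The result is echoed rather than passed to abyss-pe:

```shell
inDIR=/path/to/reads   # placeholder, adjust to your setup

# Hypothetical helper: print both read files for every lane given,
# separated by spaces (with one trailing space).
lane_files() {
    for lane in "$@"; do
        printf '%s/%s_0_1.cleanTrim.fq.gz %s/%s_0_2.cleanTrim.fq.gz ' \
            "$inDIR" "$lane" "$inDIR" "$lane"
    done
}

pe1=$(lane_files D1MNAACXX_2 D1MNAACXX_3 D1MNAACXX_6 D1MNAACXX_7 D1MNAACXX_8)
pe1=${pe1% }   # drop the trailing space

# Show the parameters that would be handed to abyss-pe.
echo "lib='pe1' pe1='$pe1'"
```

Note that the files stay separate on disk; they are only listed under one library name.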

ABySS has not been tested with 454 data. It may be possible to use it, but it may require some experimentation. The mates must be oriented either forward-reverse (like paired-end) or reverse-forward (like mate-pair), but not forward-forward for ABySS.

Cheers,
Shaun


Thx for the reply.

Ah, good question, I'll check (if I have access to that level of detail). This is nonetheless only a subset of my total input data (I'm currently using it to optimise the k-mer size etc. and to test the protocol). I certainly have mate-pair data from different libraries (and to some extent that is also true for the paired-end data, though I'm not sure which lanes might be derived from the same library).

So if I understand correctly, I need to repeat the pe library names in 'mp' as well if I want to include them in the scaffolding stage, right? If no mp data is given, ABySS will by default fall back on the pe data for the scaffolding stage. That would indeed explain my observation.

Yes, I'm aware that 454 data is not ideal, and I only have very little coverage of it, so perhaps I should just omit it ... I'll evaluate it.


Hi Shaun,

OK, this works perfectly as expected! Before I close this thread as resolved, one small additional question. You recommend concatenating all the data from a single library (potentially from multiple lanes) into a single input file. Can you elaborate on why? Is there a technical reason? I can't immediately see an advantage of putting them into a single file over x separate files.

thx
