Question

Assembly reads from only one file

0

Entering edit mode

9.6 years ago

nx8 ▴ 20

Hi,

I have about 10.000 files each containing reads that belong to haplotypes of different individuals and different genes. I would like to assemble all of them separately. Is there a way to do assemblies file by file without having to create separate directories? Or any other fast method? They are all quite small.

Many thanks!

Assembly fastq • 2.8k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.6 years ago by nx8 ▴ 20

Ram · Answer 1 · 2016-01-07

1

Entering edit mode

9.6 years ago

Brian Bushnell 20k

For haplotypes, the quickest method would be to write a loop in Bash, iterating over all files with Tadpole (from BBMap):

tadpole.sh in=reads.fq out=contigs.fa

...where reads.fq and contigs.fa are replaced by variables. If you don't know Bash, you can also drag/cut/paste from Excel into a shellscript using the filenames, generating lines like tadpole.sh in=reads174.fq out=contigs174.fa (which I prefer anyway since it leaves an explicit record of all commands). There's no faster way, and that will not generate any temp files or new directories. But I am curious as to how you got the reads into files by haplotype and gene. Because if each file does not contain all the reads for a particular gene and haplotype, this method will not work very well; and I am unaware of a good method to bin reads by gene and haplotype.

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks a lot!

I mapped amplicon reads to the reference and did a snp calling, then I used a tool called hapcut for seperating haplotypes based on the snp calling and the mapping and readphaser to get the reads related to each haplotype. The phasing of haplotypes doesn't work for the complete sequence of a gene but it works in blocks which are long enough. I put the reads corresponding to one block of one haplotype (of one genotype and one gene) into one file. So the result was a few thousand files each containing reads that should assemble to one contig.

But actually they don't assemble to one contig in every case by using tadpole. I know though they should. I believe tadpole is quite strict? How can I make it less strict?

Thanks again!

ADD REPLY • link 9.6 years ago by nx8 ▴ 20

1

Entering edit mode

You can use a longer kmer, if you have sufficient data and long enough reads. Default is 31, and I recommend around 2/3rds of the read length. So, try k=100 if you have 150bp reads, or k=66 with 100bp reads, etc. That generally increases the continuity.

You can also set bm1=8 bm2=2 to reduce the stringency. So overall, tadpole.sh in=reads.fq out=contigs.fa k=66 bm1=8 bm2=2, for example. Additionally, adapter-trimming the reads prior to assembly is always a good idea (you can do that with BBDuk).

If you still end up with multi-contig assemblies, you might try Spades, which is less conservative and usually generates a more continuous assembly than Tadpole.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

I have 2X250 paired-end reads produced by Illumina-MiSeq sequencing. I set k to 167. It is better now but I still get multi-contig assemblies sometimes and sometimes no result at all. But I don't need the whole haplotype sequences of a gene, I just need long enough pieces to distinguish the two haplotypes and I think this works now. But I'll try Spades as well.

I'm very grateful for your help! Thanks a lot!!

ADD REPLY • link 9.6 years ago by nx8 ▴ 20

0

Entering edit mode

You're welcome!

In light of the fact that you are using 2x250bp reads, which are more likely to undershoot the insert-size target, I highly recommend you adapter-trim as detailed in this post. It's also often worth error-correcting your reads using a shorter kmer (such as 31) after adapter-trimming but before assembling, as 250bp reads often have a relatively high error rate toward the ends. Be sure to only error-correct within a haplotype.

bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
tadpole.sh in1=clean1.fq in2=clean2.fq out1=corrected1.fq out2=corrected2.fq mode=correct k=31

It's also often useful to merge the reads prior to assembly, if the insert size distribution is such that they mostly overlap:

bbmerge.sh in1=corrected1.fq in2=corrected2.fq out=merged.fq outu=unmerged.fq

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.6 years ago by Brian Bushnell 20k

0

Entering edit mode

my reads are already adapter- and quality trimmed. I've done that before mapping and calling variants.

I'm pulling the reads corresponding to the one phase or the other from the mappings so they all end up in one file. The PE-information is thus not going into the assembly I think. This probably affects the quality of the assembly but I"m not sure what to do about it right now.

When I merge I also loose a lot of reads. I did the correction step though.

ADD REPLY • link 9.6 years ago by nx8 ▴ 20

score 0 · Answer 2 · 2016-01-25

0

Entering edit mode

9.6 years ago

nx8 ▴ 20

.....................

ADD COMMENT • link 9.6 years ago by nx8 ▴ 20