Hi,
I have about 10.000 files each containing reads that belong to haplotypes of different individuals and different genes. I would like to assemble all of them separately. Is there a way to do assemblies file by file without having to create separate directories? Or any other fast method? They are all quite small.
Many thanks!
Thanks a lot!
I mapped amplicon reads to the reference and did a snp calling, then I used a tool called hapcut for seperating haplotypes based on the snp calling and the mapping and readphaser to get the reads related to each haplotype. The phasing of haplotypes doesn't work for the complete sequence of a gene but it works in blocks which are long enough. I put the reads corresponding to one block of one haplotype (of one genotype and one gene) into one file. So the result was a few thousand files each containing reads that should assemble to one contig.
But actually they don't assemble to one contig in every case by using tadpole. I know though they should. I believe tadpole is quite strict? How can I make it less strict?
Thanks again!
You can use a longer kmer, if you have sufficient data and long enough reads. Default is 31, and I recommend around 2/3rds of the read length. So, try
k=100
if you have 150bp reads, ork=66
with 100bp reads, etc. That generally increases the continuity.You can also set
bm1=8 bm2=2
to reduce the stringency. So overall,tadpole.sh in=reads.fq out=contigs.fa k=66 bm1=8 bm2=2
, for example. Additionally, adapter-trimming the reads prior to assembly is always a good idea (you can do that with BBDuk).If you still end up with multi-contig assemblies, you might try Spades, which is less conservative and usually generates a more continuous assembly than Tadpole.
I have 2X250 paired-end reads produced by Illumina-MiSeq sequencing. I set k to 167. It is better now but I still get multi-contig assemblies sometimes and sometimes no result at all. But I don't need the whole haplotype sequences of a gene, I just need long enough pieces to distinguish the two haplotypes and I think this works now. But I'll try Spades as well.
I'm very grateful for your help! Thanks a lot!!
You're welcome!
In light of the fact that you are using 2x250bp reads, which are more likely to undershoot the insert-size target, I highly recommend you adapter-trim as detailed in this post. It's also often worth error-correcting your reads using a shorter kmer (such as 31) after adapter-trimming but before assembling, as 250bp reads often have a relatively high error rate toward the ends. Be sure to only error-correct within a haplotype.
It's also often useful to merge the reads prior to assembly, if the insert size distribution is such that they mostly overlap:
my reads are already adapter- and quality trimmed. I've done that before mapping and calling variants.
I'm pulling the reads corresponding to the one phase or the other from the mappings so they all end up in one file. The PE-information is thus not going into the assembly I think. This probably affects the quality of the assembly but I"m not sure what to do about it right now.
When I merge I also loose a lot of reads. I did the correction step though.