Hi all,
I have assembled a fungal genome, 38 Mb in size, using Flye. I filtered my reads down to around 100x coverage, but looking at the assembly graph in Bandage, there are 3 nodes (700, 1000, and 2000 bases in size) with a coverage of over 2000x. Running these through BLAST, I find that one is a ribosomal gene and the other two are known transposable elements (TEs). As I fear these TEs may be causing a misassembly (judging by the tangles in the assembly graph), I want to remove the reads that are about the size of the high-coverage nodes, i.e. reads up to 700, 1000 or 2000 bases in length, but retain the longer reads, which should come from the chromosomal regions containing these TEs.
I used this command: minimap2 -ax map-ont ContaminatingNode1.fasta Reads.fasta | samtools fasta -n -f 4 - > NoContaminationreads.fasta
This seems to have removed the long reads as well: when I align NoContaminationreads.fasta to the assembled genome, no reads span the parts of the contig where these high-coverage TEs should be. Is there any way to remove only the reads up to a certain size, but retain the longer reads, which are probably chromosomal? I.e., I want to perform an assembly in which the regions these sequences map to have around 100x coverage and no more.
Many thanks in advance
Best
Zack
Sounds like you have three good contig sequences that represent things you don't want in your data. Can you try to remove the reads (or parts of reads) that align to those sequences? You may have tried that already, but that is not clear from your text.
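One possible reason the long reads disappeared: `samtools fasta -f 4` keeps only unmapped reads, so any read with an alignment to the node is dropped, including long chromosomal reads that merely contain the TE. A rough sketch of a length-aware alternative is below. It assumes minimap2 is run without `-a` so that it writes PAF (column 2 is the read length, columns 3 and 4 the aligned interval on the read). The function names `short_repeat_reads` and `drop_reads`, the file names `hits.paf`, `drop.txt` and `FilteredReads.fasta`, and the 0.8 aligned-fraction cutoff are all made up for illustration, not something from the original post.

```shell
# Print names of reads that are short AND mostly covered by a repeat node.
# Reads PAF on stdin.
# $1 = longest read length to discard (assumed: about the node size)
# $2 = minimum fraction of the read that must align (assumed cutoff, e.g. 0.8)
short_repeat_reads() {
    awk -v max_len="$1" -v min_frac="$2" '
        $2 <= max_len && ($4 - $3) / $2 >= min_frac { print $1 }
    ' | sort -u
}

# Copy FASTA from stdin to stdout, skipping records whose name
# (first word after ">") is listed, one per line, in file $1.
drop_reads() {
    awk -v names="$1" '
        BEGIN { while ((getline n < names) > 0) drop[n] }
        /^>/  { name = substr($1, 2); keep = !(name in drop) }
        keep
    '
}

# Hypothetical usage with the files from the question (not run here):
#   minimap2 -x map-ont ContaminatingNode1.fasta Reads.fasta > hits.paf
#   short_repeat_reads 2000 0.8 < hits.paf > drop.txt
#   drop_reads drop.txt < Reads.fasta > FilteredReads.fasta
```

Long reads that only partly overlap the node fail both conditions and are kept, which is the behaviour asked for. Whether this brings the region down to around 100x depends on how many of the excess reads are actually short, so it is worth checking coverage again after filtering.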