Need help to remove contaminants from genome reads
6.0 years ago
cecilio11 ▴ 120

Hello Biostars,

I am doing de novo assemblies of two bacteria. The sequences were obtained with PacBio. I am using Canu on two different clusters: Canu 1.8 on an Intel cluster and Canu 1.7 on an IBM cluster. The estimated sizes of the genomes are known.

I have some issues with the assemblies. It seems that there may be some DNA from undesired bacteria in the PacBio output (perhaps from contamination of the bacterial cultures).

I was advised to use BWA-MEM (https://github.com/lh3/bwa) to map the reads to the known contaminant (I got the contaminant sequence from GenBank). After mapping, I should discard the reads that map to the contaminant and use the unmapped reads for the assembly of my desired bacterium. This sounds good for one of the two bacteria.
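A minimal sketch of the "keep only the unmapped reads" step described above, assuming the aligner writes SAM to stdout and the reads are FASTQ. The function name `sam_unmapped_to_fastq` and the file names are hypothetical; in practice `samtools fastq -f 4` is the usual way to do this, and the awk version below only illustrates the logic (FLAG bit 0x4 marks a read as unmapped).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BWA or samtools): read SAM on stdin and
# print, as FASTQ, only the records whose FLAG has the 0x4 "unmapped" bit set.
sam_unmapped_to_fastq() {
    awk -F'\t' '
        /^@/ { next }                 # skip SAM header lines
        int($2 / 4) % 2 == 1 {        # FLAG bit 0x4 set: read did not map
            printf "@%s\n%s\n+\n%s\n", $1, $10, $11
        }'
}

# Assumed usage (placeholder file names; the reference must be indexed first):
#   bwa index contaminant.fasta
#   bwa mem -x pacbio contaminant.fasta reads.fastq | sam_unmapped_to_fastq > clean.fastq
```

If the input reads are FASTA rather than FASTQ, the quality column in the SAM output is `*`, so the result would not be valid FASTQ.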

For the other one, I have only unplaced contigs for the contaminant (assembly report: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1/GCF_001020485.1_ASM102048v1_assembly_report.txt), taken from ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Moellerella_wisconsensis/latest_assembly_versions/GCF_001020485.1_ASM102048v1

How can I map my reads to such a set of contigs (108 unplaced ones)? Could you point me to a tool that makes those 108 unplaced contigs usable as a reference for mapping the contaminant reads?

It would be great if you could guide me to the right tools for removing the contaminant. For example, is BWA-MEM the best tool for the job? I have seen a tool from QIAGEN online, but it is paid software (https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/).

Regards,

Cecilio1

assembly • 2.9k views

Thank you, WouterDeCoster. I will take a look at it. It sounds promising.


WouterDeCoster,

The tool you created seems straightforward to use. I will give it a try as soon as the cluster at my place of work is up and running again.

However, for me, there is always the reference genome problem.

I have, from NCBI, some genomes that have unplaced contigs only. I understand that a complete reference genome is a single sequence of the bacterial genome. Most of the genomes I have from NCBI are not fully assembled yet.

Could you please clarify this for me? Can I use such unplaced contigs in lieu of a reference genome, and if so, how do I do it?

Best regards,

Cecilio


Please use ADD COMMENT or ADD REPLY to respond to earlier posts, so that this thread remains logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

NanoLyse just needs a FASTA file against which alignments will be checked. It doesn't matter if the genome is not fully assembled; separate contigs are fine. Note that reads from your genome of interest with high similarity to the genomes you are removing would be lost as well.
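Because separate contigs are fine, a multi-record FASTA holding all the unplaced contigs can be passed as the reference as-is. A quick sanity check before using it is to count the records and bases in the file. The sketch below assumes a plain (uncompressed) FASTA; `fasta_summary` and the file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<number of records><TAB><total bases>" for a
# plain-text FASTA file given as the first argument.
fasta_summary() {
    awk '
        /^>/ { n++; next }                  # header line: one more record
        { gsub(/[ \t\r]/, "")               # drop stray whitespace
          bases += length($0) }             # sequence line: add its length
        END { printf "%d\t%d\n", n, bases }' "$1"
}

# Assumed usage (placeholder file name):
#   fasta_summary contaminant_contigs.fasta
```

For the assembly mentioned in the question, the first number would be expected to match the 108 contigs listed in the assembly report.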


Dear WouterDeCoster,

Does NanoLyse only accept compressed files? Is there an option to control the % similarity between the reference genome and the reads that are being screened? (minimap2 allows for that).

Regards, Cecilio


No, there is currently no such option. That might be a useful feature, but I won't have the time soon. I have created an issue with this as a feature request. As NanoLyse is internally based on minimap2 this should be possible to implement. Contributions are welcome.
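Until such an option exists, one possible workaround is to run minimap2 directly and apply the similarity cut-off yourself. The following is only a sketch under stated assumptions, not a NanoLyse feature: it assumes minimap2's default PAF output, approximates identity as column 10 (matching bases) divided by column 11 (alignment block length), and assumes 4-line FASTQ records. The function names, the 0.90 threshold, and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read PAF on stdin and print the names of reads with at
# least one alignment whose identity (col 10 / col 11) is >= the fraction in $1.
paf_reads_above_identity() {
    awk -F'\t' -v min="$1" '$11 > 0 && ($10 / $11) >= min { print $1 }' | sort -u
}

# Hypothetical helper: read 4-line FASTQ records on stdin and drop every read
# whose name (header up to the first space, without "@") is listed in file $1.
drop_reads_by_id() {
    awk -v ids="$1" '
        BEGIN { while ((getline line < ids) > 0) drop[line] = 1 }
        NR % 4 == 1 { name = substr($1, 2); keep = !(name in drop) }
        keep { print }'
}

# Assumed usage (placeholder file names):
#   minimap2 -x map-pb contaminant.fasta reads.fastq \
#       | paf_reads_above_identity 0.90 > contaminant_ids.txt
#   drop_reads_by_id contaminant_ids.txt < reads.fastq > clean.fastq
```

A stricter threshold removes fewer reads, which may help keep reads from the genome of interest that only loosely resemble the contaminant.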


Dear WouterDeCoster, Does NanoLyse only accept compressed files? Cecilio


NanoLyse reads from stdin, so compression doesn't matter.

If your input is gzipped, use gunzip -c:

gunzip -c reads.fastq.gz | NanoLyse --reference contaminant.fasta > output.fastq

If your input is not compressed, simply use cat:

cat reads.fastq | NanoLyse --reference contaminant.fasta > output.fastq

However, often there is no good reason not to compress your data. Your hard drive will thank you.

6.0 years ago

I have written NanoLyse for that purpose. You can pipe your reads through it, and those that align to a genome specified by --reference will be removed.
