Question

Removing Human Contigs From Metagenomic Shotgun Assembly (FASTA)

0

5.8 years ago

isu2017 • 0

Hi there,

I used SPAdes to assemble my metagenomic shotgun dataset into contigs. I just realized, however, that there is human contamination in this assembly. Because of how long the assembly took, I'm trying to think of ways to remove those human contigs from the FASTA assembly. Any suggestions? And if I need to go back a step and remove them from the FASTQ files, how should I proceed? (I'd rather not use something like KneadData for removal of human contamination, btw.)

thanks!

Spade Metagenome Metagenomics • 4.1k views

updated 5.8 years ago by evoBio ▴ 50 • written 5.8 years ago by isu2017 • 0

4

You could simply align the data to the human genome (use blat, LAST or LASTZ) and remove sequences that align.
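To illustrate the removal step, here is a minimal Python sketch. It assumes you have already run the aligner and saved a tab-separated hit table with the contig ID in the first column; that layout, and the function names below, are assumptions made for this example and not something any of the aligners requires. You would normally filter the hits by identity and alignment length first, since a short spurious match should not be enough to discard a contig.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_hit_ids(lines, id_column=0):
    """Collect contig IDs from a tab-separated hit table (assumed layout)."""
    ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        ids.add(line.rstrip("\n").split("\t")[id_column])
    return ids


def non_human_contigs(fasta_lines, human_ids):
    """Yield only the contigs whose ID (first word of the header) had no hit."""
    for header, seq in read_fasta(fasta_lines):
        fields = header.split()
        contig_id = fields[0] if fields else ""
        if contig_id not in human_ids:
            yield header, seq


def filter_assembly(hits_path, fasta_in_path, fasta_out_path, width=60):
    """Read the hit table and the assembly, write the contigs without hits."""
    with open(hits_path) as hits:
        human_ids = load_hit_ids(hits)
    with open(fasta_in_path) as fin, open(fasta_out_path, "w") as fout:
        for header, seq in non_human_contigs(fin, human_ids):
            fout.write(f">{header}\n")
            for i in range(0, len(seq), width):
                fout.write(seq[i:i + width] + "\n")
```

A call such as `filter_assembly("human_hits.tsv", "contigs.fasta", "contigs.nonhuman.fasta")` would write the cleaned assembly; all three file names are placeholders.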

If you are willing to go back to the original data then try: http://seqanswers.com/forums/showthread.php?t=42552

5.8 years ago by GenoMax 153k

1

BlobTools is great for this, although if you have too many contigs (hundreds of thousands or millions) the BLAST step may be too slow.

updated 5.8 years ago by GenoMax 153k • written 5.8 years ago by h.mon 35k

0

In addition to good suggestions that are already part of this thread, I think you should look at all similar posts on the far right side of this page. This is a fairly common problem and has been debated already.

You may want to consider binning your sequences with t-SNE or UMAP. Human contigs longer than 5 kb should separate easily from other sequences.
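Both t-SNE and UMAP need a numeric feature vector per contig. A common choice, and an assumption here since the feature set is not specified above, is the tetranucleotide frequency profile. The sketch below only builds that matrix in plain Python; the embedding itself would come from a separate library (for example `TSNE` in scikit-learn or `UMAP` in umap-learn). The function names are made up for this example.

```python
from itertools import product


def all_kmers(k=4):
    """Return every DNA k-mer in a fixed (lexicographic) order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, k=4):
    """Relative frequency of each k-mer; windows with N or other symbols are skipped."""
    kmers = all_kmers(k)
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def feature_matrix(contigs, min_length=5000, k=4):
    """Build (names, rows) for contigs of at least min_length bases.

    contigs is an iterable of (name, sequence) pairs. Short contigs are
    dropped because their k-mer profiles are too noisy to separate well.
    """
    names, rows = [], []
    for name, seq in contigs:
        if len(seq) >= min_length:
            names.append(name)
            rows.append(kmer_frequencies(seq, k))
    return names, rows
```

Many binning tools merge each k-mer with its reverse complement so that strand does not matter; that step is left out here to keep the sketch short.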

5.8 years ago by Mensur Dlakic ★ 29k

score 1 · Answer 1 · 2019-10-25

Removing the host genome should be part of the quality control step of your metagenomic pipeline. You can do this right after you quality-trim your reads. There are several ways to remove host reads; I personally used BWA (Bowtie2 is another option) to align the reads to the human genome. You end up with two SAM or BAM files (aligned and unaligned); take the unaligned SAM/BAM file and convert it to FASTA or FASTQ (I used Picard Tools here, but SAMtools or BamTools also work) to obtain the non-human reads with which you will perform the assembly. Regardless of whether you remove host reads, SPAdes can be a memory hog and take a while to run, depending on the size of your data set.
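To make the "keep the unaligned reads" step concrete, here is a rough Python sketch that pulls unmapped reads out of an uncompressed SAM file and formats them as FASTQ. It is not what Picard or SAMtools do internally, just the same idea expressed with the FLAG bits from the SAM specification. The function name and the `require_both_mates` switch are made up for this example; by default a pair is kept only when both mates failed to align to human, which is the more conservative choice.

```python
# FLAG bits defined in the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def unmapped_fastq_records(sam_lines, require_both_mates=True):
    """Yield FASTQ records (as 4-line strings) for reads that did not align."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only primary records represent the original read
        if not flag & UNMAPPED:
            continue  # read aligned to the host genome
        if require_both_mates and flag & PAIRED and not flag & MATE_UNMAPPED:
            continue  # mate hit the host, so drop the whole pair
        if flag & REVERSE:
            # restore the original read orientation
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        if flag & FIRST_IN_PAIR:
            name += "/1"
        elif flag & SECOND_IN_PAIR:
            name += "/2"
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

With an open SAM file `sam` and an output handle `out`, `out.writelines(unmapped_fastq_records(sam))` would write the non-human reads. For BAM input or large data sets the dedicated tools mentioned above are the better choice.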