Question

Filter out specific reads from FASTQ files

1

10.0 years ago

Paul ★ 1.5k

Dear all,

I have paired-end RNA-seq data (Illumina) from a parasite, and I would like to do de novo assembly with Trinity. I have a reference genome for the host organism, so I can map my data to the host and remove the contaminating reads from the FASTQ files.

My plan is:

1. Map my paired-end FASTQ files to the host reference genome with bwa/bowtie/novoalign
2. Remove the hits from the FASTQ files (cleaning out contamination)
3. Run Trinity on the remaining reads for de novo transcript assembly

My question is:

Can I use a DNA aligner (bwa etc.) to align the raw FASTQ files to the host DNA and then remove the contaminants from the FASTQ files? I ask because my data come from an RNA-seq project, not DNA.

How can I remove from the raw FASTQ files the sequences that align to the host DNA (the cleaning step)?

If you have any other advice on preparing data for the Trinity pipeline, I would appreciate it.

Thank you so much for any comment and sharing your experience.

De-Novo FASTQ filtering Illumina RNA-Seq • 6.8k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by Paul ★ 1.5k

score 6 · Accepted Answer · 2014-11-21

6

10.0 years ago

Devon Ryan 104k

If you have RNA-seq data, you'd be better off sticking with an aligner intended for spliced alignments (e.g. STAR). Most of these have an option to write unmapped reads/pairs to new FASTQ file(s), which you could then feed to Trinity or any other assembler (i.e., step #2 will be done for you). I don't have any advice on good assemblers; hopefully others will chime in with feedback there.
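As a rough illustration of the "unmapped reads to a new FASTQ" option, here is a minimal Python sketch that only builds a STAR command line; it does not run the aligner. The function name `star_unmapped_command`, the index directory `host_index`, and the file names are placeholders I made up, and the `zcat` step assumes gzipped input on a system where `zcat` is available. The STAR flags themselves (`--outReadsUnmapped Fastx` and friends) are from the STAR manual.

```python
import shlex


def star_unmapped_command(genome_dir, mate1, mate2, out_prefix, threads=4):
    """Build a STAR command that also writes unmapped reads to separate files.

    All paths are hypothetical placeholders; genome_dir must already hold
    a STAR index built from the host genome.
    """
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", mate1, mate2,
        "--outFileNamePrefix", out_prefix,
        # Ask STAR to write reads it could not map in FASTA/FASTQ form.
        "--outReadsUnmapped", "Fastx",
    ]
    if mate1.endswith(".gz"):
        # Assumption: gzipped input and a working zcat on this system.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    cmd = star_unmapped_command(
        "host_index", "reads_1.fastq.gz", "reads_2.fastq.gz", "host_"
    )
    # Print the command so it can be inspected before running it,
    # e.g. with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

With the prefix above, STAR should write the unmapped mates to `host_Unmapped.out.mate1` and `host_Unmapped.out.mate2`. It is worth checking that the two files are still in sync before handing them to an assembler.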

ADD COMMENT • link 10.0 years ago by Devon Ryan 104k

0

Thank you Devon, I will try STAR and maybe TopHat, and I'll see how it works.

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

Ram · Accepted Answer · 2014-11-21

2

10.0 years ago

Manvendra Singh ★ 2.2k

I agree with Devon.

I would do it as follows:

Map the FASTQ files with tophat2.
Convert the unmapped.bam file to FASTQ (bamToFastq) and remap with tophat2, this time providing the junctions you got from the first run with the -j option (if you have replicates, merge their junctions first).
The unmapped.bam from this second run can then be converted to FASTQ.

I think this FASTQ contains the reads you are looking for.
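The "merge the junctions" part could look something like the sketch below. It assumes the junction files are already in TopHat's raw junction format for `-j` (tab-delimited `chrom`, `left`, `right`, `strand`, one junction per line); a `junctions.bed` from the first run would first need converting, e.g. with TopHat's `bed_to_juncs`. The function name `merge_junctions` and all file names are hypothetical.

```python
def merge_junctions(paths, out_path):
    """Write the sorted union of junctions from several raw junction files.

    Each input line is expected to be: chrom<TAB>left<TAB>right<TAB>strand.
    Lines that do not fit that shape (comments, headers) are skipped.
    Returns the number of distinct junctions written.
    """
    junctions = set()
    for path in paths:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if line.startswith("#") or len(fields) < 4:
                    continue
                chrom, left, right, strand = fields[:4]
                if not (left.isdigit() and right.isdigit()):
                    continue
                # The set removes junctions shared between replicates.
                junctions.add((chrom, int(left), int(right), strand))

    with open(out_path, "w") as out:
        for chrom, left, right, strand in sorted(junctions):
            out.write(f"{chrom}\t{left}\t{right}\t{strand}\n")
    return len(junctions)
```

For example, `merge_junctions(["rep1.juncs", "rep2.juncs"], "merged.juncs")` would produce one file that can be passed to the second tophat2 run with `-j merged.juncs`.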

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Manvendra Singh ★ 2.2k

0

Thank you Manu for your comment. Why do you recommend mapping twice? I would appreciate a more detailed explanation.

ADD REPLY • link 10.0 years ago by Paul ★ 1.5k

0

In the second round of mapping, you provide all the junctions found in your RNA-seq data.

I have noticed that >5% of the previously unaligned reads align to the genome when you do this.

You then have more robust sets of mapped and unmapped reads, which you can follow up on.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Manvendra Singh ★ 2.2k

Ram · Accepted Answer · 2014-11-27

If you would like a Galaxy solution, this filters by ID: http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id / https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_filter_by_id

This filters using a SAM/BAM mapping file: http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_mapping / https://testtoolshed.g2.bx.psu.edu/view/peterjc/sample_seqs / https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_filter_by_mapping
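Outside Galaxy, the same filter-by-ID idea can be sketched in plain Python. This is a minimal illustration, not the code of the tools linked above. It assumes uncompressed, 4-line-per-record FASTQ files with both mates in the same order, and a set `host_ids` of read names that mapped to the host (how you collect that set, e.g. from the aligner's output, is up to you). All function and file names are hypothetical.

```python
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines
        seq, plus, qual = (handle.readline().rstrip("\n") for _ in range(3))
        yield header.rstrip("\n"), seq, plus, qual


def read_id(header):
    """Read name without the leading '@', any description, or a /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def filter_pairs(in1, in2, out1, out2, host_ids):
    """Copy read pairs whose ID is NOT in host_ids; return the number kept."""
    kept = 0
    with open(in1) as f1, open(in2) as f2, \
            open(out1, "w") as o1, open(out2, "w") as o2:
        for r1, r2 in zip_longest(read_fastq(f1), read_fastq(f2)):
            if r1 is None or r2 is None:
                raise ValueError("mate files have different numbers of reads")
            if read_id(r1[0]) != read_id(r2[0]):
                raise ValueError(f"mates out of sync at {r1[0]!r}")
            if read_id(r1[0]) in host_ids:
                continue  # drop the whole pair as host contamination
            o1.write("\n".join(r1) + "\n")
            o2.write("\n".join(r2) + "\n")
            kept += 1
    return kept
```

Dropping the whole pair when its ID is in the host set keeps the two output files in sync, which paired-end assemblers generally expect.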