7.7 years ago
Lucila
Hi all,
I am filtering and trimming my RNA-seq data with the fastq quality trimmer in the FASTX toolkit, and I am wondering which minimum length to use; that is, below which length it is convenient to discard reads after trimming. I sequenced my samples with Illumina, and my reads are 50 bp long.
Thank you for all your comments!
Depends on what you want/need to do with your sequences. With sequences being so short, you're probably better off simply trimming by quality.
@st.ph.n - it sounds like he's already trimming by quality, but agreed, more details are needed to give a useful answer. So, please describe:
Also, FASTX is the bottom of the barrel in bioinformatics software. It's slow and uses non-optimal algorithms; I suggest you use virtually anything else.
Thanks Brian,
Do you have any alternative to FASTX tool?
If you need more info, please let me know. Lucila
Both seqtk and BBDuk use the optimal Phred algorithm for quality-trimming, and both are very fast. I'd recommend either of them, though for paired reads I think BBDuk is more convenient. Adapter-trimming is probably at least as useful as quality-trimming, and BBDuk allows adapter-trimming simultaneously with quality-trimming. Neither is overly important for 50bp shotgun reads, though; they become more important with full-length reads (150bp+).
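To make the quality-trimming idea concrete, here is a minimal Python sketch of maximal-scoring-segment trimming, which is how Phred (or Mott) trimming is commonly described. It is not the seqtk or BBDuk code. The function names, the Phred+33 encoding, the 0.05 error threshold, and the minimum length of 25 are all assumptions chosen for illustration, not recommendations from this thread.

```python
def phred_trim(qualities, max_error=0.05):
    """Return (start, end) of the best-scoring segment; end is exclusive.

    Each base scores (max_error - error_probability), so good bases add
    to the running score and bad bases subtract from it. The segment
    with the highest total score is kept.
    """
    best_score = 0.0
    best_start = best_end = 0
    run_score = 0.0
    run_start = 0
    for i, q in enumerate(qualities):
        p_error = 10 ** (-q / 10)  # Phred score -> error probability
        run_score += max_error - p_error
        if run_score < 0:
            # A segment with a negative total is never worth extending.
            run_score = 0.0
            run_start = i + 1
        elif run_score > best_score:
            best_score = run_score
            best_start, best_end = run_start, i + 1
    return best_start, best_end


def trim_read(seq, qual, max_error=0.05, min_length=25):
    """Trim one read; return (seq, qual), or None if it ends up too short.

    Assumes Phred+33 quality encoding. min_length=25 is an arbitrary
    placeholder, not a value taken from the discussion.
    """
    qualities = [ord(c) - 33 for c in qual]
    start, end = phred_trim(qualities, max_error)
    if end - start < min_length:
        return None
    return seq[start:end], qual[start:end]
```

Calling trim_read on each record and dropping the None results is one way to combine trimming with a minimum-length filter; which min_length is sensible depends on the downstream analysis.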
As st.ph.n indicated, you can't get a good assembly with single-ended 50bp reads, especially for a diploid. But you can still get an assembly, which may be sufficient for differential expression; many of the genes won't be complete in a single contig, but their sequence will still be present, and you may be able to annotate them with function. Then comparison between different conditions could be useful. However, you'd be better off sequencing a sample deeply with 2x150bp reads (or ideally, 2x250) with relatively long inserts (say, 600bp average, with a wide range) for the de novo assembly, then using the 50bp data for quantification.
You can determine the GC content of your organism (to a first approximation) from the reads. For example, using BBDuk:
Notably, if that graph has two peaks, that may indicate your insect has a bacterial symbiote with different GC (cockroaches do, for example). Those could be assembled independently.
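For anyone who wants to see what a per-read GC histogram involves, here is a rough Python sketch. It is not the BBDuk command referred to above, just a stand-in that computes the same kind of histogram. It assumes a plain four-line-per-record FASTQ; the function names and the default of 20 bins are made up for illustration.

```python
import io


def gc_fraction(seq):
    """Return GC / (A+C+G+T) for a sequence, or None if no bases are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def gc_histogram(fastq_handle, bins=20):
    """Count reads per GC bin. Assumes exactly four lines per FASTQ record."""
    counts = [0] * bins
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        gc = gc_fraction(line.strip())
        if gc is None:
            continue
        # A GC fraction of exactly 1.0 goes into the last bin.
        index = min(int(gc * bins), bins - 1)
        counts[index] += 1
    return counts


def histogram_to_tsv(counts):
    """Format the histogram as tab-separated text for a spreadsheet."""
    bins = len(counts)
    rows = ["gc_bin_start\tcount"]
    for i, count in enumerate(counts):
        rows.append(f"{i / bins:.2f}\t{count}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Tiny in-memory example instead of a real file.
    demo = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nAATT\n+\nIIII\n")
    print(histogram_to_tsv(gc_histogram(demo, bins=4)))
```

The tab-separated output can be pasted into a spreadsheet and plotted as a column chart; two separate humps would be the kind of signal described above.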
Thank you Brian for your response. Regarding the histogram, which software do you recommend to plot it in a graph?
I have seen more than one peak with FastQC. So perhaps it could be interesting to assemble them independently. Which tool allows me to organize the reads according to the GC content?
Sorry for all my beginner's questions.
Best, Lucila.
You can filter by GC content with BBDuk's mingc and maxgc flags, but that's a pretty crude way of filtering, since the peaks will overlap. It's usually better to filter the assembled contigs by GC, since they are longer.
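Filtering contigs by GC can be sketched in a few lines of Python. This is only an illustration of the idea, not a BBTools feature; the function names are hypothetical, and the GC bounds have to come from wherever the peaks separate in your own histogram.

```python
import io


def gc_fraction(seq):
    """Return GC / (A+C+G+T) for a sequence, or None if no bases are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs_by_gc(handle, min_gc, max_gc):
    """Yield contigs whose GC fraction lies within [min_gc, max_gc]."""
    for name, seq in read_fasta(handle):
        gc = gc_fraction(seq)
        if gc is not None and min_gc <= gc <= max_gc:
            yield name, seq


if __name__ == "__main__":
    # Tiny in-memory example; the 0.4-0.6 window is an arbitrary placeholder.
    demo = io.StringIO(">c1\nGGCCGGCC\n>c2\nAATT\n>c3\nACGT\n")
    for name, seq in filter_contigs_by_gc(demo, 0.4, 0.6):
        print(f">{name}\n{seq}")
```

Running it twice with two different GC windows would give two contig sets, one per peak, which could then be inspected or annotated separately.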
As for plotting, I normally use Excel.
Assuming you have longer paired-end reads to create your de novo assembly, I suggest you use Trinity, incorporating your shorter (presumably single-end) reads in the assembly. As for trimming by quality, incorporate Trimmomatic into your Trinity command.
Thanks for your answer st.ph.n. Unfortunately we only have these 50 bp single-end reads. Do you think it is possible to make a de novo assembly with them? We selected the cheaper option but maybe it was not the best....
Ideally, you want longer paired-end reads to create an assembly.