Good morning everyone,
This is my first question on Biostars, so please bear with me if it is not well structured.
I am currently processing raw Illumina paired end reads to proceed with the assembly of 96 chloroplast genomes.
Whole-genome shotgun libraries were prepared using the NEBNext Ultra II FS DNA Library Prep Kit for Illumina combined with the NEBNext Dual Index Primers Set II, following the manufacturer's protocol, with an input of 200 ng of DNA and 25 min of fragmentation time, aiming to obtain fragments ranging from 150 to 350 bp.
We then sent the 96 libraries to be sequenced on an Illumina HiSeq X (150x150 bp) and received confirmation that the fragment sizes of the pooled library were within the expected range.
I then checked the quality of the raw reads with FastQC and everything seems alright, except for the presence of Illumina adapters. Here is a snapshot of the FastQC report for the forward reads of one sample:
After this I decided to proceed with Trimmomatic and filter the reads by average quality, keeping reads with an average quality of at least 20. Here is the command I used:
java -jar /home/bland/Programs/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 -trimlog Trimlog L01_F.fq L01_R.fq L01_PF9.fq L01_UF9.fq L01_PR9.fq L01_UR9.fq ILLUMINACLIP:adapterIL.fa:2:30:10 AVGQUAL:20
For the command above I customized an adapter file, but I also ran the same command using the adapter file TruSeq3-PE-2.fa which is provided by Trimmomatic and the results were the same.
After running the command, most of the reads are retained, but around 80% are visibly shortened:
This means that when I add a MINLEN filter to the Trimmomatic command, let's say to retain reads that are 130-150bp long, I will retain only 20% of my initial raw reads.
Example of command:
java -jar /home/bland/Programs/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 -trimlog Trimlog L01_F.fq L01_R.fq L01_PF9.fq L01_UF9.fq L01_PR9.fq L01_UR9.fq ILLUMINACLIP:adapterIL.fa:2:30:10 AVGQUAL:20 MINLEN:130
I isolated the short reads left after trimming with AVGQUAL as the only filter. There seems to be some contamination from Illumina primers within the reads: before and after the primer sequence I still get some additional bases, though this is not the case for all reads in this group.
Did someone have a similar problem and what could be the cause?
I am not a biologist by training and this is the first time I have worked with this kind of data. I think I am missing something that could be going on during the sequencing process. Maybe primer dimers were formed, or the insert size was shorter than 150 bp?
Any help is appreciated, Beatrice
The fact that you retain only 20% of the reads when applying MINLEN after adapter clipping fits perfectly with the "Adapter Content" graph, which shows that 80% of your reads carry the Universal adapter.
My guess would be that your chloroplast DNA extraction/prep failed or is too contaminated with some kind of inhibitory substance, so all you got after the library prep seems to be mostly adapter dimers. Or your fragmentation did not work as expected (I guess the FS kit uses enzymatic fragmentation?).
Did you check the purity of your DNA with a NanoDrop and the quantity with a fluorometer (e.g. Qubit)?
Did you check your fragmented DNA on the Bioanalyzer after fragmentation but BEFORE continuing with the library prep?
Are you sure about that? Looking at the sequence length distribution graph above (which is from after the trimming?), things don't appear to be so dire. Most of your reads still seem to be 150 bp long.
What is the aim of this experiment? Are you re-sequencing, or trying to do a de novo assembly? If the former, you should still be able to get reasonable results.
Thank you for the replies!
JV: I did not do the labwork, but what I know is that the quality and fragmentation of the DNA (extracted with CTAB) were checked with a Quantus fluorometer and via electrophoresis on an agarose gel. The quality and fragmentation of the libraries were checked again with a Quantus fluorometer and a TapeStation. So I think the answer to your questions is yes in both cases.
genomax: The aim of the project is to assemble the chloroplast genomes of different populations of a wild species, using the chloroplast genome of its cultivated relative (already available) as a reference. That said, I also wanted to assemble one genome de novo for the wild species to double-check whether there may be structural differences between the two species. The species are closely related and probably hybridised in the past. I am sure I retain roughly 20% of the reads after filtering with MINLEN:130 or MINLEN:140; here is an example:
That is a pretty severe cutoff and would indeed result in an 80% loss of data. I suggest that you start with MINLEN:50. You may even be able to get by with something lower, but since you are going to assemble de novo it may be good to start there.
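Applied to the command from the question (same paths, filenames, and settings as in the original post, only the length cutoff relaxed), that would look like:

```shell
# Same Trimmomatic call as above, with the relaxed length cutoff MINLEN:50
java -jar /home/bland/Programs/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 \
    -trimlog Trimlog L01_F.fq L01_R.fq L01_PF9.fq L01_UF9.fq L01_PR9.fq L01_UR9.fq \
    ILLUMINACLIP:adapterIL.fa:2:30:10 AVGQUAL:20 MINLEN:50
```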
Sounds like you may have related chloroplast genomes available, so doing a reference-based assembly may also make things easier. Does this data contain only chloroplast DNA, or is there some genomic DNA as well? Having extraneous DNA may make the assembly harder. If you have a large amount of coverage, you may need to normalize/downsample your data to get good assemblies.
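For the downsampling step, one option is seqtk (assuming it is installed; the 10% fraction and filenames here are just an illustration):

```shell
# Keep the same seed (-s) for both mates so read pairing stays in sync.
seqtk sample -s100 L01_PF9.fq 0.1 > L01_PF9.sub.fq
seqtk sample -s100 L01_PR9.fq 0.1 > L01_PR9.sub.fq
```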
In any case, be sure there is no extraneous sequence present before you assemble. I suggest taking a look at
bbduk.sh
from the BBMap suite as an alternate scanning/trimming program to see what it finds.
Thanks, I'll try BBDuk for comparison with Trimmomatic, and map the trimmed reads between 30 and 130 bp to the cultivated species' chloroplast genome to check whether I can pull something out of those.
If I get anything good I'll give an update.
The explanation by @JV is correct. The Adapter Content graph before trimming indicates that the majority of your inserts are significantly shorter than your read length (some as short as 30 bp). This should have been obvious to the bench scientist who assessed the library on the TapeStation.
@genomax is correct that a smaller MINLEN value would preserve more of your data, but that should not be necessary. 20% of 5M reads @ 140 bp trimmed = 140 Mbp of sequence. For a typically sized chloroplast genome (~150 kbp), that is still greater than 900X coverage, more than sufficient for de novo assembly.
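The arithmetic above can be reproduced as a quick sanity check (the read count and genome size are the round numbers assumed in this thread, not measured values):

```shell
#!/bin/sh
# Back-of-the-envelope coverage: 20% of 5M reads surviving at ~140 bp,
# spread over a ~150 kbp chloroplast genome.
total_reads=5000000
retained=$((total_reads * 20 / 100))   # reads surviving MINLEN:140
read_len=140
genome_bp=150000
bases=$((retained * read_len))         # total sequence retained
coverage=$((bases / genome_bp))        # integer coverage estimate
echo "${bases} bp retained, ~${coverage}X coverage"
```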
Thank you. Regarding your reasoning about the coverage, I thought the same, but then I was told that many reads won't belong to the chloroplast genome (they may well come from the nuclear genome), so that calculation assumes the ideal situation where all reads actually come from the chloroplast. Still, with the ideal coverage being so high, I think the actual coverage would still be good.