(EDITED 8/4/23 for clarity)
I recently miniprepped a supernatant from lysed E. coli (due to prophage induction) and decided to submit the sample for NGS (2 x 150 bp). (Note that this sample was pooled with similarly lysed samples from other species to get the most out of my sequencing run.)
When the .fastq data came back, I trimmed the bulk data (Trimmomatic) and mapped the reads back to the individual genomes using BWA. I then used samtools to get a .bam file, sorted it, and visualized it in IGV just to get a sense of how the reads mapped.
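Roughly, the workflow looked like this (a sketch rather than my exact commands; the file names, adapter file, and thread counts are placeholders, and it assumes the conda-style trimmomatic wrapper):

trimmomatic PE -threads 8 bulk_R1.fastq.gz bulk_R2.fastq.gz \
    trimmed_R1.fastq.gz unpaired_R1.fastq.gz trimmed_R2.fastq.gz unpaired_R2.fastq.gz \
    ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50    # trim adapters and low-quality ends
bwa index ecoli_ref.fasta    # index the reference once
bwa mem -t 8 ecoli_ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz | samtools sort -o ecoli.sorted.bam -    # map and coordinate-sort
samtools index ecoli.sorted.bam    # index the BAM so IGV can load it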
In the picture below, where the reads were mapped back to E. coli, the first row is the 'coverage' track and the second row is the actual reads. I am only showing a representative 20 kb segment of the genome, and you can clearly see some regions getting far better coverage, whereas some regions have no reads at all. By eye, this pattern held across almost the whole genome. I was wondering if this is normal? The reference genome I used for read mapping came from Nanopore WGS done by a company, so it should be good.
However, in the picture of reads mapped back to B. subtilis (the B. subtilis sample was also prophage-lysed and miniprepped the same way as above, then pooled with the E. coli sample for submission), the data look much better:
This does not look like WGS data. The coverage should not be sparse like this. What is the expected size of the genome and how much data (in terms of gigabases) did you end up with after trimming?
Did you actually do a simple mini-prep and not do any other purification/effort to keep the genomic DNA intact/high-MW? Did you make the libraries, or did your sequencing provider make the libraries?
Thanks for the reply!
The expected genome size is 4.2 Mbp (E. coli), and this particular sample resulted in a sorted .bam file of 820 MB. Because this sample was pooled together with samples from other species, the data came back in bulk and was trimmed together, so I can't say for sure how much of the trimming was due to this E. coli sample. However, the bulk fastq files were trimmed from 1.8 GB/1.8 GB (F+R) to 1.6 GB/1.6 GB (F+R).
I apologize, maybe I used the wrong term. This sample was actually bacterial supernatant (containing bacteria lysed by prophage induction), which I cleaned up using a total DNA kit (https://www.zymoresearch.com/products/quick-dna-miniprep). I did the library prep myself using a NEBNext Ultra kit.
My goal was actually to see how much DNA mapped back to the prophage and how much mapped back to the backbone DNA.
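For what it's worth, the comparison I had in mind was something like this (a sketch; the contig name and prophage coordinates are made-up placeholders):

samtools view -c -F 0x904 ecoli.sorted.bam    # total mapped reads (excludes unmapped, secondary, supplementary)
samtools view -c -F 0x904 ecoli.sorted.bam chromosome:800000-848000    # reads falling in the (hypothetical) prophage region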
What does this mean? Did your sample not have its own index? I hope it had its own index; otherwise, with a mixed sample, any analysis would be meaningless.
If you had the correct references you should have been able to find the answer to this.
The top screenshot you posted above is misleading and does not provide a complete picture. What do other areas of the genome look like?
Edit: Did you add the second screenshot after you did the original posting? I don't recall seeing that before. But that is how WGS data should look.
Each sample did indeed have its own index, so they were separated out.
The other areas look similar; what is shown above is a representative portion, but by eye, most regions had this sort of spotty mapping.
Yes, after your post I edited my original post to provide more information. Indeed, the second read mapping was good. I was surprised, since I followed the exact same workflow for both samples. I wonder whether that means the reference I'm mapping to in the first picture is bad...?
This is puzzling. Can you post the result of
samtools idxstats your.bam
? How many reads did you have (number and length), and how many of them mapped to the genome (%)? E. coli is not an odd bacterium, and your reads should align at a relatively high percentage no matter which E. coli genome you use.

I also used samtools stats and visualized some of the data. Here is what the table looked like for E. coli (spotty mapping in the picture in the original post), B. subtilis (good mapping in the picture in the original post), and S. typhimurium (not shown in the original post, but it was another sample that had phenomenal mapping):
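(For anyone following along, numbers like these can be pulled out of samtools roughly as follows; the file name is a placeholder:)

samtools idxstats ecoli.sorted.bam    # per contig: name, length, mapped reads, unmapped reads
samtools stats ecoli.sorted.bam | grep ^SN | cut -f 2-    # summary numbers: total reads, reads mapped, error rate, etc.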
This is getting stranger. < 10% mapping is not great for a WGS sample (unless it is grossly contaminated with something else, or is a metagenomic sample, etc.). What the heck are the other 90% of reads? Did you take a few and BLAST them at NCBI to check?
With WGS we expect the opposite to normally be true: 90+% of reads mapping and the remaining smaller fraction not.
Which aligner did you use, and are these paired-end reads?
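If you want to grab a few unmapped reads for a BLAST check, something like this would do it (a sketch; assumes a reasonably recent samtools with the fasta subcommand):

samtools fasta -f 4 ecoli.sorted.bam | head -n 20 > unmapped_sample.fasta    # -f 4 keeps only unmapped reads; first ~10 reads as FASTA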
The way the samples were prepped, they were not true WGS minipreps. They were supernatants taken from lysed bacteria --> cleaned up with a generic DNA cleanup kit --> pooled with samples from other species (e.g. E. coli samples were pooled together with B. subtilis and Salmonella typhimurium) and sequenced. The bulk fastq data was mapped back only to the source genome, though.
I guess of the ~10M pooled reads, 10% went to E. coli, 10% went to B. subtilis, and 50% went to Salmonella... in that sense it is a 'metagenomic' sample, in that it was artificially pooled.
They were 2 x 150 bp paired-end reads. I used the bwa mem aligner.
I see. I thought the tables above were discrete stats for three samples, so the totals you are showing above are for all reads. This is not how we generally represent data for independent "samples".
Sorry to be pedantic, but did these samples have their own indexes, and were they separate library preps?
It sounds like when you say "bulk fastq" you are referring to a pool of all samples. Keep in mind that aligners make a best effort to align sequence data. If the data contains reads that originated from a different genome, they may still be aligned by the aligner to a heterologous genome. The aligner has no way of discriminating beyond the alignment it sees.
Considering all this, it seems that you are only getting parts of the genome showing up in the supernatant for E. coli, whereas the other two genomes seem to be represented well. Reason? Inefficient lysis of E. coli?
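One quick sanity check along these lines (a sketch; assumes the sorted, indexed BAM from earlier) is to see how many of the E. coli alignments survive a mapping-quality filter, since confidently placed reads are less likely to be cross-mapped:

samtools view -c ecoli.sorted.bam    # every record in the BAM, including low-confidence alignments
samtools view -c -q 30 ecoli.sorted.bam    # only alignments with MAPQ >= 30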
This is my first time trying to troubleshoot these sorts of experiments, so I really appreciate all the questions, thank you! The samples I am showing data for did not have their own index. The plan was to differentiate the samples by mapping them to different genomes, with the assumption that only E. coli reads (in the 'bulk fastq') would map back to E. coli, etc.
I know I mentioned indexes earlier on, but indexes in this experiment were only used for different treatment groups. Within each treatment group, no index was used. (Again, with the thought of differentiating the samples by genome.)
I see... perhaps this is where I had an incorrect assumption. I assumed E. coli reads would only map back to E. coli, since it is perhaps different enough from B. subtilis and Salmonella. Like I mentioned above, this was how I thought to differentiate the samples --> should I have used an index for the different strains? Although how would one explain that Salmonella still had very good mapping despite the bulk fastq data coming from a mixed bag?
Indeed, that's possible. It turns out that for E. coli, while there was spotty mapping overall, the region harboring the lambda prophage had extremely high coverage and was not at all spotty like the other regions... I wonder if that 'took up' most of the E. coli reads.
You can't be 100% certain with short reads. You did not map the data to all three genomes at the same time, so you don't know how much multi-mapping there is going to be. Within the genome there is not a lot (the reads with MQ=0, where MQ stands for mapping quality, are the multi-mapping reads).
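If you wanted to check, one way (a sketch; the reference file names are placeholders, and it assumes contig names are unique across the three references) would be to map the pooled reads to all three genomes at once and let mapping quality flag the ambiguous reads:

cat ecoli_ref.fasta bsubtilis_ref.fasta styphimurium_ref.fasta > combined_ref.fasta
bwa index combined_ref.fasta
bwa mem -t 8 combined_ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz | samtools sort -o combined.sorted.bam -
samtools index combined.sorted.bam
samtools idxstats combined.sorted.bam    # mapped read counts per contig, i.e. per genome
samtools view -c -q 1 combined.sorted.bam    # confidently placed reads only (drops MQ=0 multi-mappers)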
Yes, each sample should have been indexed separately. That way you know which sample/genome each read came from. When you are dealing with multiple samples/genomes you will need to know that. Also, when you are going to try to compare amongst groups, you need to be able to make sure that the comparison is valid and that you can do appropriate normalization on the data for the comparisons.
There is no scientific explanation; it could entirely be by chance. It looks like the "prep" you did somehow worked better for that strain than the others (assuming you did exactly the same prep for all at the same time). If you repeated the experiment (you will need biological reps later, BTW), it might be some other genome, or all of them.
The virus must have had a chance to replicate during the lytic cycle, and its genome, being relatively small, ended up with more copies in the prep.
Probably not. Generally there is always an excess of library reagents, so you are not going to run out of those. Every fragment of DNA in a sample has an equal chance of being made into a library fragment. If the viral DNA is present in a much greater amount than the bacterial fragments, then that is what you are going to see.
The ratio of viral to bacterial reads is something you will need to account for in your comparisons. Depending on what this experiment is about, it may be an important consideration.
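A rough way to get that ratio (a sketch; the contig name and prophage coordinates are hypothetical placeholders, and it assumes a sorted, indexed BAM):

prophage=$(samtools view -c -F 0x904 ecoli.sorted.bam chromosome:800000-848000)    # reads in the (hypothetical) prophage region
total=$(samtools view -c -F 0x904 ecoli.sorted.bam)    # all primary mapped reads
backbone=$((total - prophage))
echo "prophage:backbone = ${prophage}:${backbone}"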