Hello everyone,
I am new to NGS analysis and practiced on test data before starting with my own. Now I seem to be stuck, so I am looking for help, suggestions, or ideas.
I have a few human patient samples for which WGS was performed, and I am trying to understand the variation in them. The sequencing company delivered the data as multiple libraries (2) and multiple lanes (2-6). I ran QC on every FASTQ file individually and then aligned each one separately with bwa mem against the hg38 reference genome (chrX only). After aligning them individually, I merged and sorted them with samtools for variant calling. But now, when I visualize the alignment files, I notice great variation in read count between regions in all the samples.
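For reference, the per-lane workflow described above can be sketched as follows. This is only a rough Python sketch that builds the command lines without running them. The sample, library, lane and file names are hypothetical placeholders, and tagging each lane with its own read group (`bwa mem -R`) is an assumption about how one might set this up so that coverage can later be compared per lane or per library; it is not necessarily what was done here.

```python
import shlex

def lane_alignment_command(ref, fq1, fq2, sample, library, lane, threads=4):
    """Build a 'bwa mem | samtools sort' pipeline for one lane of one library."""
    # bwa mem converts the literal two-character sequence '\t' in -R into a TAB.
    read_group = (f"@RG\\tID:{sample}.{library}.{lane}"
                  f"\\tSM:{sample}\\tLB:{library}\\tPL:ILLUMINA")
    out_bam = f"{sample}.{library}.{lane}.sorted.bam"
    align = ["bwa", "mem", "-t", str(threads), "-R", read_group, ref, fq1, fq2]
    sort = ["samtools", "sort", "-o", out_bam, "-"]  # '-' reads from stdin
    return f"{shlex.join(align)} | {shlex.join(sort)}", out_bam

def merge_command(sample, lane_bams):
    """Build a 'samtools merge' command combining coordinate-sorted lane BAMs."""
    return shlex.join(["samtools", "merge", f"{sample}.merged.bam", *lane_bams])

# Hypothetical layout: sample P1, two libraries, a few lanes.
lanes = [("L1", "lane1"), ("L1", "lane2"), ("L2", "lane1")]
bams = []
for library, lane in lanes:
    cmd, bam = lane_alignment_command(
        "hg38.fa",
        f"P1_{library}_{lane}_R1.fastq.gz",
        f"P1_{library}_{lane}_R2.fastq.gz",
        "P1", library, lane,
    )
    print(cmd)
    bams.append(bam)
print(merge_command("P1", bams))
```

Sorting each lane before merging means the merged file is already coordinate-sorted; merging first and sorting afterwards, as described above, should give an equivalent result.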
I have no idea why this is happening. Is there something wrong with the protocol I am following, or could it be caused by an issue in library preparation? Has anybody else experienced this, or can anyone suggest what I might be doing wrong?
Any leads/inputs would be greatly appreciated.
Thanks in advance, Neeru
What kind of QC did you perform? Did you start out with a similar number of reads in all samples/libraries?
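One quick way to answer the second question is to count the reads in each FASTQ file directly. A minimal Python sketch, assuming standard four-line FASTQ records (no wrapped sequence lines) and plain or gzip-compressed files; the function names and the tiny in-memory example are illustrative only:

```python
import gzip
import io

def count_fastq_records(handle):
    """Count reads in an open text handle, assuming 4 lines per record."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; file may be truncated")
    return n_lines // 4

def count_fastq_file(path):
    """Count reads in a FASTQ file, opening it with gzip if it ends in .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_records(handle)

# Tiny in-memory example with two made-up reads.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
n_reads = count_fastq_records(example)
print(n_reads)
```

Running `count_fastq_file` over every lane and library of a sample gives a simple table to compare before any alignment is done.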
Are you aligning just to chrX or to the entire genome?
I used FastQC for QC. Yes, all samples have a comparable number of reads across libraries. I am aligning to hg38 chrX, not the entire genome, as my main interest right now is chrX variation. Depending on the leads I get here, I will perform whole genome alignment as well.
If you are not doing the proper analysis, i.e. aligning data that came from the entire genome to the whole genome, then all that can be said is that your libraries don't represent the genome uniformly. That may extend to chromosomes beyond X once you do align to the entire genome.
and/or
There may also be a wider problem (contamination, bad libraries) that you would need to first address.
Thank you for your response.
I just finished the whole genome alignment for one of my samples against the complete hg38 reference, and when I visualize the alignment the trend remains the same, i.e. a few regions show a high read count (400-500) while others show a low read count (30-50). It also appears to me that this discrepancy is linked to exonic vs non-exonic regions.
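The exonic vs non-exonic impression can be put into numbers rather than judged by eye. Below is a minimal Python sketch, assuming per-base depth in the three-column format printed by `samtools depth` for a single BAM (chromosome, 1-based position, depth) and exon coordinates in BED format (0-based, half-open). The coordinates and depths in the example are made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

def load_intervals(bed_lines):
    """Parse BED lines into merged, sorted (starts, ends) lists per chromosome."""
    by_chrom = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlapping or adjacent: extend
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged

def is_exonic(merged, chrom, pos_1based):
    """True if a 1-based position falls inside any merged BED interval."""
    if chrom not in merged:
        return False
    starts, ends = merged[chrom]
    pos0 = pos_1based - 1                    # BED coordinates are 0-based
    i = bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < ends[i]

def mean_depth_by_class(depth_lines, merged):
    """Average depth over exonic and non-exonic positions present in the input."""
    totals = {"exonic": [0, 0], "non_exonic": [0, 0]}
    for line in depth_lines:
        chrom, pos, depth = line.split()[:3]
        key = "exonic" if is_exonic(merged, chrom, int(pos)) else "non_exonic"
        totals[key][0] += int(depth)
        totals[key][1] += 1
    return {k: (s / n if n else float("nan")) for k, (s, n) in totals.items()}

# Made-up example: one exon at chrX:101-105 (1-based), flanked by intronic bases.
exons = load_intervals(["chrX\t100\t105\texon1"])
depth = [f"chrX\t{p}\t400" for p in range(101, 106)]
depth += [f"chrX\t{p}\t40" for p in range(106, 111)]
summary = mean_depth_by_class(depth, exons)
print(summary)
```

Note that `samtools depth` skips zero-coverage positions unless it is run with `-a`, so without that option the non-exonic mean covers only positions with at least one read.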
Does this simplify the original issue or complicate it further?
Any thoughts?
Was this a simple whole genome sequencing experiment? The data is now locked in, and its quality depends on how much care was taken with sample prep and handling of the DNA. You will need to decide whether it is worth moving forward, especially if the data does not contain the information about chrX that you are interested in.
Yes, this was a simple whole genome sequencing experiment.
I am still waiting for a response from the sequencing company to check whether they might have made an error that could explain this behavior, and also from the experimentalist who processed the blood samples and sent the DNA for sequencing.
Thanks for your inputs again
Wouldn't you expect half your samples to have twice the X coverage of the others?
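The expectation behind this question can be checked numerically once reads are aligned to the whole genome: relative to the autosomes, a sample with one X chromosome should show roughly half the chrX read density of a sample with two. A minimal Python sketch, assuming the tab-separated output of `samtools idxstats` (reference name, length, mapped reads, unmapped reads) and hg38-style names `chr1`..`chr22` and `chrX`; the counts below are invented for illustration:

```python
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}

def x_to_autosome_ratio(idxstats_lines):
    """Ratio of mapped reads per base on chrX to mapped reads per base on autosomes."""
    auto_reads = auto_len = x_reads = x_len = 0
    for line in idxstats_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        name, length, mapped = fields[0], int(fields[1]), int(fields[2])
        if name in AUTOSOMES:
            auto_reads += mapped
            auto_len += length
        elif name == "chrX":
            x_reads += mapped
            x_len += length
    if not (auto_reads and auto_len and x_len):
        raise ValueError("need mapped autosomal reads and a chrX entry")
    return (x_reads / x_len) / (auto_reads / auto_len)

# Invented toy numbers: ~1.0 suggests two X copies, ~0.5 suggests one.
two_x = ["chr1\t1000\t2000\t0", "chr2\t1000\t2000\t0", "chrX\t500\t1000\t0", "*\t0\t0\t10"]
one_x = ["chr1\t1000\t2000\t0", "chr2\t1000\t2000\t0", "chrX\t500\t500\t0", "*\t0\t0\t10"]
ratio_two_x = x_to_autosome_ratio(two_x)
ratio_one_x = x_to_autosome_ratio(one_x)
print(ratio_two_x, ratio_one_x)
```

This only makes sense for a whole genome alignment; with a chrX-only reference there is no autosomal baseline to compare against, and real ratios will deviate somewhat from the ideal values.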
Not really. I know the patients, and all except one are female, which is why we suspect an X-linked disorder. This is precisely why I started with the chrX alignment rather than going for the whole genome. I actually expected a lot of other things as well, which I am still not seeing. I will check whether the whole genome alignments of the other samples show the same trend.
I will post an update.