Hi all, I am quite new to this field, so I have some very basic questions. I am trying to map my reads to the Killifish reference genome using the original bwa software. I have young and old samples. The .fastq files contain ~560 M reads, 760 M reads, etc., but after alignment only a small fraction of the reads remain, e.g. 32 M, 41 M, etc. I know the count can be lower than in the FASTQ because of adapter removal and PCR duplicates, but the decrease is very large. I then looked at the read depth using samtools, and the results were 3.35, 4.29, etc. Since I will be searching for genomic rearrangements, these numbers are very low. I cannot understand what the problem is. Why does bwa trim the reads during alignment? Is this because of low quality, or am I doing something wrong?
Another question: I tried to calculate the coverage percentage using the formula coverage = (read count * read length) / total genome size. If I use the read count from the FastQC report, my result is more than 100%, but if I use the mapped reads, my result is very low. For instance:
Total mapped reads 32474596,
read length 150*2 (paired-end),
total genome size 1.53 Gb,
Coverage = 6.37
FastQC total reads 5603750,
read length 150*2 (paired-end),
total genome size 1.53 Gb,
Coverage = 109.9
Could you tell me what is the mistake here? Is it normal to have a result that is more than 100%? Or which reads should I use to calculate this percentage?
Thank you for your help!!
That does seem to indicate some problem with alignments (unless the data was really bad to begin with). You will need to take some of the reads that do not seem to align and blast them at NCBI to see what genome they align to.
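To make the "take some unaligned reads and BLAST them" step concrete, here is a minimal sketch that pulls unmapped reads out of SAM-formatted text and writes them as FASTA, which can be pasted into NCBI BLAST. It relies only on the SAM convention that flag bit 0x4 marks an unmapped read. The function name `unmapped_to_fasta`, the `limit` parameter, and the demo records are made up for illustration; in practice something like `samtools view -f 4` would do the filtering on the real BAM.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def unmapped_to_fasta(sam_lines, limit=100):
    """Yield FASTA records for up to `limit` unmapped reads from SAM text lines."""
    emitted = 0
    for line in sam_lines:
        if emitted >= limit:
            return
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED and seq != "*":
            emitted += 1
            yield f">{name}\n{seq}"


# Tiny hypothetical SAM excerpt: read1 is mapped, read2 and read3 are not.
demo_sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t99\tchr1\t10\t60\t8M\t=\t50\t48\tACGTACGT\tIIIIIIII\n",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\tTTGGCCAA\tIIIIIIII\n",
    "read3\t141\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII\n",
]

records = list(unmapped_to_fasta(demo_sam))
print("\n".join(records))
```

A few dozen reads is usually enough for a quick look; if most of them hit bacteria, human, or adapter sequence, that points to contamination rather than an alignment problem.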
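Regarding your question about coverage: vertical coverage (depth) is normally expressed as a fold value such as 6x, not as a percentage. Even in the best-case scenario where the insert size is >300 and you get 150*2 bases from each read pair, your second case is still 5,603,750 * (150*2) / 1,530,000,000 = 1.1x. A value of 109.9 would require roughly 560 million reads, as in your description, or a result multiplied by 100, so check which read count you actually entered.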
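As a quick check of the arithmetic, here is a small sketch of the depth formula from the question, applied to the two sets of numbers posted. The function name `mean_depth` is just illustrative. It assumes each counted unit contributes 150*2 bases, i.e. that the counts are read pairs; if the counts are individual reads (mapped-read counts from samtools usually count each mate separately), pass 150 instead.

```python
def mean_depth(n_units, bases_per_unit, genome_size):
    """Average depth of coverage (fold, e.g. 6.37x), not a percentage."""
    return n_units * bases_per_unit / genome_size


GENOME_SIZE = 1_530_000_000  # 1.53 Gb, as given in the question
BASES_PER_PAIR = 150 * 2     # assumption: counts are read pairs

mapped_depth = mean_depth(32_474_596, BASES_PER_PAIR, GENOME_SIZE)
fastqc_depth = mean_depth(5_603_750, BASES_PER_PAIR, GENOME_SIZE)

print(f"mapped reads: {mapped_depth:.2f}x")  # 6.37x
print(f"FastQC reads: {fastqc_depth:.2f}x")  # 1.10x
```

Depth can be any positive number, so a value above 1 (or above "100%") is not an error in itself; only breadth of coverage, the fraction of the genome covered by at least one read, is capped at 100%.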
Regarding low mapping: this can have multiple causes, but as a first step I'd check the FastQC report (are there bad-quality sequences? Are you trimming them before alignment, and if so, how?). I also like to use FastQ-Screen to check for contaminants. How do your sequences look in this regard?
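If you want a rough number to go with the FastQC plots, the sketch below estimates the fraction of reads whose mean base quality falls under a threshold. It is not how FastQC computes its report, just a simple stand-in. It assumes standard 4-line FASTQ records with Phred+33 quality encoding; the threshold of 20 and the function names are arbitrary choices for illustration.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def fraction_low_quality(fastq_lines, threshold=20.0):
    """Fraction of reads whose mean quality is below `threshold`."""
    total = low = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        qual = line.strip()
        if not qual:
            continue
        total += 1
        if mean_phred(qual) < threshold:
            low += 1
    return low / total if total else 0.0


# Three made-up reads: high quality, very low quality, and mixed.
demo_fastq = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "########",
    "@r3", "ACGTACGT", "+", "5555IIII",
]

low_fraction = fraction_low_quality(demo_fastq)
print(f"{low_fraction:.2f} of reads have mean quality < 20")
```

To run it on a real file, pass an open file handle (for gzipped data, one opened with `gzip.open(path, "rt")`) instead of the demo list.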