bwa index -p H37Rv_reference_genome -a bwtsw GCF_000195955.2_ASM19595v2_genomic.fasta
bwa mem H37Rv_reference_genome ERR3566420_1.fastq ERR3566420_2_trimmed.fastq >ERR3566420.sam
samtools view -bS ERR3566420.sam >ERR3566420.bam
samtools flagstat ERR3566420.bam
Now when I am seeing the percentage of the mapped reads I am seeing really less alignment to be precise (8.83%) paired and (7.19%) properly paired. My question is if both are related species of Mycobacterium how they are having such low alignment. Do I need to make come changes in the parameters?
Also I used Kraken tool for the alignment using this command
I am not able to understand that why am I getting such low alignment using BWA tool having reference genome as H37Rv genome and why the results from the Kraken tool are so contradictory.
Kindly help me with the possible suggestion,
Thanks.
Hi,
I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?
I am checking the quality of the reads and yes if the either one of
the reads per base sequence is bad, I am trimming only that read
(fastq_file). Should I trim both the fastq files?
Absolutely. If you don't do that then if a read is removed from the trimmed file then that will throw off the order of reads in the file pair. Aligners do not check to see if read pairs are present in corresponding proper order. It would be best to re-trim your data by starting over and using both files at the same time when you trim.
I don't see a contradiction. Aligning to a bunch of mycobacterial genomes is not the same as aligning to a single one. And you aligned some kind of trimmed fastq to the single genome, which might have messed up your paired end matching.
Even after doing the alignment for the good quality data where I didn't require to trim the reads, I could not find the significant alignment result when comparing it with the reference genome which belonged to the same genus.
kriti.awasthi23 : I had a look at the dataset you posted originally (ERR3566420). With (GCF_000195955.2_ASM19595v2_genomic.fna:>NC_000962.3 Mycobacterium tuberculosis H37Rv) alignment stats were as follows using bbmap.sh aligner from BBMap suite.
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 5.5300% 673 5.5300% 203246
This dataset has only ~3.6M reads.
I noticed that ERR* entry mentioned this to be Mycobacterium avium genome so with (GCF_000007865.1_ASM786v1_genomic.fna:>NC_002944.2 Mycobacterium avium subsp. paratuberculosis K-10) things improved a bit.
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 20.6081% 2508 20.6081% 757416
SRR8594817 data is from yet another genome (GCF_000015005.1_ASM1500v1_genomic.fna:>NC_008596.1 Mycobacterium smegmatis str. MC2 155 chromosome) and alignment of this data to that genome results in (with ~10.1M reads)
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 96.0233% 4862842 96.0368% 990823412
So pretty good overall.
Conclusion: ERR3566420 is probably a bad dataset with not enough reads.
Hi Sir,
Thank you so much for providing so much clarity. I am new to this kind of analysis. Can you please guide me as to how you have done the analysis and how you are viewing the stats result (precisely commands)?
Also my basic question is after alignment ho much depth and coverage we should check for the alignment samples. Also can you please guide me with how you got to know 3.1million reads and 10.1million reads respectively.
Also I want to know as to how to understand which data is good and which data to consider for further analysis ? I am checking the quality of the data by FastQC but still even after checking the quality of the data the alignment doesn't happens. What is the criteria for selecting the data ?
Are you sure trimming one fastq and not the other is what you want?
Hi, I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?
Absolutely. If you don't do that then if a read is removed from the trimmed file then that will throw off the order of reads in the file pair. Aligners do not check to see if read pairs are present in corresponding proper order. It would be best to re-trim your data by starting over and using both files at the same time when you trim.
Given that 99% of the untrimmed reads aligned to the Kraken database, they probably shouldn't bother removing any reads at all.
OK thanks, I'll once again repeat the procedure.