Question

Alignment using BWA tool

0

Entering edit mode

5.6 years ago

kriti.awasthi23 ▴ 10

Hi All,

I am doing alignment using BWA mem using command

 bwa index -p H37Rv_reference_genome -a bwtsw GCF_000195955.2_ASM19595v2_genomic.fasta

 bwa mem H37Rv_reference_genome ERR3566420_1.fastq ERR3566420_2_trimmed.fastq >ERR3566420.sam
 samtools view -bS ERR3566420.sam >ERR3566420.bam
 samtools flagstat ERR3566420.bam

Now when I am seeing the percentage of the mapped reads I am seeing really less alignment to be precise (8.83%) paired and (7.19%) properly paired. My question is if both are related species of Mycobacterium how they are having such low alignment. Do I need to make come changes in the parameters?

Also I used Kraken tool for the alignment using this command

  ./kraken2 --db mycobacterium_database/ --threads 20 SRR8594817_1.fastq SRR8594817_2.fastq >SRR8594817_kraken.txt

So the results from this is (99.07%) alignment.

I am not able to understand that why am I getting such low alignment using BWA tool having reference genome as H37Rv genome and why the results from the Kraken tool are so contradictory.

Kindly help me with the possible suggestion, Thanks.

alignment metagenomics kraken BWA tool • 2.6k views

ADD COMMENT • link updated 5.6 years ago by swbarnes2 15k • written 5.6 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

Are you sure trimming one fastq and not the other is what you want?

ADD REPLY • link 5.6 years ago by swbarnes2 15k

0

Entering edit mode

Hi, I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

ADD REPLY • link 5.6 years ago by kriti.awasthi23 ▴ 10

0

Entering edit mode

I am checking the quality of the reads and yes if the either one of the reads per base sequence is bad, I am trimming only that read (fastq_file). Should I trim both the fastq files?

Absolutely. If you don't do that then if a read is removed from the trimmed file then that will throw off the order of reads in the file pair. Aligners do not check to see if read pairs are present in corresponding proper order. It would be best to re-trim your data by starting over and using both files at the same time when you trim.

ADD REPLY • link 5.6 years ago by GenoMax 152k

0

Entering edit mode

Given that 99% of the untrimmed reads aligned to the Kraken database, they probably shouldn't bother removing any reads at all.

ADD REPLY • link 5.6 years ago by swbarnes2 15k

0

Entering edit mode

OK thanks, I'll once again repeat the procedure.

ADD REPLY • link 5.6 years ago by kriti.awasthi23 ▴ 10

score 0 · Answer 1 · 2020-01-02

0

Entering edit mode

5.6 years ago

swbarnes2 15k

I don't see a contradiction. Aligning to a bunch of mycobacterial genomes is not the same as aligning to a single one. And you aligned some kind of trimmed fastq to the single genome, which might have messed up your paired end matching.

ADD COMMENT • link 5.6 years ago by swbarnes2 15k

0

Entering edit mode

Even after doing the alignment for the good quality data where I didn't require to trim the reads, I could not find the significant alignment result when comparing it with the reference genome which belonged to the same genus.

ADD REPLY • link 5.6 years ago by kriti.awasthi23 ▴ 10

score 0 · Answer 2 · 2020-01-03

kriti.awasthi23 : I had a look at the dataset you posted originally (ERR3566420). With (GCF_000195955.2_ASM19595v2_genomic.fna:>NC_000962.3 Mycobacterium tuberculosis H37Rv) alignment stats were as follows using bbmap.sh aligner from BBMap suite.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:              5.5300%             673         5.5300%             203246

This dataset has only ~3.6M reads.

I noticed that ERR* entry mentioned this to be Mycobacterium avium genome so with (GCF_000007865.1_ASM786v1_genomic.fna:>NC_002944.2 Mycobacterium avium subsp. paratuberculosis K-10) things improved a bit.

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             20.6081%            2508        20.6081%             757416

SRR8594817 data is from yet another genome (GCF_000015005.1_ASM1500v1_genomic.fna:>NC_008596.1 Mycobacterium smegmatis str. MC2 155 chromosome) and alignment of this data to that genome results in (with ~10.1M reads)

Pairing data:           pct pairs       num pairs       pct bases          num bases
mated pairs:             96.0233%         4862842        96.0368%          990823412

So pretty good overall.

Conclusion: ERR3566420 is probably a bad dataset with not enough reads.