Hi all
I need some help with sequence alignment using BWA as I'm at my wits end.
Background: I've sequenced wheat DNA (note, highly repetitive) using the Illumina HiSeq 2000. I have contigs and scaffolds as assembled by the sequencing company and I have two reference genomes that should be pretty similar to my sequences. I have attempted sequence alignments of contigs and scaffolds against both reference genomes using BWA bwasw with default parameters. Note that the two reference genomes are in large scaffolds themselves.
Problem: I keep getting really bad alignments! Aligned sequences are heavily truncated in every instance, with less than half the sequence actually aligning to the reference! And if something does align, it's repetitive sequence. I have masked repetitive sequences and this alleviates the problem a little but doesn't solve the massive truncating problem. In addition, the majority of my sequence data doesn't align to anything and is clustered under "unaligned sequences".
Anyone with BWA experience or any experience with sequence alignment? I'd really appreciate a push in the right direction as this has been keeping me busy for much too long!
Thanks!
Hi, could you please add some examples of the commands you used, and of the version of the data formats you are using? It is difficult to know what is the cause of your problem otherwise. Can you check if the quality scores in your fasta file are encoded using the phred+33 scheme, and not phred+64 (see wikipedia - fastq)? Another possibility is to use a software like Trinity for doing assembly without a reference genome.
The scripts I was using to align is as follows:
The data formats are in fasta. I did create an index for the reference genome before hand.
I'm not sure about the quality scores for my fasta file - what impact could that have?
I'm not really using the mapping to assemble, I need to compare to the reference genome for comparison's sake.
The quality scores would affect your results because if they are not encoded correctly, bwa may think that the sequences have a lower quality than the real one, and align them more poorly. But I don't think that this is your case anyway. In any case, you should look at the distribution of quality scores in your dataset (e.g. check the option QualityScoreDistribution in Picard Tools).
Another question is if there any specific reason why you used the bwa bwasw algorithm. The bwa mem should be more accurate, did you try that?
p.s. in the command line you posted there are some errors in the paths to the reference genome index file and the fasta files.