Hi All, This question seems quite simple, but, it seems to be very difficult to get an answer.
I have a file cns_pairs.fastq
containing reads. I have built an index using command:
./bwa/bwa aln -t 16 ./hg19.fa ./cns_pairs.fastq > ./cns_pairs.sai
After which, I output the .sam file by calling:
./bwa/bwa samse ./hg19.fa ./cns_pairs.sai ./cns_pairs.fastq > ./cns_pairs.sam
As you can see I am aligning against file hg19.fa
, which is hg19, however, the reads in cns_pairs.fastq
all
derive from chromosome 22; these were generated by use of ART sequence simulator, where hg19_chr22.fa
was
used as input.
The length of chromosome 22 in the file is reported as 51304566 in the header; indeed, if I count these with a script I get the same value; just a sanity check. However, there are a large number of 'N' character in this file, which are present at the beginning and of the file. Again, if I write a python script to count the index that the 'N' characters end (from the start), and the index that the 'N' characters start (from the end) I get the values of 16050000 and 16697850 respectively.
For clarity, a diagram of what the file looks like is as follows:
idx: 0|1|...|16050000|16050001|...|16697849|16697850|16697851|...|51304566
chr: N|N|...|A |T |...|G |N |N |...|N|
So, we can see there is continuous 'N' characters from 16697850 until the end of the file.
When I check the .sam file, I find that a portion of reads (~500) have aligned at positions within this final stretch of 'N' characters. For example, one read aligns at position 45418248. All the reads align at an index from 0 - 51304566, so they are clearly aligning to the length of the chr22 file, and this is not aligning to the reference outside of chromosome 22, whilst flagging the header as chromosome 22.
This seems very strange to me, and I have a couple of questions relating to this issue:
- How/why are these reads be aligning to the N position?
I would have imagined bwa would want to prevent this. I have read one post online, suggesting that, because of bwa 2-bit representation, 'N' and other characters need to be converted into a random sequence of [ATCG] - however, in the same post, another user has opposed this comment, suggesting that bwa cleaves the 'N' characters. Looking through the source code, it is very difficult to identify the code io section that deals with extended character sets, and how they are handled.
- Is there a flag that can be set in order to prevent alignment to 'N'?
Any help here would be great - this is the most important question, as I want to be able to remove these alignments. If this is a known problem, and there is no hope (apart from modifying the source code), but another alignment algorithm handles this issue, then please, suggest other alignment algorithms. I am using alignment as a sub-routine for an algorithm I am developing, another alignment algorithm would work just as well if it solves this issue.
Best, Izaak
UPDATE:
function add1()
seems to handle the situation, i'm working my way through to see if I can get the bottom of exactly how 'N' chars are handled. Anybody wants to chip in with advice, here is the link:
UPDATE 2:
Indeed, add()1
does handle the situation, and it looks as though a random [ATCG] generate is
used to replace an 'N'.
Example sam entries of reads aligning to contiguous 'N' regions - excuse the weird header (QNAME) it is required for our algorithm:
UPDATE 3:
It may be worth noting, that these "reads" that we are aligning are the product of out algorithm.
Consequently, we need to generate the quality string associated with the read ourselves; for this
we decided to give a maximum quality ~
for all positions. I have read around that bwa does
take this quality string into account for alignments. Perhaps we should be using a randomly generated
sequence, in order to not bias the results - or, maybe be assuming the worst. Any thoughts?
GATTACAGACGTGAGCCACTGTGCCTGGCCCTTTTTTTTTTTTTTTTTTTTTTTTTTCCAAAGAGATGGCGTTTCACTCT[11;18;43704232] 16 22 29934750 37 109M * 0 0 CACTCCATCCTGAGCAATAGAGTGAAACGCCATCTCTTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAGGGCCAGGCACAGTGGCTCACGTCTGTAATCCCAGCACTTTG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:109
TGAGCCACCGCATCCGGCCCTTTTTTTTTTTTTTTTTTTTTTTTTTTCCTAGAGGGAGTCTTACTCTTGTCACCCAGG[12;11;43704235] 16 22 32375876 37 101M * 0 0 CTGTTCTCCAGCCTGGGTGACAAGAGTAAGACTCCCTCTAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGCCGGATGCGGTGGCTCATGCCTGTAATCC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTCGGCTGGAATGACCTGGG[31;9;43704240] 0 22 47159264 37 92M * 0 0 CCACCTCCTATGTTGTGTTAGCAGGAAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTCGGCTGGAATGACCTGGGTACACCCTA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:92
TTTTCTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTGCACTTCGCC[24;26;43704255] 16 22 39258488 37 100M * 0 0 CAGGTGGGGCTCCACTGTCCTGTGTGGGCGAAGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAGAAAGAAAAAAGAAAAACAAAAGAAAAAAGAAA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:100
ATTGATGATATGCATTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCCGGAATACCACAGAAGGGAGATTGGGT[8;0;43704259] 16 22 17803639 37 80M * 0 0 ACCCAATCTCCCTTCTGTGGTATTCCGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAATGCATATCATCAATCTAATCCT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:80
So, not only do the reads align, to what appears to be these 'N' regions, they also have a high MAPQ score;
which I suppose means they align very well. Indeed the NM field of BWA reports a zero edit distance!!
I would say this is highly unlikely that these reads would align with a zero edit distance, given the contiguous
'N' sequences are filled with random [ATCG].
Is there something wrong with how I am using bwa aln
given the above commands? Or a flag I can add?
UPDATE 4: There was no effect on the alignment by switching the fastq quality string to the lowest quality.
show us somes lines in the SAM where a read map in a 'NNN' region...
What you're seeing wouldn't be unheard of - I had a very similar problem with SOAPaligner: SOAPaligner 2.21 - does it replace all Ns by Gs in reads?
It seems a bit fishy; the overwhelming majority of my false positives ~ 4000 come from apparent alignments to these regions.
Hmmm... thanks Philipp, interesting. That seems almost like the opposite of my problem (as for me the Ns are in the reference, rather than the reads), but handled with similar logic (logic in a code sense; for me this handle seems pretty illogical!).