Hi :)
I'm trying to run Bismark through Picard, and Picard really doesn't like it. At first I thought perhaps Picard was being a little over-the-top regarding the standards, but when i looked at the alignments it is complaining about, i dont have any idea what this is supposed to be either:
> HWI-ST552:160:C2J56ACXX:2:1213:17561:58963/1 99 * 19232 7 101M * 19505 374 TAATAGAGGAATGGATATAGAAAATGTGGTATATTTATATAATGGAATATTATTTAGTTATTAAAATAATGAATTTATGAAATTTTTAGGTAAATGGATGG CCCFFFFFHHHHHJJJJJJJJJJJJJHHIFHIJJJJJJJJJJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJIJJJIJJIIIJJJIJJJGHFHHHFFFFFFA NM:i:14 MD:Z:0C2C13C13C5C1C9C2C1C2C8A17C0C4C10 XM:Z:h..x.............x.............h.....h.h.........h..h.x..h..........................hh....h.......... XR:Z:CT XG:Z:CT
> HWI-ST552:160:C2J56ACXX:2:1213:17561:58963/1 147 * 19505 7 101M * 19232 -374 GTTTTTTTTTAGAGATGGGAATAAAATATTTATGGAAGGAGTTATAGAGATAAAATTTGGAGTTGTGATGAAAGGATGGATTATTTAGTGATTGTTATATG 9DDDDFFFFFFHHHGGGJJIIJJJJJJJJJJIJIJJJJJIJJJJJJJIJJIJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC NM:i:20 MD:Z:1C0C0C0C2C0C12C4C1C0C0C13C5C11C5C11C0C2C9C0C5 XM:Z:.hhhh..hh............h....h.hhh.............x.....h...........x.....z...........hh..h.........hh..... XR:Z:GA XG:Z:CT
The read flags indicate both reads - which have the same pair ending of /1 (I grepped on the QNAME before the /) - are both mapped to the reverse strand in a proper pair (which isn't possible). They also have POS and TLEN values which make sense, but neither are mapped to any chromosome. This would be OK if the reads were marked as unmapped, but their marked as mapped in a proper pair!
What am I supposed to do with these reads? I'd like to throw them in the bin if possible, but i dont know how to:
- Count the reads Picard doesnt like
- Throw away the reads Picard doesnt like if the above number is small enough.
Thank you! :)
Are those directly from bismark or after picard may have munged them?
Straight out of Bismark (EDIT: this wasnt true, see below)
Eeek, maybe I should make bison available to everyone on the cluster. I had previously used it a bit to test the slurm queue...
What version of bismark is this? What version of bowtie2 are you using it with?
Bismark v0.15.0 Bowtie 2.2.7 64bit :)
I'd be very happy to try out bison. At the moment the consortia is using an in-house WGBS mapping pipeline (methylCtools) and i'd like a second opinion so I can say I did my due diligence.
Honestly, just install bwa-meth from brentp and see what it gives you on that sample. When we compared bwa-meth and bison the qualities were pretty similar and bwa-meth is much easier to install (granted, it doesn't scale like bison, but for a single sample that's not an issue).
I've had my issues with methylCtools (mostly the methylation extraction stuff, which it does wrong).
Ah, well BWA is my aligner of choice! :D OK i'll try bwa-meth.
I'll give it a shot and compare it to Bismark's output which i have for this 1 sample now (took me like 100 hours to map the four lanes, so I feel compelled to persevere :P)
In a recent presentation i said "methylCtools seems to use a version of bwa thats 4 years old, but this is probably a mistake" - to which i was informed this was not a mistake. When my data is published, the aligner will be 5 years old -_- Perhaps that's fine, but its means Bismark/bwa-meth is definitely worth a look..
Wait, wait oh god - i made a mistake.
The output from Bismark was unsorted, so I sorted it via "samtools sort". I sorted by position.
I just ran FixMateInformation again on the file that actually came "straight out of bismark", and it seems to have worked :/ I mean, it gave a ton of errors about "Alignment start should be 0 because reference name = *", but the error about the offending reads has gone. Weirdly, the reads are still there in the bam:
OK, well, whatever - I will continue as usual I guess :/
Would still be very good to see bison on the cluster if that streamlines all of these issues for people :)
I should update this with, even though I managed to FixMateInformation, Picard's ValidateSam really doesnt like the output of Bismark and whilst going deeper into why, it seems some reads Bismark generats have totally unique QNAMEs. QNAMEs not in the original FASTQ. This whole thing puts such a big question mark over what is coming out of that program, that I ditched it in favour of bwa-meth which i got up and running without any troubles at all.
bwa-meth was also a LOT faster than bismark, and really maxed-out my CPU usage rather than Bismark which was memory bound. The downside is that bwa-meth only does C -> T conversion on the reference genome, whilst Bismark does that and an G -> A conversion. I don't know what that means practically, but once i have some mapped reads i'll understand better.
This will only matter if you have a non-directional library, which you don't unless you're using something really old[1].
[1] "really old" is probably 3+ years old in this context. Even then, though, everyone was switching to directional library prep.
Is there any way to tell from the data if it's directional or not from just the sequenced reads? Perhaps from the output of MarkAdapters (or any other adapter-marking software)? Finding the person in the consortia who did the sequencing isn't so easy for me :<
The only way to tell is to align a few reads and look at the metrics. Normally one would align a 100,000 or a million reads with bismark or something else in non-directional mode and see if the complementary strands have any reads. It's been a while since I tried to use bwa-meth, so I don't know if this is possible directly with it.