I've gotten so much help here previously and I'm looking forward to help here again!
I'm trying to use DEXSeq (a Bioconductor package) to identify and quantitate changes in splicing following expression of a transgene. I've performed the mapping of my Illumina paired-end data using BWA and now have SAM files. No problems up to this point.
The problem occurs when I run dexseq_count.py, a Python script associated with the DEXSeq package which is dependent on HTSeq in Python. I keep getting the following messages:
"UserWarning: Read HWUSI-EAS381R:1:10:12325:4848#CGATGT claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)"
After a couple of searches I've found that HTSeq seems not to like BWA-generated SAM files. What I have not found was a work-around for this (although some people write things like "I wrote a perl script to get around this problem" with no further details).
I can use Bowtie (which is recommended), but we have a BWA Convey system so the speed up would be phenomenal, and I have quite a few of these.
From this [?]old seqanswers post[?], the bottom two posts by Simon Anders who wrote HTSeq:
If you have paired-end data,
htseq-count expects that the SAM file
is sorted by read name. This is
because the SAM format prescribes that
paired reads have the same read ID,
and hence, sorting causes the read
pairs to appear in adjacent line.pairs to appear in adjacent line.
Your error suggest that HTSeq cannot find the aligned mate data. It might be because your SAM file is not sorted by read name.
Thanks so much for your help, DK. I've tried this and have thus far not been able to get the script to work. I'm trying the sampe step again in hopes that something went wrong there and that's why I'm not getting the script to work.
Yes, I've tried SAMTools, and although the sorting worked, the script still found problems. I've even used a linux one-liner. Neither has worked. I'm redoing the sampe step now just hoping that something went wrong there. Thanks for your help!
I wanted everyone to know what ended up happening with this:
I had used SAMtools sort, and was still getting the same problem.
It seems that when I demultiplexed my Illumina run, the same number of reads did not end up in each paired fastq file. So I had around 15 million reads in my read1 file, but only 11 million in my read2 file.
I wrote a perl script to make sure the same names are in each file. If you'd like a copy of this perl script comment on my answer here and I'll send it to you. Or if there's a way to share easily I'd be happy to use that too.
Thanks so much for your help, DK. I've tried this and have thus far not been able to get the script to work. I'm trying the sampe step again in hopes that something went wrong there and that's why I'm not getting the script to work.