Hello,
I recently ran a whole-genome sequencing (PE150) experiment on 5 strains of C. elegans. I aligned everything using SNAP, and after processing the data I saw that, across all samples, only 33% of the reads mapped to the reference. I also tried mapping to E. coli to check whether the remaining reads came from contamination, but only ~3% mapped there. I looked at the unmapped reads using:
samtools view -f 4 bamfile.bam
All of the unmapped reads I looked at have either SAM flag 133 (read paired (0x1), read unmapped (0x4), second in pair (0x80)) or 165 (read paired (0x1), read unmapped (0x4), mate reverse strand (0x20), second in pair (0x80)).
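For reference, the two flag values can be decoded bit by bit with a small shell function. This is only a sketch: decode_flag is a hypothetical helper name, and the short labels are my own shorthand for the bits defined in the SAM specification. If your samtools build has the flags subcommand, "samtools flags 133" should report the same information.

```shell
# Hypothetical helper: list the SAM flag bits that are set in a decimal flag value.
# The labels are shorthand chosen for this example, not official names.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}     # numeric value of the bit
        name=${pair#*:}     # label for the bit
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

decode_flag 133   # paired,unmapped,second_in_pair
decode_flag 165   # paired,unmapped,mate_reverse,second_in_pair
```

In both values the second-in-pair bit (0x80) is set and the mate-unmapped bit (0x8) is clear, so these records are second reads whose mates did align.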
If I BLAST the read sequences, some have no match, but many have perfect or near-perfect matches to the C. elegans genome that cover only part of the read, roughly 100-145 of the 150 bases. Is there a reason I would have so many reads with these flags in my data, and is there anything I can do at this point to recover these unmapped reads?
EDIT: I've looked more closely at the BLAST results for the reads and found many cases where half of the sequence matches a C. elegans sequence perfectly and the best match for the other half is Diabrotica undecimpunctata virus 1, Asarum shuttleworthii chloroplast or Cyprinus carpio DNA (these have come up several times). In other cases, 120-140 bases of the read match a segment of C. elegans DNA and the remainder matches a nearby sequence, but in the opposite orientation.
Thank you,
Tyler
Are you sure that you have filtered the data properly to remove adapter sequences and low-quality bases? If yes, then to get a quick idea of what contamination is present in your sample, you can upload your reads to the Kaiju web server.
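As a quick check before re-trimming, you could count how many reads still contain an adapter prefix. The sketch below is only an illustration: count_adapter_reads is a hypothetical helper, it assumes plain 4-line FASTQ records, and the default AGATCGGAAGAGC is the common Illumina TruSeq adapter prefix, which is an assumption about your library prep, so substitute the sequence for your kit.

```shell
# Hypothetical helper: read FASTQ on stdin and report how many reads contain
# the given adapter prefix. Assumes 4-line records (sequence on line 2 of each).
count_adapter_reads() {
    adapter=${1:-AGATCGGAAGAGC}   # assumed default, replace with your kit's adapter
    awk -v a="$adapter" '
        NR % 4 == 2 { n++; if (index($0, a)) hit++ }
        END { printf "%d of %d reads contain %s\n", hit, n, a }
    '
}

# Usage (the file name is a placeholder):
#   gzip -dc sample_R2.fastq.gz | count_adapter_reads
```

A high fraction in read 2 would suggest that trimming was skipped or incomplete. This only finds exact matches, so it will undercount adapters that contain sequencing errors.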