Hello,
before asking my question, I should point out that I'm working with publicly available data, not my own, in order to learn and establish a proper workflow for when real data arrives in the laboratory.
I'm dealing with some exome data[1] from an Ion Torrent 318 chip and I'm trying to run the GATK RealignerTargetCreator on it to perform recalibration later on. The problem is that some reads have a deletion at the end:
read ends with deletion. Cigar: 179S54M1D5M1I9M1D
As a result, GATK cannot process them. How should I handle this case? Is the workflow I used (outlined below) to blame for this?
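To find out how many reads are affected before handing the file to GATK, the CIGAR strings can be checked directly. The following is only a minimal sketch in plain Python: the function names `parse_cigar` and `ends_with_deletion` are made up for illustration, and the option to skip trailing clipping is an assumption about how the check might be applied, not a description of GATK's internal logic.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":
        # "*" means the CIGAR is unavailable (e.g. an unmapped read).
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Sanity check: the parsed pieces must rebuild the original string.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def ends_with_deletion(cigar, ignore_clipping=True):
    """Return True if the last operation is a deletion (D).

    With ignore_clipping=True (an assumption of this sketch), trailing
    soft/hard clips (S/H) are skipped before looking at the last operation.
    """
    ops = parse_cigar(cigar)
    if ignore_clipping:
        while ops and ops[-1][1] in "SH":
            ops.pop()
    return bool(ops) and ops[-1][1] == "D"
```

Applied to the CIGAR column (field 6) of each SAM record, this gives a quick count of the problematic reads, which could then be inspected or filtered out before running GATK.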
Steps I took:
First, QC: keep reads with a Phred score of at least 20 in at least 80% of the bases (Python script modeled on the FASTX-Toolkit).
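For reference, the QC rule above can be sketched roughly as follows. This is not the actual script, just an illustration of the criterion: the names `passes_quality` and `filter_fastq` are hypothetical, and a Phred+33 (Sanger) quality encoding is assumed.

```python
def passes_quality(qual, min_phred=20, min_fraction=0.8, offset=33):
    """Return True if at least min_fraction of the bases have Phred >= min_phred.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_phred)
    return good / len(qual) >= min_fraction

def filter_fastq(lines, **kwargs):
    """Yield (header, seq, plus, qual) records that pass the quality rule.

    lines is any iterable of FASTQ lines (e.g. an open file). Records are
    assumed to span exactly four lines; an incomplete trailing record is
    silently dropped.
    """
    it = iter(lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        if passes_quality(qual.rstrip("\n"), **kwargs):
            yield header, seq, plus, qual
```

A typical use would be to open the raw FASTQ file, pass the file object to `filter_fastq`, and write the four lines of each yielded record to the filtered output file.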
Then, realignment with bwa bwasw (keep in mind that Ion Torrent reads can be up to 250 bp long):
bwa bwasw -t 8 hg19.fa C30-101.filtered.fastq > C30-101.sam
This was followed by conversion to BAM, addition of read groups (RG), sorting, and indexing (pysamtools).
Then GATK was invoked as
gatk -T RealignerTargetCreator -R hg19.fa -o input.bam.list -I C30-101_RG.bam
(gatk is a small wrapper that merely hides the java -Xmx -jar ... stuff.)
[1] http://lifetech-it.hosted.jivesoftware.com/docs/DOC-2659 (registration may be required)
Replace "179S54M1D5M1I9M1D" with "179S54M1D5M1I9M1S" (last D to S). Sorry.