I have 50bp PE RNA-seq, and 4 of my ~200 samples end in error after running Cufflinks (v.2.2.1). Here is my pipeline prior to running Cufflinks:
- STAR 2-pass (GRCm38)
- Added read group information with Picard
- Marked (but did not remove) duplicates with Picard
- SplitNCigarReads with GATK
- Realigned around indels with GATK
- Performed base recalibration with GATK
I've been looking at one particular sample with the error and have narrowed it down to a couple reads (see below):
PC168224:108:C90EBANXX:5:1303:19181:37569 179 6 85461128 60 43M345H = 85461508 338 CCCTCAGATCATCATCCGAGCTTTCCGCACAGCCACCCAATTG C46FE@3CDECDBDD>/C6FE?FD@:EA?5CDBB>33<::;<B MC:Z:380H4I4S BD:Z:KOOOOMMLMMLMMLLMMLNNJFLLMNNNMOPPQSSQVSSKMMM PG:Z:MarkDuplicates RG:Z:HL126 NH:i:1 BI:Z:JNOOSQOQNOOMNONJKKOMIBKNJNMOOPQLOPQNUSSNNMM HI:i:1 nM:i:1 MQ:i:70 AS:i:97 XS:A:+
PC168224:108:C90EBANXX:5:1303:19181:37569 179 6 85461508 70 380H4I4S = 85461128 -338 GGTGGCTG AC?9=5EB MC:Z:43M345H OC:Z:380H4M4S BD:Z:ONNOPPMM PG:Z:MarkDuplicates RG:Z:HL126 NH:i:1 BI:Z:NORNPPMM HI:i:1 nM:i:1 MQ:i:60 AS:i:97 XS:A:+
Now if I put those to reads into a separate BAM file (with original header from the sample) and running cufflinks I get the following:
cufflinks -o test. test.bam
You are using Cufflinks v2.2.1, which is the most recent release.
[14:04:40] Inspecting reads and determining fragment length distribution.
Segmentation fault
I've look at those two reads but am not sure why about them causes a segmentation fault. Any ideas?
Ah, that seems to be it. Replacing the cigar string with the following fixes the issue:
Now to figure out why there are bad cigar strings in some of my samples...
i Think it's still wrong anyway because an alignment cannot start with an insertion.
You're probably right. I was just testing it out to see if what you said would fix the issue. I' am going to see how systematic an issue these bad cigar strings are in my files, and if it very few reads then I'll just remove them.