I have some BAM files, which were trimmed using BBDUK:
java -ea -Xmx7174m -Xms7174m -cp jgi.BBDuk in=test.fastq.gz out=test.fastq.gz.trimmed_clean ref=polyA.fa.gz,truseq_rna.fa.gz k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20
And aligned with STAR using the following example command:
STAR --genomeDir refgenome --readFilesIn test.fastq.gz.trimmed_clean --outFileNamePrefix test.fastq.gz.trimmed_clean_align --sjdbOverhang 73 --runThreadN 11 --outSAMtype BAM SortedByCoordinate --twopassMode Basic --twopass1readsN -1 --outFilterScoreMinOverLread 0.5 --outFilterMatchNminOverLread 0.5
I sort and add on read groups (WTCHG_524616_7039_1) and after running ValidateSamFile I got a warning
java -jar /data/BCI-EvoCa2/jacob/Central/software/picard.jar ValidateSamFile I=output_test.sorted.bam MODE=SUMMARY
warning:
WARNING:MISSING_TAG_NM 7774993
which I fixed with:
samtools calmd -bAr test.sorted.bam ucsc.hg19.fasta > output_test.sorted.bam
However after when I run HTSeq to call counts using the following command:
python /data/BCI-EvoCa2/salpie/HTSeq-count-master/HTSeq/scripts/count.py -s no --format=sam output_test.sorted.bam /data/BCI-IBD/analysis/RNA/gencode.v19.annotation.gtf > test.sorted.sam.output
After the GTF file is read in I get the following error:
Error occured when processing SAM input (line 7461 of file output_test.sorted.sam):
local variable 'NH' referenced before assignment
[Exception type: UnboundLocalError, raised in count.py:201]
When I look convert the file to a sam and then rerun to find the line that's wrong in the sam file, it looks like this. Can anyone tell me what's wrong with it?
K00150:332:HVM5TBBXX:7:2124:9303:4532 16 chrM 221 255 43M10439N16M * 0 0 GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCCTAGTCTCAATCTCCAA FJJJFA7JJAJJJFFJJJJFJFFFJJFFFJJJFFJJJFFFFJJJJJJFF<-AJFF7<<A RG:Z:WTCHG_524616_7039_1 NH:i:1 HI:i:1 nM:i:1 AS:i:56 NM:i:1 MD:Z:42G16
Other lines look like this (before and after)
Line above
K00150:332:HVM5TBBXX:7:2124:19573:1930 16 chrM 221 255 52M * 0 0 GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAGCCGCTTTCCACA EJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFFFAA RG:Z:WTCHG_524616_7039_1 NH:i:1 HI:i:1 nM:i:0 AS:i:51 ZQ:Z:E@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ NM:i:0 MD:Z:52
Line below
K00150:332:HVM5TBBXX:7:2124:11353:4532 16 chrM 221 255 56M * 0 0 GCTTGTAGGACATAATAATAACAATTGAATGTCTGCACAACCGCTTCCCACACAGA EJJJJJFFJJJJJJJJJJJJJFJJJJJJFJJJJJJFFJ::JJJJ2/-/2FFAF<AA RG:Z:WTCHG_524616_7039_1 NH:i:1 HI:i:1 nM:i:2 AS:i:51 ZQ:Z:E@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@PL@@@@X[][X@@@@@@@ NM:i:2 MD:Z:39G6T9
The 4th column in the line at fault reads "43M10439N16M" while the other okay lines read "52M " and "56M " is this causing the error and does anyone know how to fix this?
Thank you!