Hi, there
I was trying to add read groups to my alignments and take advantage of GATK to call variants. I first did my cleaned reads and added read groups to the new alignment with (just one example file, I have multiple alignments for the whole dataset) :
bwa aln -n 0.04 -k 5 -I -t 4 contig ~/Desktop/Test_dataset2/1_cold_trimmed_clipped.fastq | bwa samse -r '@RG\tID:1_cold\tSM:1_cold\tPL:Illumina' contig - ~/Desktop/Test_dataset2/1_cold_trimmed_clipped.fastq > ~/Desktop/Test_dataset2/auto_data_31/aligned_sam/1_cold.sam
Then I converted my generated sam file and sorted them with :
samtools view -bS 1_cold.sam | samtools sort -m 300000000 - ./sam_bam/1_cold_sorted
Then I replace the duplicates:
java -Xmx2g -jar ~/Desktop/apps/picard-tools-1.92/MarkDuplicates.jar INPUT=1_cold_sorted.bam OUTPUT=1_cold_dedup.bam METRICS_FILE=1_cold_metricsfile MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=250 ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=True
Then merged all of the files after reducing duplicate:
samtools merge -rh RG.txt merged.bam *dedup.bam
samtools index merged.bam
java -jar ~/Desktop/apps/picard-tools-1.92/CreateSequenceDictionary.jar R= contigs.fa o=contigs.dict
samtools faidx contigs.fa
Then I was ready to realign with GATK:
java -Xmx2g -jar ~/Desktop/apps/GenomeAnalysisTK-2.6-5-gba531bd/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R contigs.fa \
-o merged_output.intervals \
-I ~/Desktop/Test_dataset2/auto_data_31/aligned_sam/sam_bam/merged.bam \
--minReadsAtLocus 3
There was an error message:
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/home/bq/Desktop/Test_dataset2/auto_data_31/aligned_sam/sam_bam/merged.bam} is malformed: Read HWI-ST141_0363:2:1101:11445:6328#TAGCTT uses a read group (5_hot_dedup) that is not defined in the BAM header, which is not valid. Please use http://gatkforums.broadinstitute.org/discussion/59/companion-utilities-replacereadgroups to fix this problem
And I went back to the original file to check header: They were both there.
samtools view -h 5_hot_dedup.bam | grep "^@RG"
@RG ID:5_hot PL:Illumina SM:5_hot
grep "^@RG" 5_hot.sam
@RG ID:5_hot SM:5_hot PL:Illumina
And I rerun the whole analysis from the original sequence, the same error occurred again, Could you please shed some light on where I could possibly do wrong? I had worked on this for nearly weeks, but little progress has been made. There might be some other options like in picard addreadgroups, but i really want to know what did i do wrong with the analysis? Thank you very much for any suggestions and sorry for the long post.
in
samtools merge -rh RG.txt
what's the content of RG.txt ?Here is the content in RG.txt for the complete samples