Well... my god, I have to fix a lot of things before I'm able to call my SNPs.
I got my BAM file directly from FASTQ using BBMap (thanks to genomax).
Of course my BAM file is not sorted and has no index, so I do...
samtools sort out.bam sorted.bam
samtools index sorted.bam sorted.bai
Then I continue to the next step
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -I sorted.bam -o sorted.intervals
and I get this error message
Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
and I use this command to fix it
java -jar /home/cri/Desktop/picard1/AddOrReplaceReadGroups.jar I=sorted.bam O=header.bam RGLB=LIBRARY RGPL="Ion Torrent" RGPU=RUN RGSM=SAMPLE RGCN=BCM
Then I try again
java -jar /home/cri/Desktop/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg19.fa -I header.bam -o header.intervals
Then I get another error (I jump from error to error... ^" )
ERROR MESSAGE: Invalid command line: Cannot process the provided BAM/CRAM file(s) because they were not indexed. The GATK does offer limited processing of unindexed BAM/CRAMs in --unsafe mode, but this feature is unsupported -- use it at your own risk!
I realize my index file was still called sorted.bai, so I rename it to header.bai... that didn't work either, so I index header.bam again
samtools index header.bam header.bai
and identifying target regions for realignment finally works....
Why do I need to index twice?
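GATK is one of the few programs that requires read groups and there is no getting around that requirement. You had to index twice because AddOrReplaceReadGroups writes a new BAM file, and an index is only valid for the exact file it was built from, so renaming sorted.bai cannot work. Next time add the read groups before sorting and indexing.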
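A minimal sketch of that advice, assuming the Picard jar location used earlier in the thread. Picard's SORT_ORDER and CREATE_INDEX options let AddOrReplaceReadGroups sort and index in the same pass, so no separate samtools calls are needed. The run helper, the function name and the DRY_RUN switch are made up for illustration, and RGPL=IONTORRENT is the platform spelling from the SAM specification rather than the value used above.

```shell
# Sketch only: path and read-group values are placeholders from the thread.
PICARD_DIR=${PICARD_DIR:-/home/cri/Desktop/picard1}

# Run a command, or just print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# Add read groups, coordinate-sort and index in a single Picard pass.
# $1 = unsorted input BAM, $2 = output BAM
add_read_groups() {
    run java -jar "$PICARD_DIR/AddOrReplaceReadGroups.jar" \
        I="$1" O="$2" \
        RGLB=LIBRARY RGPL=IONTORRENT RGPU=RUN RGSM=SAMPLE RGCN=BCM \
        SORT_ORDER=coordinate CREATE_INDEX=true
}

# Usage: add_read_groups out.bam header.bam
```

Setting DRY_RUN=1 first prints the command instead of running it, which is a cheap way to check the arguments before a long job.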
Thanks to all for the tips :) (I put them in my notes)
My PC is now working on the step: local realignment around indels :))))
Seems it is going to take a looong time.... my god, you need a lot of patience to do this kind of job... or maybe a better PC :P
Yes, it will take a while. A few steps in GATK can be sped up by throwing more cores at them (and of course there is the trick of splitting up the analysis by chromosome after alignment and running the pipeline on each chromosome separately).
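The per-chromosome trick could look roughly like this for the realignment step, using GATK's -L option to restrict each run to one contig. This is a sketch under assumptions: the function name, the run helper, the output naming and the .done marker files are all invented for illustration, and the contig list assumes hg19-style names.

```shell
# Sketch only: jar path and file names follow the thread, the rest is hypothetical.
GATK_JAR=${GATK_JAR:-/home/cri/Desktop/GATK/GenomeAnalysisTK.jar}

# Run a command, or just print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# hg19-style contig names; adjust if your reference names them differently.
CHROMS="chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM"

# $1 = input BAM, $2 = intervals from RealignerTargetCreator, $3 = output dir
realign_by_chrom() {
    bam=$1; intervals=$2; outdir=$3
    for chrom in $CHROMS; do
        out="$outdir/realigned.$chrom.bam"
        # The marker is written only after GATK exits cleanly, so a rerun
        # after a crash or power cut redoes just the unfinished chromosomes.
        if [ -e "$out.done" ]; then continue; fi
        run java -jar "$GATK_JAR" -T IndelRealigner -R hg19.fa \
            -I "$bam" -targetIntervals "$intervals" -L "$chrom" -o "$out" \
            && run touch "$out.done"
    done
}

# Usage: realign_by_chrom header.bam header.intervals realigned_parts
```

This version runs one chromosome at a time, which mainly buys restartability. On a machine with spare cores and memory, several chromosomes could be started in the background instead.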
Thanks Chris
OK.. so I can split the job, smart... If the power goes out at home I will cry. I will see if I can find the code for how to do it!
I have 14h left now! My PC has just 4 AMD cores; I invested mostly in RAM (32 GB), and I am waiting until they pay me at work (this is Spain...) so I can look for a better CPU :)
If they don't pay you the least they can do is give you access to a server/cluster at work? You really should be doing this type of work on a remote machine that is beefy and is on a UPS, if you have unreliable power at home.
Well.. realignment around indels took nearly three days...
Now I'm stuck at this step
1) Count covariates:
I get this error
ERROR MESSAGE: Walker CountCovariates is no longer available in the GATK; it has been deprecated since version 2.0 (use BaseRecalibrator instead; see documentation for usage)
So I try the next command:
ERROR MESSAGE: Could not read file /media/cri/CRIS_DATA/ULT/OTRO COPIA/otro/latest_dbsnp.vcf because file 'latest_dbsnp.vcf' does not exist
I'm a bit lost on how to get that VCF file
location: ftp.broadinstitute.org username: gsapubftp-anonymous password: <blank>
I go to
ftp://ftp.broadinstitute.org/distribution/human_SNP_releases/ but don't see any VCF file to download
Thanks
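For reference, the BaseRecalibrator replacement for CountCovariates is usually run as two GATK commands: one builds a recalibration table from the BAM plus a known-sites VCF, the other writes a new BAM using that table. A rough sketch, assuming the GATK jar path from earlier in the thread; the function name, the run helper and the file names are placeholders. Note that the failing path above contains a space ("OTRO COPIA"), so it has to be quoted on the command line.

```shell
# Sketch only: jar path follows the thread, file names are hypothetical.
GATK_JAR=${GATK_JAR:-/home/cri/Desktop/GATK/GenomeAnalysisTK.jar}

# Run a command, or just print it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

# $1 = realigned BAM, $2 = known-sites VCF (e.g. dbSNP), $3 = output BAM
recalibrate() {
    bam=$1; known_sites=$2; out_bam=$3
    table="${out_bam%.bam}.table"
    # Step 1: build the recalibration table from the known variant sites.
    run java -jar "$GATK_JAR" -T BaseRecalibrator -R hg19.fa \
        -I "$bam" -knownSites "$known_sites" -o "$table" \
    && \
    # Step 2: write a new BAM with the recalibrated base qualities.
    run java -jar "$GATK_JAR" -T PrintReads -R hg19.fa \
        -I "$bam" -BQSR "$table" -o "$out_bam"
}

# Usage (quotes keep a path with spaces together):
#   recalibrate realigned.bam "/media/cri/CRIS_DATA/ULT/OTRO COPIA/otro/latest_dbsnp.vcf" recal.bam
```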
Which guide are you following for this exercise? Can you post a link?
Also use ADD REPLY when posting additional information like this. SUBMIT ANSWER should only be used for new answers to the original question.
I did wrong.... sorry
http://seqanswers.com/wiki/How-to/exome_analysis
That's the main pipeline I follow, but in the end many steps have been modified, like using BBMap...
Now I downloaded the file from here, don't know if it's the right one
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/GATK/
Thanks!
(I fixed one of the error messages.... I made a mistake with copy and paste)
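One way to judge whether a downloaded VCF is "the right one" for GATK is to compare its contig names with the reference: GATK expects them to agree in name and order (for example chr1 versus 1 is a common mismatch). The check below is only a sketch. It assumes an uncompressed VCF (gunzip a .vcf.gz first) and a reference index hg19.fa.fai made with samtools faidx hg19.fa; the function name is made up.

```shell
# Succeeds only if every contig used in the VCF exists in the reference
# index and the contigs appear in the same relative order in both files.
# $1 = uncompressed VCF, $2 = reference .fai index
contigs_match() {
    vcf=$1; fai=$2
    # Contigs in the order they first appear in the VCF records.
    vcf_contigs=$(grep -v '^#' "$vcf" | cut -f1 | uniq)
    [ -n "$vcf_contigs" ] || return 1
    # Reference contigs that the VCF also uses, kept in reference order.
    ref_contigs=$(printf '%s\n' "$vcf_contigs" |
        awk 'NR == FNR { seen[$1] = 1; next } ($1 in seen) { print $1 }' - "$fai")
    [ "$vcf_contigs" = "$ref_contigs" ]
}

# Usage:
#   contigs_match latest_dbsnp.vcf hg19.fa.fai && echo "contigs look compatible"
```

A failure here does not say which file is wrong, only that the naming or ordering differs and that GATK is likely to complain about incompatible contigs.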