So I have finally generated some reads and run them through what could be called a very rudimentary 'pipeline'. I generated a million paired-end reads with wgsim, aligned them with bwa, and then used samtools/bcftools/vcftools. The commands were:
bwa aln -t 10 hg19 -f test.read1.sai test.read1.fq
bwa aln -t 10 hg19 -f test.read2.sai test.read2.fq
bwa sampe -f test.sam hg19 test.read1.sai test.read2.sai test.read1.fq test.read2.fq
samtools view -S -b test.sam > test.bam
samtools sort test.bam test_sort
samtools mpileup -uf ../hg19/hg19.fa test_sort.bam > test.bcf
bcftools view -vcg test.bcf > test.vcf
Now I know this is simplistic, and I know other tools are available; I will try more in the future. What I would like to know is whether there are any other steps or parameters I should be using with this existing workflow to improve it, e.g. to make it more efficient. Also, if I wanted to run 150 samples, would I just run this 150 times in a row, or would I want to do things differently?
Any help would be appreciated. And please be kind ;)
Nice! I have almost the exact same chain for a germline analysis workflow.
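For the 150-sample case, one option is to wrap the chain in a function that takes a sample prefix. The sketch below only prints the commands for each sample rather than running them, so it can be inspected first. The function name `sample_commands`, the variables `REF_INDEX`, `REF_FASTA` and `THREADS`, and the `<sample>.read1.fq` / `<sample>.read2.fq` naming scheme are assumptions for illustration, not anything bwa or samtools require.

```shell
#!/usr/bin/env bash
# Sketch only: prints the per-sample commands instead of executing them.
# REF_INDEX, REF_FASTA, THREADS and the file naming scheme are assumptions.
REF_INDEX="hg19"
REF_FASTA="../hg19/hg19.fa"
THREADS=10

# Emit the command chain for one sample prefix (hypothetical helper).
sample_commands() {
    local s="$1"
    echo "bwa aln -t $THREADS $REF_INDEX -f $s.read1.sai $s.read1.fq"
    echo "bwa aln -t $THREADS $REF_INDEX -f $s.read2.sai $s.read2.fq"
    echo "bwa sampe -f $s.sam $REF_INDEX $s.read1.sai $s.read2.sai $s.read1.fq $s.read2.fq"
    echo "samtools view -S -b $s.sam > $s.bam"
    echo "samtools sort $s.bam ${s}_sort"
    echo "samtools mpileup -uf $REF_FASTA ${s}_sort.bam > $s.bcf"
    echo "bcftools view -vcg $s.bcf > $s.vcf"
}

# Print the chain for every sample prefix given on the command line.
for s in "$@"; do
    sample_commands "$s"
done
```

Saved as, say, make_cmds.sh (a hypothetical name), `bash make_cmds.sh sampleA sampleB` prints fourteen command lines; piping that output to `bash` would run the samples one after another. Since each sample's chain is independent of the others, the per-sample blocks could also be handed to separate jobs if the hardware allows.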