Hello there,
I have RNA-seq data and want to call variants. I plan to use the GATK pipeline described here:
http://gatkforums.broadinstitute.org/gatk/discussion/3891/calling-variants-in-rnaseq
The problem is that I have 3 samples, each with several datasets (3 paired-end runs per sample). For example:
1) Sample 1
PE 1
PE 2
PE 3
2) Sample 2
PE 1
PE 2
PE 3
3) Sample 3
PE 1
PE 2
PE 3
I want to end up with a single VCF file, but I'm confused about how to handle these files. I have some ideas, but I'm not sure, so your suggestions would help me :)
1) Merge all the PE runs together, so that I have one PE dataset per sample?
2) Map each run separately with STAR and add read group information?
3) At which point do I merge the samples?
OK, so the idea is to apply this step:
2) Add read groups, sort, mark duplicates, and create index
to each run of each sample (so 9 jobs in total),
and then use samtools merge to produce one SAM file.
Finally, apply the "3. Split'N'Trim and reassign mapping qualities" step to that unified SAM file?
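To make the plan concrete, here is a rough Python sketch that only builds and prints the command lines for the 9 per-run jobs and the merge, without running anything. The Picard arguments follow the legacy `I=`/`O=` syntax used in the linked GATK document. The sample names, run labels, file names, `picard.jar` path, `RGPL=illumina`, and using the sample name as the library are all placeholders I made up, and merging per sample (rather than everything into one file) is my assumption.

```python
import shlex
from itertools import product

SAMPLES = ["sample1", "sample2", "sample3"]  # hypothetical sample names
RUNS = ["PE1", "PE2", "PE3"]                 # hypothetical run labels


def per_run_commands(sample, run):
    """Build the two Picard commands for one run (step 2 of the plan)."""
    rg_id = f"{sample}_{run}"  # one read group per run, unique across all runs
    add_rg = [
        "java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
        f"I={rg_id}.sam", f"O={rg_id}.rg.bam", "SO=coordinate",
        f"RGID={rg_id}", f"RGSM={sample}",       # same RGSM for all runs of a sample
        f"RGLB={sample}", "RGPL=illumina",        # assumed library and platform
        f"RGPU={rg_id}",
    ]
    mark_dup = [
        "java", "-jar", "picard.jar", "MarkDuplicates",
        f"I={rg_id}.rg.bam", f"O={rg_id}.dedup.bam",
        "CREATE_INDEX=true", "VALIDATION_STRINGENCY=SILENT",
        f"M={rg_id}.metrics",
    ]
    return [add_rg, mark_dup]


def merge_command(sample):
    """Build a samtools merge command joining the three runs of one sample."""
    inputs = [f"{sample}_{run}.dedup.bam" for run in RUNS]
    return ["samtools", "merge", f"{sample}.merged.bam"] + inputs


# 3 samples x 3 runs = 9 per-run jobs
jobs = {(s, r): per_run_commands(s, r) for s, r in product(SAMPLES, RUNS)}

for commands in jobs.values():
    for cmd in commands:
        print(shlex.join(cmd))
for sample in SAMPLES:
    print(shlex.join(merge_command(sample)))
```

Because every run keeps its own RGID while sharing the sample's RGSM, the merged file still records which run each read came from.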
You can skip the SAM step entirely: have STAR write BAM output directly, then pre-process the STAR-aligned BAM files with GATK. Merge the BAM files for each sample, keeping their RG tags, run the GATK processing steps on each sample, and finally perform the variant calling.
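As a rough illustration of skipping the SAM step, the Python sketch below builds one STAR command per run that writes a coordinate-sorted BAM and sets the read group at alignment time. `--outSAMtype BAM SortedByCoordinate` and `--outSAMattrRGline` are STAR options; the index directory, thread count, FASTQ names, and `PL:ILLUMINA` are placeholders, and uncompressed FASTQ input is assumed. It only prints the commands.

```python
import shlex

SAMPLES = ["sample1", "sample2", "sample3"]  # hypothetical sample names
RUNS = ["PE1", "PE2", "PE3"]                 # hypothetical run labels


def star_command(sample, run, genome_dir="star_index", threads=4):
    """Build a STAR command that emits a sorted BAM with a read group line."""
    rg_id = f"{sample}_{run}"
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,                       # assumed index location
        "--readFilesIn", f"{rg_id}_R1.fastq", f"{rg_id}_R2.fastq",
        "--outSAMtype", "BAM", "SortedByCoordinate",     # BAM directly, no SAM
        "--outSAMattrRGline", f"ID:{rg_id}", f"SM:{sample}", "PL:ILLUMINA",
        "--outFileNamePrefix", f"{rg_id}.",
    ]


def star_bam(sample, run):
    """Name of the sorted BAM that STAR writes for the prefix used above."""
    return f"{sample}_{run}.Aligned.sortedByCoord.out.bam"


# BAM files to merge for each sample once alignment has finished
bams_per_sample = {s: [star_bam(s, r) for r in RUNS] for s in SAMPLES}

for s in SAMPLES:
    for r in RUNS:
        print(shlex.join(star_command(s, r)))
```

With the read groups already set by STAR, the three BAM files listed for each sample in `bams_per_sample` can be merged and passed on to the GATK steps.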