Dear Biologists and Bio-IT specialists,
based on your knowledge and experience, I'd like to know how many variants one should expect to discover from a .BAM file (final result, after annotation and filtering). This question is very general, so I'll make it more specific.
Let's take some NGS data (fastq) from the ENA database. I performed the whole analysis and here are my results:
WES data, paired-end, Illumina HiSeq 2000, material - cancer cell line. The software I used:
- FastQC - (before and after trimming adapters)
- Trimmomatic - trimming adapters
- Subread - mapping to genome - hg19
- Bamtools - filter, sort, index
- FreeBayes - variant discovery
- SnpSift and SnpEff - filtering and annotation of discovered variants.
After annotating the variants I ran:
(QUAL > 1) & (QUAL / AO > 10) & (SAF > 0) & (SAR > 0) & (RPR > 1) & (RPL > 1)
as the filtering argument of SnpSift. As a result I have a .vcf file with 32866 variants. Do you think this number is plausible for the criteria above?
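To make the criteria unambiguous, here is how I read that expression, written as a small Python sketch. It is only an illustration, not what SnpSift does internally: the function names are made up, the example record is invented, and I assume that for multi-allelic records only the first ALT allele's value is checked and that QUAL is numeric.

```python
def parse_info(info_field):
    """Turn 'AO=12;SAF=5;...' into a dict; flag-only keys map to True."""
    info = {}
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def first_value(info, key):
    """Value for the first ALT allele as a float (assumption: ignore other alleles)."""
    return float(str(info[key]).split(",")[0])


def passes_filter(vcf_line):
    """(QUAL > 1) & (QUAL / AO > 10) & (SAF > 0) & (SAR > 0) & (RPR > 1) & (RPL > 1)"""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])          # column 6 of a VCF record is QUAL
    info = parse_info(fields[7])     # column 8 is INFO
    ao = first_value(info, "AO")     # alternate allele observations
    return (
        qual > 1
        and ao > 0                   # guard against division by zero (my addition)
        and qual / ao > 10
        and first_value(info, "SAF") > 0   # ALT observations on the forward strand
        and first_value(info, "SAR") > 0   # ALT observations on the reverse strand
        and first_value(info, "RPR") > 1   # reads placed to the right
        and first_value(info, "RPL") > 1   # reads placed to the left
    )


# Invented record, only to show the call.
example = "chr1\t12345\t.\tA\tG\t250.0\t.\tAO=12;SAF=5;SAR=7;RPR=6;RPL=6\tGT\t0/1"
print(passes_filter(example))
```

So a variant is kept only if it has ALT support on both strands, reads placed on both sides, and enough quality per supporting observation.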
Why do I ask such a question?
A few days ago I got several ready-to-use .vcf files from a WES experiment on another cancer, this time from patients. The analysis was performed by a specialist I don't know. I was a bit confused because each .vcf file contains 300000-350000 variants with filter PASS. To be honest, I don't know the pipeline that produced these results; I only know that variant discovery was performed with SAMtools.
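For clarity, this is roughly how such a count can be obtained: every non-header record whose FILTER column is exactly PASS. The sketch below is a minimal illustration under that assumption; the function names and the demo records are made up, and the file helper assumes a plain or gzip-compressed text VCF.

```python
import gzip


def count_pass(lines):
    """Return (total records, records with FILTER == 'PASS') for VCF lines."""
    total = 0
    passed = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        total += 1
        if line.split("\t")[6] == "PASS":  # column 7 of a VCF record is FILTER
            passed += 1
    return total, passed


def count_pass_in_file(path):
    """Same count for a .vcf or .vcf.gz file on disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_pass(handle)


# Invented records, only to show the call.
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=20\n",
    "chr1\t200\t.\tC\tT\t12\tLowQual\tDP=4\n",
    "chr2\t300\t.\tG\tA\t99\tPASS\tDP=35\n",
]
print(count_pass(demo))
```

Note that this counts one record per line, so a multi-allelic site counts once.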
So my questions are:
Do you think that my analysis and the tools I used are correct? I'd like to avoid false negatives, where a mutation is present in the sample but absent from my resulting .vcf file.
How many variants do you usually get as the final result of a WES analysis?
Best regards,
Adam