In a recent experiment with a few thousand barcoded cells, I wanted to investigate common SNPs. I have a large BAM file containing all reads from all cells that passed quality control. I ran variant calling on it, followed by QC on the called variants, and saved the result as calls.vcf (the filtered VCF file contains ~70k sites).
I then split the large BAM file into one BAM file per cell and ran variant calling on each cell individually, using calls.vcf as my regions file. I now have a large number of VCF files (one per cell), each containing variant data for that cell in the specified regions. Using these VCF files, I would like to construct a SNP-cell matrix.
Is this possible using already released packages?
And what should this matrix look like?
Honestly, I am unsure! I think the nature of the variant isn't too important, only that it has a label. Then for each (barcode, SNP label) pair I would have a 0, 1 or 2: 0 would be homozygous reference, 1 would be heterozygous and 2 would be homozygous for the alternative allele (sorry if these labels aren't correct - I am a mathematician on a rotation project!). I think the idea would then be to perform some sort of dimensionality reduction on the (probably very sparse) matrix, followed by some sort of clustering.
The final two steps should be very easy once I have the matrix, and it should be possible to create the matrix through some clever scripting, but I just wondered whether there is a standardized way of doing this!
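The 0/1/2 encoding described above can be sketched in a few lines of plain Python. This is only an illustration, not an established package: the function names `gt_to_dosage` and `build_matrix`, the `chrom:pos:ref>alt` label format, and the choice of `None` for a missing or uncalled genotype are all assumptions made for the example. It also assumes diploid genotypes and counts any non-reference allele as "alternative".

```python
import re


def gt_to_dosage(gt):
    """Map a VCF GT string to the number of non-reference alleles (0, 1 or 2).

    Returns None when the genotype is missing or not diploid (assumption:
    missing calls are kept distinct from homozygous reference).
    """
    alleles = re.split(r"[/|]", gt)  # "/" is unphased, "|" is phased
    if len(alleles) != 2 or "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)


def build_matrix(calls_by_cell):
    """Build a SNP-by-cell matrix from per-cell genotype calls.

    calls_by_cell: {barcode: {snp_label: GT string}} (hypothetical layout).
    Returns (snp_labels, barcodes, rows), where rows[i][j] is the dosage of
    SNP i in cell j, or None if that cell has no call at that site.
    """
    barcodes = sorted(calls_by_cell)
    snp_labels = sorted({snp for calls in calls_by_cell.values() for snp in calls})
    rows = [
        [gt_to_dosage(calls_by_cell[barcode].get(snp, "./.")) for barcode in barcodes]
        for snp in snp_labels
    ]
    return snp_labels, barcodes, rows


# Toy input standing in for genotypes read from two per-cell VCF files.
example_calls = {
    "AAAC": {"chr1:100:A>G": "0/1", "chr2:200:C>T": "1/1"},
    "TTTG": {"chr1:100:A>G": "0/0"},
}
snp_labels, barcodes, rows = build_matrix(example_calls)
```

Keeping missing calls as `None` rather than 0 matters here because, with single-cell coverage, most sites will have no reads in most cells, and treating those as homozygous reference would dominate any later clustering.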
Don't you want a multi-sample VCF?
Or how about using https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php to merge all your VCFs?
I did consider merging my VCF files, using each cell as a separate sample. Do you know if this would allow me to perform subsequent dimensionality reduction and clustering? Or would the data have to be loaded into some sort of dataframe first? (I apologise, I am very new to bioinformatics in general...)
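On the dataframe question: a merged multi-sample VCF is still a text file, so it would need to be read into some in-memory table before any numerical analysis. The following is a minimal sketch of that step in plain Python, assuming a standard tab-separated VCF with a `#CHROM` header line, one sample column per cell, and a `GT` entry in the FORMAT column. The function name `read_genotype_matrix` and the `chrom:pos:ref>alt` row labels are made up for this example.

```python
def read_genotype_matrix(lines):
    """Parse multi-sample VCF lines into (samples, labels, matrix).

    matrix[i][j] is the number of non-reference alleles (0, 1 or 2) for
    site i in sample j, or None when the genotype is missing.
    """
    samples, labels, matrix = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample (cell) names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        row = []
        for sample_field in fields[9:]:
            parts = sample_field.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            alleles = gt.replace("|", "/").split("/")
            if len(alleles) != 2 or "." in alleles:
                row.append(None)  # missing or non-diploid call
            else:
                row.append(sum(allele != "0" for allele in alleles))
        labels.append(f"{chrom}:{pos}:{ref}>{alt}")
        matrix.append(row)
    return samples, labels, matrix


# Toy two-cell VCF used only to illustrate the expected input layout.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcellA\tcellB",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:0",
    "chr2\t200\t.\tC\tT\t60\tPASS\t.\tGT\t1|1\t0/0",
]
samples, labels, matrix = read_genotype_matrix(example_vcf)
```

The resulting list of rows can then be handed to whichever array or dataframe library is used for the dimensionality reduction and clustering; how to treat the `None` entries at that point is a separate modelling decision.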