Hi folks,
First of all, I'm an undergrad summer intern and this is my first time working in bioinformatics, so I have no idea what I am doing, and I am sure I will misuse many words in this post. I have been instructed to construct a variant caller that detects SNPs in a genome relative to a single reference genome. I am using BAM files from 10x Genomics sequencing (http://www.10xgenomics.com/technology). These BAM files assign barcodes to reads. Using this barcode information, I have been able to assign short reads to larger molecules (up to 700 kbp).
I understand the general Bayesian methods used to detect variants, such as those in the freeBayes variant caller (http://arxiv.org/abs/1207.3907). So basically, I have information that a bunch of reads belong to the same molecule, and thus the same chromosome. How can I use this information to help me detect SNP variants? I am happy to answer any questions about what I have written, or about 10x sequencing technology.
Any ideas or insight would be extremely helpful. If I could speak with or message somebody about this project, I would be very grateful. I am very lost and in over my head with this project.
Thanks, Will
Current title makes it sound like you are announcing a new variant caller in this post when you are actually looking for one for 10x genomics data. You should amend the title to reflect that need.
Good call, thank you!
Welcome to biostars. I assume you don't have to construct a new variant caller and are free to use an existing one. Comparing sequenced reads to the reference genome is indeed the common method to detect variants.
For this type of data specifically, this seems the most appropriate: http://www.10xgenomics.com/software/, though it employs GATK and/or FreeBayes.
Thanks! Unfortunately, I do have to write my own variant caller. I'm sure I can leverage existing callers such as freeBayes, but I somehow need to incorporate the barcode information I have generated so far. I'm sure it doesn't have to be state of the art or super efficient, but I was assigned to create my own variant caller for 10x data.
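To make the barcode idea concrete, here is a minimal Python sketch of one possible use, not an established method: reads already assigned to the same molecule should all carry the same allele at a site, so they can be collapsed into a single call per molecule before counting evidence. The function name, the `molecule_id` values, and the example reads are all hypothetical, and the input is assumed to be plain (molecule_id, base) pairs that have already been extracted from the BAM file.

```python
from collections import Counter, defaultdict


def collapse_by_molecule(observations):
    """Collapse per-read base calls at one site into one call per molecule.

    observations: iterable of (molecule_id, base) pairs, one per read
    covering the site. molecule_id is assumed to come from an earlier
    barcode-based grouping step (hypothetical here).
    Returns a dict mapping molecule_id -> majority base among its reads.
    Molecules whose reads tie between bases are dropped as ambiguous.
    """
    by_molecule = defaultdict(Counter)
    for molecule_id, base in observations:
        by_molecule[molecule_id][base] += 1

    consensus = {}
    for molecule_id, counts in by_molecule.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top two bases: no clear consensus
        consensus[molecule_id] = ranked[0][0]
    return consensus


# Made-up reads at a single site: (molecule_id, observed base)
reads = [
    ("M1", "A"), ("M1", "A"), ("M1", "G"),  # M1: majority A
    ("M2", "G"),                            # M2: single read, G
    ("M3", "A"), ("M3", "G"),               # M3: tie, dropped
]
molecule_calls = collapse_by_molecule(reads)
allele_counts = Counter(molecule_calls.values())
print(molecule_calls, allele_counts)
```

Counting one vote per molecule rather than one per read treats disagreement within a molecule as likely sequencing error. Whether that helps in practice depends on how often several reads from one molecule overlap the same site, which this sketch does not address.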
It is my understanding that barcodes are only for compartmentalizing initial data (which must have been done by 10x software already). I am not sure what kind of barcode information you are trying to incorporate in the variant calling.
There is an existing software suite that does most of what you have been tasked to do. This software is free (though not open source). Loupe visualization software requires a license.
This is an ambitious project; you're going to want to take a step back and do it in steps. I'm assuming you know something about programming, and here are some steps that might help:
You will be lost following the freebayes source code unless you have lots of experience in C++. Erik Garrison is a knowledgeable and practiced developer, and it's not easy to follow well-developed code with no prior experience of the language standards or methods employed in the field.
Thank you very much Steven. I am experienced in programming and C++, and have been working on this project for a while. It is the genomics/bioinformatics part that I have no experience in. I have already done the first two steps you mentioned, and will take your advice for steps 3 and 4. I appreciate your advice!
Good stuff. Since you're farther along, some other papers to check out are the Samtools statistics paper, and this Nature paper (statistical section). Those, along with freebayes, are a good overview of current methods.
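For anyone new to these methods, the core per-site calculation behind Bayesian callers can be sketched roughly as below. This is a much-simplified diploid, biallelic model in Python and not the haplotype-based model freebayes actually implements. The function names, the prior values, and the example pileup are all made up for illustration, and each observation is assumed to be independent with an error probability taken from its Phred quality.

```python
import math


def genotype_log_likelihoods(observations, ref, alt):
    """Log-likelihood of each diploid genotype given base observations.

    observations: list of (base, phred_quality) pairs at one site.
    Assumes independent reads and that an erroneous call is equally
    likely to be any of the three other bases.
    """
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    result = {}
    for name, alleles in genotypes.items():
        total = 0.0
        for base, qual in observations:
            # Phred -> error probability, capped so probabilities stay > 0
            err = min(10 ** (-qual / 10), 0.75)
            # Each read samples one of the two alleles with probability 1/2
            p = sum((1 - err) if base == a else err / 3 for a in alleles) / 2
            total += math.log(p)
        result[name] = total
    return result


def genotype_posteriors(log_likes, priors):
    """Combine log-likelihoods with priors and normalise (Bayes' rule)."""
    weighted = {g: ll + math.log(priors[g]) for g, ll in log_likes.items()}
    top = max(weighted.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in weighted.values())
    return {g: math.exp(v - top) / norm for g, v in weighted.items()}


# Made-up pileup: five reads show A, five show G, all at Phred 30
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
# Illustrative priors only; real callers derive these from a population model
priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}

log_likes = genotype_log_likelihoods(pileup, ref="A", alt="G")
post = genotype_posteriors(log_likes, priors)
best = max(post, key=post.get)
print(best, post)
```

With an even split of reference and alternate bases, the heterozygous genotype ends up with the highest posterior even under a prior that strongly favours the reference, because either homozygous genotype would have to explain five reads as errors.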