Hello, I need to collect a set of ~330,000 SNPs from a set of extended-VCF files (available here and here. The only information I have on these SNPs are there dbSNP rs#s.
I've tried to do this using vcftools, but vcftools f'ed up the extended format aspects of the VCF files, and thus the output was useless.
I have been told by one of the individuals who generated these original files that I absolutely cannot use vcftools or GATK on these files because they break the extended format. I have been told that I need to use a "a simple VCF parser (SAMtools library/tabix implementation, e.g. pysam)", but I am a noob and don't know how to do this or most of what this means. I've asked my Python/computational biology professor for help on this before, and we could not figure this out. There is more filtering beyond just filtering rsids (e.g., modernC-archaicT or modernG-archaicA sites), but I am holding off on that for now, since I want to get the simple task done first.
I need to get some Pysam python code to use as a template to filter these vcf files. Can someone please help. Thanks!
-Deven
For any given rsid, I need corresponding the whole line, the code looks like it will just print the whole line for every line it reads through. Do you know how I can get it to only print the line if it contains one of the rs ids?
The two sets of VCF files are here and here and they are broken up by chromosome. First, I need to go through all 22 autosomes for each and pare it down, so it is just covering the lines corresponding the ~330,000 rsids. Then I need to further filter out case where modern humans have a reference C or G and the archaic human has a T or A (respectively). This part gets confusing in how the reference/alternate coding works in the extended literature, so I am holding that filtering off until I've re-read through the methods literature.
The files for the first link are quite big, and the second one does not work.
Try
print line["rsId"]
or whatever this column is named, that should give you the rs id for the current line. I am not sure what you mean by "confusing in how the reference/alternate". Couldn't you just use the ref given to you by dbSNP or the reference human genome?http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/
print line["rsId"]
would just get me the rsid for the line, what I need is for it the go through the sets of file and copy all of the line for only the lines containing the ~330,000 rsids.I would imagine would need an if statement after
for line in f:
saying if line contains one of my rsid then print line, but my Python is really bad, and I don't know how to format that if statement. That is what I need help with at the present.In terms of the reference sequence, that I can trust, but the alternative modern human allele data in the Neanderthal file (and in certain cases in the Denisovan file) is not actually stored in the alt column, they are stored in the INFO field under 1000gALT. Just ignore anything about filtering beyond the rsids for now, because I am still trying to figure out what I need to filter in the first place.
The parser works the following way:
If the field you are looking for is on the first level, like pos, you do
line["pos"]
, if it's in 'INFO'line["INFO"]["rsId"]
, if it's in 'FILTER'line["FILTER"]["rsId"]
. If there's no value you will get aNone
. So it's simplyTo get a list of files in a folder, take a look a the glob module of python:
glob.glob
will get you the list of all files in a directory.For the human reference, the answer is a little bit longer. I would suggest that you import the human reference genome into pyGeno(quickstart), retrieve the chromosome and then simply do
allele = chromosome.sequence[pos]
. Where position is the position of your SNP, that should give the allele. Then it's simply a matter of:Hope that helps
You are understanding anything that I am saying.
I do not need any specific fields. I need entire lines, but I only need entire lines which contain specific rsids. The code you were trying to give me would copy any line that had any rsid irrespective of whether or not I wanted that rsid. The code you have been giving is not what I need.
For example, in every line of the file there is an ID column, where the rsid (if the site has one) is listed. What I need the code to do is to look in that ID column, and if the ID column contains one of the ~330,000 rsids, then it will print that line to an outfile. This is what I need, but I do not have a bloody clue how to write that if statement.
The
if
statement within thefor
is what I need help writing right now. Can you please help me code that?