Morning Gang,
I feel like I'm just missing something in the documentation here, but here goes. I'm currently running some Salmonella isolates through the works using a very well annotated reference, and we have come across a large dropout of genes in one location. We went back and confirmed the genes are not there with a quick PCR, so I know the data are correct; however, when I go to make the consensus chromosome I cannot figure out how to include the dropout in the file.
For the mapping I'm using TMAP (Ion Torrent data, single-end reads), freebayes for the haploid VCF creation, and SnpEff for the SNP annotation.
I know that if I want to create a new chromosome for our sample I can use the VCF file to make the SNP changes in the reference file; however, that will retain the genes that have dropped out, because I'm making the changes to the reference, not to the alignment.
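To make that concrete, this is roughly what I mean by making the SNP changes in the reference (just an untested sketch with placeholder file names, single-record fasta, simple substitutions only), and it shows why the dropped region would come along for the ride:

## sketch: apply simple SNP substitutions from a VCF straight onto
## the reference sequence (placeholder file names; indels and
## multi-allelic sites are ignored)
with open("reference.fasta") as fh:
    header = fh.readline().rstrip()
    seq = list("".join(line.strip() for line in fh))

with open("sample.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        pos, ref_allele, alt_allele = int(fields[1]), fields[3], fields[4]
        if len(ref_allele) == 1 and len(alt_allele) == 1 and alt_allele != ".":
            seq[pos - 1] = alt_allele        ## VCF positions are 1-based

with open("consensus_snps_only.fasta", "w") as out:
    out.write(header + "\n")
    out.write("".join(seq) + "\n")
## any region with zero coverage simply keeps the reference bases,
## so the dropped-out genes are still present in this "consensus"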
Does anyone have a suggestion or a workaround for this type of problem?
Thank you in advance, Sean
EDIT 15MAR2016
So I have been working on this for some time now and I'm close to a solution, but I'm having an issue with the "new" sequence. Below is my code.
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

## build a dict from the samtools depth output, keyed by genome
## position (int) with the depth of coverage (int) as the value
coverageDict = {}
coverage = open("/home/sbrimer/Desktop/Columbia ST/1819_1/coverage.txt","r")
for line in coverage:
    coverages = line.split("\t")
    coverageDict[int(coverages[1])] = int(coverages[2])
coverage.close()

## make a set of just the positions with zero coverage
missing = {index for index, value in coverageDict.items() if value == 0}

def filter_coverage(index, value):
    return 'N' if index in missing else value

## read in the reference sequence, replace the nt at every
## zero-coverage position with "N", then write the new record out
ref = open("Salmonella enterica subsp.fasta","rU")
for seq_record in SeqIO.parse(ref,"fasta"):
    refSeq = seq_record.seq
    new_seq = "".join(filter_coverage(index, value)
                      for index, value in enumerate(refSeq, start=1))
    new_rec = SeqRecord(Seq(new_seq), id="new_id", description="new_description")
    SeqIO.write(new_rec, "NewSeq.fasta", "fasta")
ref.close()
Everything seems to work fine until I want to add the 'N's to the fasta file: I can get the generic "new_id" and "new_description" to show up in the test file, but not the new sequence. I have also tried the following:
first this
refSeq = seq_record.seq
refSeq_mut = refSeq.tomutable()
then
refSeq = str(seq_record.seq)
but for whatever reason I cannot seem to change the seq_record.seq part of the record. Has anyone else had this issue?
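To be explicit, the pattern I expected to work (reusing the filter_coverage function and imports from the code above; just a sketch, not my full script) was:

## what I expected to be able to do: build the masked string and
## simply assign it back onto the existing record
new_seq = "".join(filter_coverage(index, base)
                  for index, base in enumerate(str(seq_record.seq), start=1))
seq_record.seq = Seq(new_seq)          ## SeqRecord.seq is an ordinary attribute
seq_record.id = "new_id"
seq_record.description = "new_description"
SeqIO.write(seq_record, "NewSeq.fasta", "fasta")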
You could replace the dropped region with a stretch of N's to keep the length consistent.
I thought about that as well. I know that in samtools I can use mpileup to make a consensus that uses "N" as the placeholder, and I thought about remapping to that and using the existing VCF file to annotate. I guess my biggest issue is that I cannot locate the coordinates of the dropout without using a genome viewer.
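What I've sketched out so far for pulling those coordinates straight from the samtools depth output (untested, and it assumes a chrom/pos/depth three-column file) looks like this:

## sketch: collapse per-base depths from samtools depth into
## zero-coverage intervals so the dropout coordinates can be read
## off directly instead of hunting for them in a genome viewer
intervals = []                  ## (chrom, start, end), 1-based inclusive
chrom = start = prev = None
with open("coverage.txt") as depth:
    for line in depth:
        c, pos, d = line.rstrip("\n").split("\t")[:3]
        pos, d = int(pos), int(d)
        if d == 0:
            if start is None or c != chrom or pos != prev + 1:
                if start is not None:
                    intervals.append((chrom, start, prev))
                chrom, start = c, pos
            prev = pos
        elif start is not None:
            intervals.append((chrom, start, prev))
            start = None
if start is not None:
    intervals.append((chrom, start, prev))

for c, s, e in intervals:
    print("%s\t%d\t%d" % (c, s, e))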
Good edit! Before, you didn't mention Biopython, if I remember correctly. Please see my hints for improving the post below (no offence intended):
The XY problem: is adding N's really a good solution (maybe it was a very good idea by the commenter)? N's signify that there is some sequence there, but that it is unknown what it is; if it is a genuine deletion, it might be better to simply remove the bases.
It is very good that you included your code. Could you also provide a reproducible example, e.g. create a small dataset including the coverage file and a fasta file? If everyone can run your code with ease, then more people will run your example. The more self-contained the example (no downloading of files), the more people will simply copy-paste and debug it for free.
(You do not open or close the output file; is that not necessary? It looks like you are overwriting a single record over the existing file in each iteration. But I am not a python expert, python files might be smarter than perl files ;) )
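Something like this is what I mean: collect the masked records and write them out once at the end (a sketch only, reusing your filter_coverage, and again I am not a python expert):

## sketch: build all the masked records first, then write them in one
## go so each record does not overwrite the previous output file
new_records = []
for seq_record in SeqIO.parse("Salmonella enterica subsp.fasta", "fasta"):
    new_seq = "".join(filter_coverage(index, base)
                      for index, base in enumerate(str(seq_record.seq), start=1))
    new_records.append(SeqRecord(Seq(new_seq), id=seq_record.id,
                                 description=seq_record.description))
with open("NewSeq.fasta", "w") as out:
    SeqIO.write(new_records, out, "fasta")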
Thanks Michael, I really appreciate the input!
Thank you again :)