5.4 years ago
tmrhyd
I have a FASTA and a VCF of a plant genome, for sequence A and sequence B. Sequence A is the wild type, and sequence B is an improved, mutated version of it. I have been comparing the variants of the two sequences in IGV, but I need to annotate the genes. I have the variants annotated with SnpEff, but I need the genes, not the variants.
What should I use? I submitted the FASTA to MAKER, but it's taking forever, and DAVID doesn't accept VCF or FASTA.
Thank you so much for your help!
I mean this in the best way possible, but based on this question and your previous questions, you have no idea what you are doing. We are happy to help with specific questions, but what you need is someone to sit next to you and think through what you need to do and what you need to get. This question, like your previous ones, just doesn't make sense at all. DAVID does something very different from "annotating plant genes", so it's no wonder that it doesn't accept VCF or FASTA.
Did you assemble the plant yourself? Is there a reference genome available? How did you use SnpEff (add commands)?
No, I honestly need a guide or basic overview. I'm really lost and just trying to get Manhattan plots from the data given to me. Do you know of any good resources for a basic overview on the genomics pipeline creation process?
I am so sorry about the wording of my questions, I didn't know how else to word them.
I did assemble the plant, if my understanding of assembly is correct. I got the read data from Illumina, trimmed it myself, aligned it against the reference genome, converted it to BAM, marked duplicates, sorted, and indexed it. I then converted the files into VCF and am trying to get the read data into Manhattan and Q-Q plots. So yes, there is a reference genome available.
My SnpEff commands for annotating the variants are listed below
I really am sorry for wasting anyone's time, I'm just kind of stuck, and am really trying to figure it out.
Please let me know if I just need to read something or learn something!
Thank you -
Thanks for explaining, that helps.
I believe we have said this before: you cannot do a statistical test with just two samples, and therefore there is no sensible way for you to make a Manhattan plot or Q-Q plot out of this data.
Think about it: if you have 50 differences between sample A and sample B, how will you say which one is more likely to be a functional difference (associated with your phenotype)? In a genome-wide association study, commonly thousands of individuals are used to say something about the frequency of a variant in group A compared to group B. It just doesn't work that way if you have only 2 samples.
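To make that concrete, here is a minimal sketch that runs a per-variant association test on made-up counts. It assumes a simple carrier/non-carrier comparison with Fisher's exact test, which is only one of several tests a real GWAS might use; the function name and all numbers are hypothetical and only there for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups (e.g. improved vs. wild type),
    columns are "carries the variant" vs. "does not carry it".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x carriers in group 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are as likely or less likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# Two samples: the improved plant carries the variant, the wild type does not.
# Every one of the 50 differences produces this same table.
p_two_samples = fisher_exact_two_sided(1, 0, 0, 1)

# Hypothetical GWAS-sized cohort: 600 of 1000 carriers in one group, 400 of 1000 in the other.
p_cohort = fisher_exact_two_sided(600, 400, 400, 600)

print(p_two_samples)  # 1.0
print(p_cohort)       # a very small p-value
```

With one plant per group, the p-value is 1.0 for every variant, so there is nothing to rank or plot. Only with many individuals per group do the counts differ between variants and the test becomes informative.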
That's not assembly then, but it doesn't really matter in this context. What you did is an alignment and variant calling. Assembly would be the procedure to create a new (reference) genome using only the sequenced reads.
Doesn't the fact that each sample has hundreds of variants provide the data set for the statistical test, so I can get the p-values and therefore the Manhattan plot?
Thank you so much for your help!!
No, the Manhattan plot shows the p-value per variant/position. You test each variant separately, and you would also correct for multiple testing.
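As a rough sketch of what that means in practice: each variant gets its own p-value, the plot shows -log10(p) against position, and the significance line is adjusted for the number of tests. The example below uses a Bonferroni correction, which is just one common choice and an assumption here; the variant list, p-values, and number of tests are invented for illustration.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def manhattan_points(variants):
    """Turn (chrom, pos, p_value) tuples into (chrom, pos, -log10(p)) points."""
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in variants]

# Hypothetical per-variant p-values, each from its own test across many individuals
variants = [
    ("chr1", 1200, 0.04),
    ("chr1", 5400, 1e-9),
    ("chr2", 800, 0.5),
]

n_tests = 1_000_000  # hypothetical number of variants tested genome-wide
threshold = bonferroni_threshold(0.05, n_tests)

points = manhattan_points(variants)
significant = [(chrom, pos) for chrom, pos, p in variants if p < threshold]

print(threshold)    # corrected per-variant cutoff
print(points)       # y-values for the Manhattan plot
print(significant)  # variants passing the corrected cutoff
```

Note that 0.04 would look "significant" at the usual 0.05 level, but it does not survive the correction once a million variants are tested.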