Hi all, I have a few queries regarding GBS using TASSEL UNEAK Pipeline. This post might be a bit lengthy.
I have data for 100 samples of Withania somnifera, which does not have a reference genome. It was therefore decided to use the UNEAK pipeline for GBS, since UNEAK does not require one.
Using this approach I arrived at a HapMap file with 1,564,918 entries. I believe it contains a lot of false positives and want to filter them out. The most intuitive approach was to eliminate every entry containing an N (missing value). With a Perl one-liner, only 60 entries remained. That means only 60 of the 1,564,918 entries have a value at the locus for all 100 samples.
This has put me in a fix. Q1- I would appreciate it if someone could provide a one-liner or code to filter out entries with "N" above 6 (only entries which have at least 6 values should be printed).
The format is:
rs# contains the SNP identifier;
alleles contains the SNP alleles according to the NCBI database dbSNP;
chrom contains the chromosome that the SNP was mapped to;
pos contains the position of the SNP on that chromosome;
strand contains the orientation of the SNP in the DNA strand, which can be forward (+) or reverse (-) relative to the reference genome;
assembly# contains the version of the reference sequence assembly (from NCBI);
center contains the name of the genotyping center that produced the genotypes;
protLSID contains the identifier for the HapMap protocol;
assayLSID contains the identifier for the HapMap assay used for genotyping;
panelLSID contains the identifier for the panel of individuals genotyped;
QCcode contains the quality control code for each entry.
These columns are followed by the alternate allele for each of the 100 samples.
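For Q1, the kind of filter I have in mind could be sketched in Python roughly as below. This is only a sketch under assumptions: that the file is tab-delimited, that the first line is a header, that the 11 columns listed above come before the sample columns, and that a missing genotype is written as a single "N". The function names and file names are made up for illustration. The threshold is a parameter, so it can be set to 6 or anything else.

```python
import sys

# Assumed layout: 11 metadata columns (rs# ... QCcode), then one column per sample.
N_META = 11
MISSING = "N"  # assumed code for a missing genotype


def count_called(fields, n_meta=N_META, missing=MISSING):
    """Return how many sample columns hold something other than the missing code."""
    return sum(1 for genotype in fields[n_meta:] if genotype != missing)


def filter_hapmap(lines, min_called=6):
    """Yield the header line, then every entry with at least min_called values."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return  # empty input
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if count_called(fields) >= min_called:
            yield line


# Hypothetical usage: python filter_hapmap.py input.hmp.txt 6 > filtered.hmp.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(filter_hapmap(handle, int(sys.argv[2])))
```

To cap the number of N values per entry instead, the same count can be turned around: with 100 samples, "at most 6 N" is the same as min_called=94.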
Q2- I do not have the above-mentioned QCcode values in the file. The column is present but empty. (I assume this is because the analysis is de novo and there is no reference to calculate the quality against.) Can someone suggest any meaningful filters that can be applied in this case?
Q3- Does anyone have access to a paper where the UNEAK algorithm is explained?
Any meaningful suggestion is welcome even apart from these questions.
I don't know if there is a paper, but this manual describes it.