Hi there!
I'm looking for someone to help me impute genotype data using either IMPUTE2 or the Sanger Imputation Service. I'm open to any other tool that's easily reproducible too.
These patient samples are not genotyped, and the SNPs (in VCF format) were called from RNASeq data (50bp SR) using the GATK method.
I'm offering USD 50.00 for showing me how to run imputation successfully by using an actual VCF file from my dataset. I'll send you an email from my institutional ID, detailing how the data was processed and steps I've taken on my own to troubleshoot it. I haven't used impute2 and have mostly worked with Sanger's online resource.
If you're interested, please PM me and I'll email you from my institutional email. I can pay via paypal/btc/bch/eth.
Thanks!
I would not spend too much of my time (and money, apparently) with variant calls from RNA-seq using GATK's method unless I was absolutely sure of the pitfalls / limitations of such data:
A: Inferring genotype based on RNA sequences
What is it that you are ultimately aiming to achieve?
Hi Kevin!
Thanks so much for responding. I really appreciate your concern regarding the robustness of the variant calls from RNASeq data. I guess what I'm really trying to do is troubleshoot the imputation, which I can hopefully apply to future studies.
I wanted to try my hand at real data, and the only real data I have on hand is RNASeq data. This dataset is only 30 patients (RNASeq), but if I can 'conclude' this, I would be able to work with the other 400 matched (control and treatment) patient samples, which I can send for DNASeq.
You want to impute RNA-seq variant calls? That would only really work for SNPs flanking the 5' and 3' regions of each gene. One of the commonest imputation methods is indeed IMPUTE2, which you mention.
Nobody here wants your money. We just offer free advice here, on our own time. Why not just post the commands / things that you have already tried? If you want to give the money away, then you could choose a good charity.
Ok! I'll keep trying. I've tried every troubleshooting step that was suggested. I don't know where to get DNASeq data to practice with.
EDIT:
I started with the GATK method for variant calling.
Step 7 of the pipeline generates my filtered VCF files:
Following this, I zip and index my VCF into vcf.gz
For Sanger Imputation Service, we convert the UCSC-style chromosome names to Ensembl-style chromosome names by running:
Going back to the absolute beginning, when I was running DESeq2 on my RNASeq data, I had mapped it to GRCh38. Since Sanger requires the coordinates to be on GRCh37, I remapped my data to GRCh37.p13 from Gencode and repeated variant calling. While building the initial index the parameters were as shown below
I used sjdbOverhang 49 because my RNASeq reads are 50bp (read length - 1). The gtf file used "contains the comprehensive gene annotation originally created on the GRCh38 reference chromosomes, mapped to the GRCh37 primary assembly with gencode-backmap"
The ref dictionary was created:
For the past few days, I've been stuck on this error and I can't get past it.
I know this doesn't look like much, but it took me weeks to get this far. Most of my bioinformatics training comes from Biostars, SEQanswers and r/bioinformatics.
When and IF I have the DNASeq data, I want to do an eQTL analysis.
If the error is the major blocker for you, then please have a look at my answer to your previous question.
Thanks Michael. I missed that previous post. sandKings, open access NGS data can also be downloaded from ENA - a tutorial is here: Fast download of FASTQ files and metadata from the European Nucleotide Archive (ENA)
Also note, sandKings, that (generally) data produced by GATK's methods frequently have issues in terms of compatibility with other methods. The GATK has its own support forum - you may want to go there.
Hi Kevin, I've sought help at GATK too but I'm afraid my incessant posts are wearing everyone down. Anyway, I'm following up on Michael's advice and trying to find the final build for GRCh37. I'm insisting on using GRCh37 because I find Sanger's interface more 'noob' friendly.
Okay, don't worry about the GATK - they don't appear to integrate that much with the 'community', which indirectly renders many of their tools incompatible with the standard ones. You will always be welcome here. I presume that you have been working your way through This, and are stuck on this point:
If you look at your VCF headers (bcftools view), can you give a list of all the contigs present? For Sanger Imputation, your contigs have to be named according to this: ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz.fai
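A minimal sketch of pulling the contig names out of a VCF header in Python, as an alternative to reading the bcftools view output by eye. It assumes the header lines are already available as plain text. The function names, the toy header, and the short list of expected names (1-22, X, Y and MT, i.e. only the primary chromosomes, not the full hs37d5 index) are illustrative assumptions, not part of any tool.

```python
import re

# Assumption: only the primary chromosome names are checked here; the full
# hs37d5 index linked above lists additional contigs as well.
EXPECTED_PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}

CONTIG_LINE = re.compile(r"^##contig=<ID=([^,>]+)")


def list_contigs(header_lines):
    """Return contig IDs from '##contig=<ID=...>' header lines, in file order."""
    contigs = []
    for line in header_lines:
        if not line.startswith("#"):
            break  # first data record reached, header is over
        match = CONTIG_LINE.match(line)
        if match:
            contigs.append(match.group(1))
    return contigs


def unexpected_contigs(contigs):
    """Return the contig names that are not in the expected primary set."""
    return [name for name in contigs if name not in EXPECTED_PRIMARY]


# Toy header (hypothetical), mixing UCSC-style and Ensembl-style names.
example_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=249250621>",
    "##contig=<ID=2,length=243199373>",
    "##contig=<ID=chrM,length=16571>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]

found = list_contigs(example_header)
flagged = unexpected_contigs(found)
print(found)
print(flagged)
```

For a compressed file, the header lines could be read with gzip.open(path, "rt"), since bgzipped files are also valid gzip streams.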
Hi Kevin, sorry for the delay in getting back to you. My VCF header (?) looks like this before the UCSC to Ensembl conversion
The complete list of contigs is too long and is exceeding the character limit for the post. However, apart from the standard contigs,
the list also contains
And the columns look like this:
Then, I convert it to the Ensembl format using the command
which reformats my columns to:
I see all these other 'J' and 'K' contigs in my file. Following Michael's suggestion here I tried to find what appeared to be the final version of GRCh37 from Gencode and downloaded the primary assembly (GRCh37) file and Comprehensive gene annotation GTF file. I created the new STAR index:
and then proceeded with the rest of the GATK pipeline:
Unfortunately, now I'm getting this error:
Trying to solve this 'sorting' error using java -jar picard.jar ReorderSam took me down another rabbit hole, so I'm taking a break and will start from the top.
With your ensembl.vcf.gz, you may not need to do anything else with the GATK apart from tab-indexing the file in an attempt to fix the VCF header, with:
NB - it should be zipped with bgzip, not gzip
Have you tried to use that file for Sanger Imputation?
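On the bgzip point above: both tools produce files ending in .gz, so the extension alone does not tell them apart. Here is a rough Python sketch of a check, assuming the usual BGZF layout in which each block starts with a gzip header that has the extra-field flag set and a 'BC' subfield at byte offset 12. The function names are hypothetical, and the check only inspects the first 16 bytes of the file.

```python
import gzip

# 28-byte empty BGZF block that bgzip writes as an end-of-file marker;
# used here only as a known-good BGZF header for the demonstration.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def looks_like_bgzf(first_bytes):
    """Heuristic: gzip magic with FEXTRA set, plus the 'BC' extra subfield."""
    return (
        len(first_bytes) >= 16
        and first_bytes[:4] == b"\x1f\x8b\x08\x04"
        and first_bytes[12:14] == b"BC"
    )


def file_looks_like_bgzf(path):
    """Apply the same heuristic to the first 16 bytes of a file on disk."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(16))


# Plain gzip output (as from the gzip tool or Python's gzip module)
# has no BGZF extra field, so it fails the check.
plain_gzip = gzip.compress(b"##fileformat=VCFv4.2\n")

print(looks_like_bgzf(BGZF_EOF))
print(looks_like_bgzf(plain_gzip))
```

If file_looks_like_bgzf returns False for a .vcf.gz file, the file was probably compressed with plain gzip and would need to be decompressed and re-compressed with bgzip before indexing.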
---------------------------------------
It also looks like there is some conflicting information in the STAR indices and the actual VCF contig names. For Sanger Imputation, you literally just need your contigs named like This.
If you still have the 'chr' prefix on your contigs and need to sort these numerically, these commands work:
The tabix command will update your VCF header with the new contig names. You can technically remove the old contig names from the VCF header without problem. With that, for all intents and purposes, the file should be ready for Sanger Imputation Server.
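The renaming and numeric sorting described above can also be sketched in plain Python, which may help when checking what the result should look like on a small test file. This is only an illustration under stated assumptions: the function names are hypothetical, the VCF is handled as uncompressed text lines, chrM is renamed to MT (a change of name only), and the ##contig header lines are left untouched.

```python
def ucsc_to_ensembl(name):
    """Map a UCSC-style contig name (chr1, chrX, chrM) to Ensembl style."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def contig_sort_key(name):
    """Order contigs as 1..22, X, Y, MT, then anything else alphabetically."""
    special = {"X": 23, "Y": 24, "MT": 25}
    if name.isdigit():
        return (0, int(name), "")
    if name in special:
        return (0, special[name], "")
    return (1, 0, name)


def rename_and_sort(vcf_lines):
    """Return (header_lines, record_lines) with renamed, sorted records."""
    header = [line for line in vcf_lines if line.startswith("#")]
    records = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        fields[0] = ucsc_to_ensembl(fields[0])
        records.append(fields)
    # Sort by contig order first, then by numeric position within a contig.
    records.sort(key=lambda f: (contig_sort_key(f[0]), int(f[1])))
    return header, ["\t".join(f) for f in records]


# Toy input (hypothetical records), deliberately out of order.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr10\t100\t.\tA\tG",
    "chr2\t500\t.\tC\tT",
    "chr2\t50\t.\tG\tA",
    "chrM\t10\t.\tT\tC",
    "chrX\t5\t.\tA\tC",
]

header, sorted_records = rename_and_sort(example)
for record in sorted_records:
    print(record)
```

A plain string sort would place 10 before 2, which is why the sort key converts numeric contig names and positions to integers.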
Also, to set the correct reference genome base in your VCF, you do not need the fixref plugin. The following will set these in your VCF:
--check-ref x will eliminate the non-matched ref sites, which may be better. Also, here, I use human_g1k_v37.fasta as the reference, which is hg19 updated with allele information from 1000 Genomes Phase III. For Sanger Imputation, you may want to use: hs37d5.fa
Hopefully, with all of this information, you can actually get your work done.
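To illustrate what dropping non-matched ref sites means in practice, here is a small Python sketch of the idea. It is not how bcftools implements it. It assumes the reference is held in memory as a dict of contig name to sequence, that records are simple (chrom, pos, ref, alt) tuples with 1-based positions, and the function name is hypothetical.

```python
def drop_ref_mismatches(records, reference):
    """Keep only records whose REF allele matches the reference sequence.

    records:   iterable of (chrom, pos, ref, alt) with 1-based pos
    reference: dict mapping contig name -> sequence string
    """
    kept = []
    for chrom, pos, ref, alt in records:
        sequence = reference.get(chrom, "")
        start = pos - 1  # convert the 1-based VCF position to a 0-based index
        observed = sequence[start:start + len(ref)]
        if observed.upper() == ref.upper():
            kept.append((chrom, pos, ref, alt))
    return kept


# Toy reference and records (hypothetical values).
toy_reference = {"1": "ACGTACGT"}
toy_records = [
    ("1", 1, "A", "G"),   # matches: base 1 is A
    ("1", 2, "G", "T"),   # mismatch: base 2 is C
    ("1", 3, "GT", "G"),  # matches: bases 3-4 are GT
    ("2", 1, "A", "C"),   # contig missing from the reference, dropped
]

kept_records = drop_ref_mismatches(toy_records, toy_reference)
print(kept_records)
```

A record on a contig that is absent from the reference is dropped here as well, which is one more reason the contig names in the VCF and the reference FASTA need to agree.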