Hey Stars!
I have a really confounding issue. I am extracting the upstream regions of genes from 100 different A. thaliana genomes. The problem is that only one of them, the TAIR10 reference, has an annotation (GTF/GFF); the rest are consensus builds generated from VCF files and have no annotation data available:
cat Arabidopsis_thaliana.TAIR10.55.dna.toplevel.fa | vcf-consensus some.vcf.gz > vcf.fa
I have extracted the upstream regions of some target genes from the reference genome using RSAT:
rsat retrieve-seq -org Arabidopsis_thaliana.TAIR10.55 -feattype gene -type upstream -format fasta -label id,name -from -2000 -to -1 -noorf -i Genes.txt -o ups.fa
Next, using the upstream coordinates from the step above, I extracted the sequences from the rest of the genomes (the consensus builds). However, when I compare the consensus-extracted upstream sequences with the reference upstream sequences and with their respective positions in the original VCFs, they do not match up. I think this is because indels in the VCFs shift the coordinates of the consensus sequences relative to the reference. I am looking for any suggestions or methods to extract the regions corresponding to the reference upstream sequences (including alternate-allele insertions) from the VCF-derived genomes.
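To make the suspected coordinate drift concrete, here is a minimal Python sketch of how I think reference coordinates would have to be shifted to land on the right bases in a consensus build. It is only an illustration under assumptions: the function names `build_offsets` and `ref_to_consensus` are made up, the variants are toy values for a single chromosome, and it assumes every variant was applied to the consensus, with sorted, non-overlapping records and a single ALT allele each.

```python
from bisect import bisect_left


def build_offsets(variants):
    """Collect cumulative length changes caused by indels.

    variants: iterable of (pos, ref, alt) tuples with 1-based VCF positions,
    sorted by position, non-overlapping, all on one chromosome (assumption).
    Returns two parallel lists: the last reference base covered by each
    indel's REF allele, and the cumulative shift after that indel.
    """
    ends, cum = [], []
    shift = 0
    for pos, ref, alt in variants:
        delta = len(alt) - len(ref)
        if delta == 0:
            continue  # SNPs/MNPs do not move downstream coordinates
        shift += delta
        ends.append(pos + len(ref) - 1)
        cum.append(shift)
    return ends, cum


def ref_to_consensus(pos, ends, cum):
    """Map a 1-based reference position to the consensus coordinate.

    Only indels whose REF allele ends before `pos` contribute to the shift.
    Positions inside a deleted span have no exact counterpart and are
    returned without that deletion's shift.
    """
    i = bisect_left(ends, pos)  # number of indels ending before pos
    return pos + (cum[i - 1] if i else 0)


# Toy example: a 2 bp insertion anchored at position 5 and a 2 bp deletion
# anchored at position 20 (hypothetical values, not from my data).
toy_variants = [(5, "A", "ATT"), (20, "GCC", "G")]
ends, cum = build_offsets(toy_variants)

# Reference upstream region 10..30 mapped onto the consensus build.
region = (ref_to_consensus(10, ends, cum), ref_to_consensus(30, ends, cum))
print(region)  # (12, 30)
```

In this toy case the region starts 2 bp later in the consensus because of the upstream insertion and is 2 bp shorter because of the deletion inside it, which is the kind of mismatch I am seeing when I reuse the reference coordinates unchanged.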
Any and all help is highly appreciated. Thank you!
Please review all your previous questions and comment on or validate them (green tick on the left):
How reproducible is transcript quantification through salmon?
Generate hashes for all sequences in a FASTA file
FASTA to GTF/GFF using reference genome
How to export stdout of bash script to a text file
All done! Thanks.
related: Vcf Locations In Consensus Sequence