Hi,
I'm trying to obtain upstream sequence of DNA by using getSeq and I'm getting the following error. Please help.
library(BSgenome.Hsapiens.NCBI.GRCh38)
for (i in range(1:142325)){ if (!check.integer(case_chr[i]))next else up_seq[i] <- getSeq(Hsapiens, "case_chr[i]", start = case_loc[i]-1000, end = case_loc[i] - 1)}
Error in .getOneSeqFromBSgenomeMultipleSequences(x, names[i], start[i], :
sequence case_chr[i] not found
I'm assuming I'm facing this issue because getSeq function takes only chromosome number as its first argument and there are no chromosome numbers for a few rows instead of which there is something like CHR_HSCHR1_1_CTG11. Am I supposed to filter them out?
Thanks. Could you please tell me about what columns the input file is supposed to contain for bedtools flank?
Bedtools for most of the programs like intersect, coverage, flank, getfasta etc accepts bed, gff & vcf format. VCF is not relevant here. GFF and BED expects different things in different columns and also differs by being 1-based and 0-based repectively. A concise explanation is there at this link. Please see.
Now, if you have just chromosome, start position & end position , you can right away create a bed file. For a gene from 1st to 100th base in chromosome one, the bed entry would be like. chr1 0 100 gene1
Take care to use tab as separator.
If you have a gff or bed from some source you may directly use that. In case there is trouble, just update with a snippet of input file and the command used.
I get the following error when I run the following :
Error: The requested genome file could not be opened.
How do I go about this?
You have to provide genome length file after "-g" option instead of .vcf file. In bedtools flank, they have mention the below note: "In order to prevent creating intervals that violate chromosome boundaries, bedtools flank requires a genome file defining the length of each chromosome or contig"
Uday, Genome file here means a file that gives the names and lengths of scaffolds/chromosome in a multifasta.There are numerous programs avaialble for it or you may write a one liner or so. The file extension does not matter.
Thanks. Where can I find the same for the human genome?
I suppose you have a fasta file dowloaded for human genome. If not, download it from Ensembl or any source your purpose necessitates.
Once you have fasta file, then getting the genome file is only a matter of running a small script like from this page
Yes, I do have the fasta file of each chromosome. Is there a way I could avoid using Python and do the same using terminal? Thanks.
There is a bioawk option also given at the bottom of the mentioned page.
Thank you so much all of you for your quick response. I'm getting the following error on running bedtools : ERROR: chrom "CHR_HSCHR1_4_CTG3" not found in genome file. Exiting. Any solutions for this?
It means chrom "CHR_HSCHR1_4_CTG3" is present in .vcf file but not in genome size file. Try finding this chromosome ID in your genome size file and if it is not there, then add the same in file.
Genome size file contains just the chromosome number followed by its length. I don't think adding it is a viable solution as I have nearly hundreds of such instances.
Please share the output the below two commands in order to identify what could be wrong.
Does this help in rectifying anything? I'm assuming the bed file to be my output with appropriate ranges mentioned in it. The above mentioned vcf file is the input to the bedtool flank.
Yes. It rectifies many things.
Probably you have downloaded the vcf from some source. The human genome fasta upon which the vcf is generated in not the same human genome fasta file upon which the genome file is created.
Solution: Get the genome file generated on the same fasta file on which the vcf is generated for case1. (Try to identify the particular Hg assembly used in generating the VCF and download it)
Yes indeed, the vcf file I have is an output of Ensembl's Variant effect predictor (VEP GRCh38.p12 assembly), while genome fasta file information is taken from ncbi (same assembly). Any alternatives other than filtering such SNPs?
Looks like even though same assembly from NCBI some thing different in Ensembl.
Alternative.
Go to ensembl page . GRCh38.p12 itself should be there. Go to the download FASTA option. Once downloaded,cross check the first column values in this fasta and your vcf.
If good to go, make genome file with this fasta and redo flank.
First column values in fasta? I'm sorry I didn't get you, could you please elaborate? You want me to download genome fasta chromosome wise? And on finding its length, it's still the same length.
My mistake. I intended the headers of the fasta and the first column of vcf.
To elaborate I expect all the missing entries to be present in the fasta from Ensembl.
Do not download individual chrs. Instead go for the toplevel fasta.
Once downloaded run the script for genome file with lengths,then check the first columns of that with that of the vcf. That should do. If all uniq entries of first column of vcf is not there in genome file then something else still need to be sorted out.
Most likely things will be fine with the Ensembl fasta. Please check.
Alright. Will do as you told and get back to you in sometime.
I'm getting the following error when i try to run the script for genome file using bioawk :
The file required is ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz . Please download and run the script.
Worked! Thanks a ton. Now I have the following :
Just a minor hiccup, how can I obtain it in proper .bed format which can be passed as an argument to bedtools getfasta? As in chr start stop format?
What is the snippet pasted?
Output of bedtools flank. But apparently the input format for bedtools getfasta is chr start stop.
Please share the exact flank command used.
Problem solved! Thanks :).
Do you know what happens if
bedtools flank
runs into a neighbouring feature? Does it consider/report it somehow?I have not seen any reporting/error/warning.
While using flank for promoters the scenario happens all the time.
If you take 1kb or 2kb upstream of a gene start, in many plant genomes I have noticed that it would run inside the gene region of some other gene.
Usually there is limiting if the coordinate is going beyond the chromosome/scaffold length, for which purpose, genome file is made required.
I see, could be even more extreme in case of fungi with their dense genomes. That'd be a nice option do add for bedtools.