I know this error has been reported many times, and I have read through proposed solutions such as those in "Error In Bedtools Getfasta: Chromosome Not Found", but none of them resolved it. I am still unable to extract sequences for the intervals in my BED file.
I haven't had this issue before. I used the exact same script; the only change is a newer genome file, GRCm39 instead of GRCm38. Both were downloaded from Ensembl using their FTP links. What's interesting is that the command works with GRCm38 but not with GRCm39.
I have no idea why it wouldn't work for both? Is it the index file? I use bedtools to generate the .fai file.
Here are the first lines of both files for comparison; they look similar:
Mus_musculus.GRCm38.dna.primary_assembly.fa
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Mus_musculus.GRCm39.dna.primary_assembly.fa
>1 dna:chromosome chromosome:GRCm39:1:1:195154279:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
For the sake of completeness, here are the rest of the important pieces I am using.
Here are a couple lines of the error:
index file /blue/berglund/j.ellis/annotation_files/FTP_GenesAndSeqFiles/Mus_musculus.GRCm39.dna.primary_assembly.fa.fai not found, generating...
WARNING. chromosome (19) was not found in the FASTA file. Skipping.
WARNING. chromosome (12) was not found in the FASTA file. Skipping.
Here is my script:
bedtools getfasta -fo ./fasta_SEevents_Struct_MEF.fa -fullHeader -s \
  -fi /blue/berglund/j.ellis/annotation_files/FTP_GenesAndSeqFiles/Mus_musculus.GRCm39.dna.primary_assembly.fa \
  -bed ./BED_SEevents_Struct_MEF.txt
Here is a sample of my bed file:
19 55924529 55924580 Tcf7l2 1 +
19 53242503 53242627 Add3 1 +
19 45364086 45364274 Btrc 1 +
Of note, to generate this BED file I use a Python script to convert a pandas DataFrame to a BED file, then upload it to the cluster using Cyberduck. Below is the export command:
bed_out.to_csv('./BED_SEevents_Struct_MEF.bed', sep = '\t', index = False, header = False)
Thanks to Pierre Lindenbaum's comment for the help! I figured out a solution. Since bedtools' built-in .fai generation was not indexing the chromosomes properly for whatever reason, I used
samtools faidx
instead and got a .fai file that recorded just the chromosome numbers, not the whole header line. Then it ran fine!
Hi! Have you checked if your fasta file actually contains an entry for chromosome 19?
I just double checked using
cat Mus_musculus.GRCm39.dna.primary_assembly.fa | sed -n /19/p > a.txt
and in my out file I see it come up as expected.
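A pattern like /19/ matches any line that contains "19" anywhere, including headers of other sequences whose coordinates happen to contain those digits, so it does not by itself show that a sequence named exactly 19 exists. Below is a minimal Python sketch of a stricter check. It compares the chromosome names in the BED file against the sequence names taken from either the FASTA headers or the first column of the .fai index. The function names are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token after ">" and that the BED and .fai files are tab-delimited.

```python
def fasta_names(fasta_path):
    """Return sequence names: the first whitespace-delimited token after '>'."""
    names = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def fai_names(fai_path):
    """Return column 1 of a tab-delimited .fai index, exactly as written."""
    names = []
    with open(fai_path) as handle:
        for line in handle:
            if line.strip():
                names.append(line.rstrip("\r\n").split("\t")[0])
    return names


def bed_chroms(bed_path):
    """Return the set of values in column 1 of a tab-delimited BED file."""
    chroms = set()
    with open(bed_path) as handle:
        for line in handle:
            if line.strip() and not line.startswith(("#", "track", "browser")):
                chroms.add(line.rstrip("\r\n").split("\t")[0])
    return chroms


def missing_chroms(bed_path, names):
    """Return BED chromosome names that have no exact match in `names`."""
    return sorted(bed_chroms(bed_path) - set(names))
```

For example, `missing_chroms("BED_SEevents_Struct_MEF.txt", fai_names("Mus_musculus.GRCm39.dna.primary_assembly.fa.fai"))` would list the BED chromosomes that the index does not know under that exact name. If the index stored the whole header line as the name, every chromosome would show up as missing here, while the same call with `fasta_names(...)` would come back empty. Reading a whole genome FASTA line by line in Python is slow, so checking the much smaller .fai first is probably the quicker route.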