bedtools getFasta doesn't match chromosomes error
1
0
Entering edit mode
2.1 years ago
Joseph ▴ 10

I know this error has been reported numerous times and have read through proposed solutions such as presented in this Error In Bedtools Getfasta: Chromosome Not Found and none of them have resolved this error. I am still unable to get my file to successfully grab sequences.

I actually haven't had this issue before, used the exact same script the only change is using a newer genome file GRCm39 instead of GRCm38, but both were downloaded from ensembl using their FTP links. Whats interesting is that when I run the command with the GRCm38 it works but not with the GRCm39.

I have no idea why it wouldn't work for both? Is it the index file? I use bedtools to generate the .fai file.

Here is a comparison header of the files and they look similar:

Mus_musculus.GRCm38.dna.primary_assembly.fa
1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Mus_musculus.GRCm39.dna.primary_assembly.fa
>1 dna:chromosome chromosome:GRCm39:1:1:195154279:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

For sake of completion here are the rest of the important pieces I am using

Here are a couple lines of the error:

index file /blue/berglund/j.ellis/annotation_files/FTP_GenesAndSeqFiles/Mus_musculus.GRCm39.dna.primary_assembly.fa.fai not found, generating...
WARNING. chromosome (19) was not found in the FASTA file. Skipping.
WARNING. chromosome (12) was not found in the FASTA file. Skipping.

Here is my script:

bedtools getfasta -fo ./fasta_SEevents_Struct_MEF.fa -fullHeader -s\
 -fi /blue/berglund/j.ellis/annotation_files/FTP_GenesAndSeqFiles/Mus_musculus.GRCm39.dna.primary_assembly.fa\
 -bed ./BED_SEevents_Struct_MEF.txt

Here is a sample of my bed file:

19      55924529        55924580        Tcf7l2  1       +
19      53242503        53242627        Add3    1       +
19      45364086        45364274        Btrc    1       +

Of note, to generate this bed file I use a python script to convert a pandas df to a bed file and upload it to the cluster using cyberduck. Below is the export command

bed_out.to_csv('./BED_SEevents_Struct_MEF.bed', sep = '\t', index = False, header = False)
fasta getfasta bedtools • 1.2k views
ADD COMMENT
1
Entering edit mode

Thanks to Pierre Lindenbaum's comment for the help! I figure out a solution. Since the .fai files were not indexing the chromosomes properly for whatever reason using bedtools in-built .fai generation I utilized samtools faidx intead and was able to get an .fai file that pulled just the chromosome #s not the whole line. Then it ran fine!

ADD REPLY
0
Entering edit mode

Hi! Have you checked if your fasta file actually contains an entry for chromosome 19?

ADD REPLY
0
Entering edit mode

I just double checked using cat Mus_musculus.GRCm39.dna.primary_assembly.fa | sed -n /19/p > a.txt and in my out file I see it come up as expected.

>1 dna:chromosome chromosome:GRCm39:1:1:195154279:1 REF
>11 dna:chromosome chromosome:GRCm39:11:1:121973369:1 REF
>19 dna:chromosome chromosome:GRCm39:19:1:61420004:1 REF
>7 dna:chromosome chromosome:GRCm39:7:1:144995196:1 REF
ADD REPLY
1
Entering edit mode
2.1 years ago

what are the outputs of the following commands

file Mus_musculus.GRCm39.dna.primary_assembly.fa
file BED_SEevents_Struct_MEF.bed
cut -f 1 BED_SEevents_Struct_MEF.bed | uniq | sort | uniq | paste -s -d ';'
grep "^>" Mus_musculus.GRCm39.dna.primary_assembly.fa  | cut -c2- | cut -d ' ' -f1 | paste -s -d ';'
cut -f1  Mus_musculus.GRCm39.dna.primary_assembly.fa.fai   | paste -s -d ';'
ADD COMMENT
0
Entering edit mode

In order:

Mus_musculus.GRCm39.dna.primary_assembly.fa: ASCII text

BED_SEevents_Struct_MEF.bed: ASCII text

1;10;11;12;13;14;15;16;17;18;19;2;3;4;5;6;7;8;9;X

(im shortening the output for the numbered chromosomes for brevity) 1;10;11;12;13;14;15;16;17;18;19;2;3;4;5;6;7;8;9;MT;X;Y

1 dna:chromosome chromosome:GRCm39:1:1:195154279:1 REF;10 dna:chromosome chromosome:GRCm39:10:1:130530862:1 REF;11 dna:chromosome chromosome:GRCm39:11:1:121973369:1 REF;12 dna:chromosome chromosome:GRCm39:12:1:120092757:1 REF;13 dna:chromosome chromosome:GRCm39:13:1:120883175:1 REF;14 dna:chromosome chromosome:GRCm39:14:1:125139656:1 REF;15 dna:chromosome chromosome:GRCm39:15:1:104073951:1 REF;16 dna:chromosome chromosome:GRCm39:16:1:98008968:1 REF;17 dna:chromosome chromosome:GRCm39:17:1:95294699:1 REF;18 dna:chromosome chromosome:GRCm39:18:1:90720763:1 REF;19 dna:chromosome chromosome:GRCm39:19:1:61420004:1 REF;2 dna:chromosome chromosome:GRCm39:2:1:181755017:1 REF;3 dna:chromosome chromosome:GRCm39:3:1:159745316:1 REF;4 dna:chromosome chromosome:GRCm39:4:1:156860686:1 REF;5 dna:chromosome chromosome:GRCm39:5:1:151758149:1 REF;6 dna:chromosome chromosome:GRCm39:6:1:149588044:1 REF;7 dna:chromosome chromosome:GRCm39:7:1:144995196:1 REF;8 dna:chromosome chromosome:GRCm39:8:1:130127694:1 REF;9 dna:chromosome chromosome:GRCm39:9:1:124359700:1 REF;MT dna:chromosome chromosome:GRCm39:MT:1:16299:1 REF;X dna:chromosome chromosome:GRCm39:X:1:169476592:1 REF;Y dna:chromosome chromosome:GRCm39:Y:1:91455967:1

here is the output using GRCm38: 1;10;11;12;13;14;15;16;17;18;19;2;3;4;5;6;7;8;9;MT;X;Y

So clearly the GRCm39 has a different naming scheme or there is something weird that happened when I tried to download that file. I used wget to grab the file. Not sure how to go from there?

Thank you for the reply!

ADD REPLY
0
Entering edit mode

I figured it out with that help, my comment details that. Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 1465 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6