8 months ago
ummeswaiba
Hello everyone,
I have a .txt file, converted to input.bed, that contains the coordinates of gene sequences (around 1507 in total). To fetch these sequences I'm using this command:
bedtools getfasta -fi hg38.fa.align.gz -bed input.bed -fo output.fast
As you can see, I'm using hg38.fa.align.gz as my reference genome. However, when I run it, I get this error.
index file hg38.fa.align.gz.fai not found, generating...
ERROR: mismatched line lengths at line 3 within sequence
File not suitable for fasta index generation.
I tried to fix the issue with the reference genome file, but to no avail. If anyone can suggest a fix, I would be grateful.
Is there any significance to the word "align" in the file name? You need to use a plain FASTA sequence file as the reference, not an aligned FASTA file.
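The "mismatched line lengths" error comes from index generation, which expects every sequence line within a record to have the same width (only the last line of a record may be shorter). A rough way to check a file before handing it to bedtools is sketched below. The function name check_fasta_line_lengths is made up for this example, and the rule it applies is an approximation of what the indexer checks, not its actual code.

```shell
# Hypothetical helper: read a plain-text FASTA on stdin and report the first
# record whose sequence lines do not share one width. Only the final line of
# a record is allowed to be shorter than the others.
check_fasta_line_lengths() {
    awk '
        /^>/ { name = $0; width = 0; short_seen = 0; next }
        {
            len = length($0)
            if (short_seen || (width > 0 && len > width)) {
                # A line follows a short line, or is longer than the first one.
                print "mismatched line length in " name " at line " NR
                bad = 1
                exit
            }
            if (width == 0) { width = len }
            else if (len < width) { short_seen = 1 }
        }
        END { exit bad }
    '
}
```

For a gzipped file it could be called as `gzip -cd reference.fa.gz | check_fasta_line_lengths`, where reference.fa.gz is a placeholder name. The function prints nothing and returns 0 when the line widths look consistent.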
There is no specific significance to "align", and I have switched to a plain FASTA file. The code runs, but I get this warning for the majority of the chromosomes.
I have rechecked my BED file as well. It is in BED6 format. Both the FASTA and the .bed file use chrN-style names. What could cause this problem?
Your chromosome names need to match in BED and the reference file. You also appear to have some problem with the BED file at the line noted above.
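One way to see which names fail to match is to compare column 1 of the BED file against the FASTA record names. This is only a sketch: the function name missing_chroms is invented here, and it assumes the record name is the header text between ">" and the first whitespace, which may not be exactly how your tool names records.

```shell
# Hypothetical helper: print each chromosome name from the BED file (column 1)
# that has no record of the same name in the FASTA file.
# Usage: missing_chroms input.bed reference.fa
missing_chroms() {
    bed=$1
    fasta=$2
    awk '
        # First file (FASTA): remember the first word of every header line.
        NR == FNR { if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 } next }
        # Second file (BED): skip comment, track and browser lines.
        /^(#|track|browser)/ { next }
        # Report each unmatched name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$bed"
}
```

If this prints nothing, every name in the BED file has a same-named FASTA record under the assumption above. If it prints most of your chromosomes, the headers are probably the thing to look at.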
Can you show us output of:
Sure:
As you can see, your FASTA file has headers that seem to contain extraneous information beyond just >chr1. That is the reason bedtools can't match the two. It also looks like you don't have a contiguous sequence for the entire chromosome. In order for getfasta to be able to find an interval, e.g. chr1 3718559 3719552, the entire sequence will need to be present as a single record.
I am not sure what the best solution is going to be in this case, or what exact sequence you are looking to extract. You may want to use your BED file with the original reference genome if you just want to extract the intervals present in your BED.
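To check whether the intervals can be found at all, it may help to list the length of each FASTA record and flag BED lines that fall outside it. The sketch below uses two made-up function names (fasta_lengths, intervals_out_of_range) and again assumes the record name is the first word of the header. BED end coordinates are exclusive, so an end equal to the record length is treated as valid.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Hypothetical helper: print BED lines whose chromosome is missing from the
# FASTA or whose end coordinate lies beyond the end of the record.
# Usage: intervals_out_of_range input.bed reference.fa
intervals_out_of_range() {
    fasta_lengths "$2" | awk '
        NR == FNR { len[$1] = $2; next }
        !($1 in len) || ($3 + 0) > (len[$1] + 0) { print }
    ' - "$1"
}
```

A chromosome name that appears several times in the fasta_lengths output, each with a short length, would suggest the file holds fragments rather than whole chromosomes, which matches the situation described above.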