Question

Issue when trying to extract sequences from a genome using bed coordinates and samtools faidx

0

Entering edit mode

3.3 years ago

epi_vincent ▴ 10

Hi,

I am trying to convert ChIP peak coordinates (.bed output from Pepr or sicer) to fasta sequences for motif analysis using RSAT.

To that end, I am using faidx from samtools.

Here is the head of my bed file that contains the coordinates from which I want to extract the sequences:

head peaks.bed
#Chromosome Start End
NC_029516.1 16800 20599
NC_029516.1 21200 22799

This is the command I typed to extract the sequences from NCBI quail genome:

samtools faidx GCF_001577835.2_Coturnix_japonica_2.1_genomic.fna -r peaks.bed -o peaks.fna

However, I get this error message:

[W::fai_get_val] Reference #Chromosome Start End not found in file, returning empty sequence
[faidx] Failed to fetch sequence in #Chromosome Start End

For information, this is the head of my genome:

head GCF_001577835.2_Coturnix_japonica_2.1_genomic.fna
>NC_029516.1 Coturnix japonica isolate 7356 chromosome 1, Coturnix japonica 2.1, whole genome shotgun sequence
GAGCTGTTggagggtccagaggaggcgCTGAgatgggcagagctggagcagctccgctatgaggaaaggctggggGAGAT
GGGATGGTTTAGATTGGAGAAGGATGCGGGGAGACCTCAGTGTGGGCatccaatacttgaagggagtgtataaacaggag

Can you help me solving the issue?

Thanks!

Vincent

faidx • 1.3k views

ADD COMMENT • link 3.3 years ago by epi_vincent ▴ 10

score 0 · Answer 1 · 2022-04-06

0

Entering edit mode

3.3 years ago

iraun 6.2k

Try removing #Chromosome Start End from your bed file.

ADD COMMENT • link 3.3 years ago by iraun 6.2k

0

Entering edit mode

Thanks, however the issue is still there

[W::fai_get_val] Reference NC_029516.1 16800 20599 not found in file, returning empty sequence
[faidx] Failed to fetch sequence in NC_029516.1 16800 20599

I wonder if this has sommeting to do with the format of my bed file, as it is tab spaced and not using : and - signs. If that is the issue, any way to do the conversion? (help of faidx: -r, --region-file FILE Read regions from a file. Format is chr:from-to, one per line)

ADD REPLY • link 3.3 years ago by epi_vincent ▴ 10

1

Entering edit mode

Right, that must be the issue, you can try something like this and see if it works:

awk -F'\t' '{print $1":"$2"-"$3}' INPUT_FILE > OUTPUT_FILE

ADD REPLY • link 3.3 years ago by iraun 6.2k

1

Entering edit mode

This worked for me:

awk -F " " '{print $1":"$2"-"$3}' peaks.bed > peaks_OK.bed

If I used "\t" it did put :- at the end of each line. I guess columns are space delimited and not tab delimited. And this worked! I could extract my genomic regions using faidx.

Thank you very much!

ADD REPLY • link 3.3 years ago by epi_vincent ▴ 10