Issue when trying to extract sequences from a genome using bed coordinates and samtools faidx
1
0
Entering edit mode
2.7 years ago
epi_vincent ▴ 10

Hi,

I am trying to convert ChIP peak coordinates (.bed output from Pepr or sicer) to fasta sequences for motif analysis using RSAT.

To that end, I am using faidx from samtools.

Here is the head of my bed file that contains the coordinates from which I want to extract the sequences:

head peaks.bed
#Chromosome Start End
NC_029516.1 16800 20599
NC_029516.1 21200 22799

This is the command I typed to extract the sequences from NCBI quail genome:

samtools faidx GCF_001577835.2_Coturnix_japonica_2.1_genomic.fna -r peaks.bed -o peaks.fna

However, I get this error message:

[W::fai_get_val] Reference #Chromosome Start End not found in file, returning empty sequence
[faidx] Failed to fetch sequence in #Chromosome Start End

For information, this is the head of my genome:

head GCF_001577835.2_Coturnix_japonica_2.1_genomic.fna
>NC_029516.1 Coturnix japonica isolate 7356 chromosome 1, Coturnix japonica 2.1, whole genome shotgun sequence
GAGCTGTTggagggtccagaggaggcgCTGAgatgggcagagctggagcagctccgctatgaggaaaggctggggGAGAT
GGGATGGTTTAGATTGGAGAAGGATGCGGGGAGACCTCAGTGTGGGCatccaatacttgaagggagtgtataaacaggag

Can you help me solving the issue?

Thanks!

Vincent

faidx • 983 views
ADD COMMENT
0
Entering edit mode
2.7 years ago
iraun 6.2k

Try removing #Chromosome Start End from your bed file.

ADD COMMENT
0
Entering edit mode

Thanks, however the issue is still there

[W::fai_get_val] Reference NC_029516.1 16800 20599 not found in file, returning empty sequence
[faidx] Failed to fetch sequence in NC_029516.1 16800 20599

I wonder if this has sommeting to do with the format of my bed file, as it is tab spaced and not using : and - signs. If that is the issue, any way to do the conversion? (help of faidx: -r, --region-file FILE Read regions from a file. Format is chr:from-to, one per line)

ADD REPLY
1
Entering edit mode

Right, that must be the issue, you can try something like this and see if it works:

awk -F'\t' '{print $1":"$2"-"$3}' INPUT_FILE > OUTPUT_FILE
ADD REPLY
1
Entering edit mode

This worked for me:

awk -F " " '{print $1":"$2"-"$3}' peaks.bed > peaks_OK.bed

If I used "\t" it did put :- at the end of each line. I guess columns are space delimited and not tab delimited. And this worked! I could extract my genomic regions using faidx.

Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 1313 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6