How to extract repeat element coordinates from an Ensembl genome?
2
0
Entering edit mode
2.2 years ago
cchen635 • 0

I am trying to extract the repeat element coordinates, or bed specifically, from an Ensembl soft-masked fasta with format:

>1 dna_rm:primary_assembly primary_assembly:bTaeGut1_v1.p:1:1:114557415:1 REF
AAGCCAGCCATATCAGTATCCCAGCCGCGCAGGATCTGAGCGCCACCCAGCAATGGCCAG
CCACTATGTGTGCCCCTGTATCACTGGAATTCAAGGCCCACCACCCTGTTCTCAGCCATA

I know it is available via Ensembl API but my computer is hard to handle the whole dataset. Can anyone provide a code or guide me to a possible answer?

Ensembl repeat-element genome • 923 views
ADD COMMENT
2
Entering edit mode
2.1 years ago
barslmn ★ 2.3k

If you don't want to download the whole data at once you can stream fastq files from ensembl directly and process.

for chrom in $(echo $(seq 1 22)"\nX\nY\nMT"); do  # loop over chromosomes
    wget -O- -q "https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.chromosome.$chrom.fa.gz" | # stream the fasta files
    zcat | # Unzip the fasta files
    sed 1d |  # Skip the fasta id line
    grep -Ebo '[[:lower:]]+' | # Capture the sm group -E for expanded regex -b for start character position -o for matches only
    {IFS=":"; while read start seq; do # Split the grep output
        len=$(echo "$seq" | wc -c);  # get the length of the sequence
        echo "$chrom\t$start\t$(( start + len ))"; # print out the bed
    done};
done > sm.bed # write to file
ADD COMMENT
0
Entering edit mode
2.1 years ago
liorglic ★ 1.5k

Here is the python script I use, search_fasta_for_regex.py:

"""
Search a fasta file for a given
regex, and output matches coordinates
in BED format
"""

import sys
import re
from Bio import SeqIO

in_fasta = sys.argv[1]
regex = sys.argv[2]
out_bed = sys.argv[3]


regex = re.compile(regex)
with open(out_bed, 'w') as fo:
  for rec in SeqIO.parse(in_fasta, 'fasta'):
    for m in regex.finditer(str(rec.seq)):
      l = [ rec.id, m.start(), m.start() + len(m.group()) ]
      l = [str(x) for x in l]
      print('\t'.join(l), file=fo)

I then run it like this: python search_fasta_for_regex.py in.fasta "[atgcn]+" out.bed.

ADD COMMENT

Login before adding your answer.

Traffic: 1891 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6