I know it is available via Ensembl API but my computer is hard to handle the whole dataset.
Can anyone provide a code or guide me to a possible answer?
If you don't want to download the whole data at once you can stream fastq files from ensembl directly and process.
for chrom in $(echo $(seq 1 22)"\nX\nY\nMT"); do # loop over chromosomes
wget -O- -q "https://ftp.ensembl.org/pub/release-108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.chromosome.$chrom.fa.gz" | # stream the fasta files
zcat | # Unzip the fasta files
sed 1d | # Skip the fasta id line
grep -Ebo '[[:lower:]]+' | # Capture the sm group -E for expanded regex -b for start character position -o for matches only
{IFS=":"; while read start seq; do # Split the grep output
len=$(echo "$seq" | wc -c); # get the length of the sequence
echo "$chrom\t$start\t$(( start + len ))"; # print out the bed
done};
done > sm.bed # write to file
Here is the python script I use, search_fasta_for_regex.py:
"""
Search a fasta file for a given
regex, and output matches coordinates
in BED format
"""
import sys
import re
from Bio import SeqIO
in_fasta = sys.argv[1]
regex = sys.argv[2]
out_bed = sys.argv[3]
regex = re.compile(regex)
with open(out_bed, 'w') as fo:
for rec in SeqIO.parse(in_fasta, 'fasta'):
for m in regex.finditer(str(rec.seq)):
l = [ rec.id, m.start(), m.start() + len(m.group()) ]
l = [str(x) for x in l]
print('\t'.join(l), file=fo)
I then run it like this: python search_fasta_for_regex.py in.fasta "[atgcn]+" out.bed.