rpolicastro Thank you, this works like magic. Also, since I am dealing with a large genome, is there a way to increase the speed of seqkit? I tried increasing the number of threads (-j/--threads) in seqkit, but it doesn't seem to help. Kindly suggest!
Beyond increasing the number of threads, the operation will likely remain slow with large files and a regular-expression-based search. The author of the tool frequents Biostars, so perhaps they have an idea.
Hi,
You can also use the following Python script instead of seqkit to get the hard-masked genome coordinates. It is faster than seqkit, and you can then use bedtools as rpolicastro mentioned.
from Bio import SeqIO
# Input and output file paths
input_fasta_file = "hard_masked_genome.fasta"
output_bed_file = "hard_masked_regions.bed"
# Function to write a BED record
def write_bed_record(file, chrom, start, end, name=".", score="."):
    file.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\n")

# Open the output BED file and scan each sequence in the input FASTA
with open(output_bed_file, "w") as bed_file:
    for record in SeqIO.parse(input_fasta_file, "fasta"):
        chrom = record.id  # Chromosome or contig name
        seq = str(record.seq)
        mask_start = None
        mask_end = None
        for i, base in enumerate(seq):
            if base == "N":
                # "N" indicates a hard-masked position
                if mask_start is None:
                    mask_start = i
                mask_end = i
            elif mask_start is not None:
                # A run of Ns just ended; BED end is exclusive, hence +1
                write_bed_record(bed_file, chrom, mask_start, mask_end + 1)
                mask_start = None
                mask_end = None
        # Handle a masked region that runs to the end of the sequence
        if mask_start is not None:
            write_bed_record(bed_file, chrom, mask_start, mask_end + 1)
I think when you use -j THREAD, the job is parallelized only if you have a lot of files to work with. So if you only have one file, it will not be faster.
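If you still want to use more than one core on a single FASTA, a rough sketch (untested; the cost of sending each chromosome string to a worker process may offset the gain) is to parallelize over records with Python's multiprocessing, reusing the N-run regex idea from the script above:

import re
from multiprocessing import Pool
from Bio import SeqIO

N_RUN = re.compile(r"N+")

def masked_regions(item):
    # Runs in a worker process: return the BED lines for one sequence
    chrom, seq = item
    return "".join(f"{chrom}\t{m.start()}\t{m.end()}\n" for m in N_RUN.finditer(seq))

if __name__ == "__main__":
    records = ((r.id, str(r.seq)) for r in SeqIO.parse("hard_masked_genome.fasta", "fasta"))
    with Pool(processes=4) as pool, open("hard_masked_regions.bed", "w") as bed:
        # imap keeps the output in input order, so records stay in FASTA order
        for lines in pool.imap(masked_regions, records):
            bed.write(lines)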