How to convert hard-masked genome to soft-masked?
2
1
Entering edit mode
21 months ago
BioinfoBee • 0

Hello All,

I was curious if anyone is aware of any tools or approach that can convert hard-masked genome to soft-masked format. For example,

genome: ....ATGCATGCATGC......

conversion required: ....ATGCNNNNATGC...... to ....ATGCatgcATGC......

Regards,
B

repeatmasking repeats • 2.2k views
ADD COMMENT
4
Entering edit mode
21 months ago

I would first get the coordinates of Ns in the hard-masked genome and output them as a bed file. Here's an example using seqkit.

seqkit locate --bed -rPp "N+" hardmasked.fasta > N_coords.bed

You can then use bedtools to soft mask the non-masked genome.

bedtools maskfasta -soft -fi unmasked.fasta -bed N_coords.bed -fo softmasked.fasta
ADD COMMENT
0
Entering edit mode

rpolicastro Thank you. This works like a magic. Also, as I am dealing with large genome, is there a way to increase the speed of seqkit?. I tried increasing number of threads (-j/--threads) in seqkit but it doesn't seems to work. Kindly suggest!

Regards, B

ADD REPLY
1
Entering edit mode

Besides increasing the number of threads further, the speed of the operation will likely remain slow with large file sizes and a regular expression. The author of the tool frequents biostars, so perhaps they have an idea.

ADD REPLY
2
Entering edit mode

I think when you use -j THREAD, the job will be paralleled only if you have a lot of files to work with.

So if you only have 1 file, it will not be faster.

ADD REPLY
1
Entering edit mode
14 months ago
Dr. Animo ▴ 130

Hi, You can also use the following Python script instead of seqkit to get hardmasked genome coordinates. It is faster than seqkit and then use bedtools as mention rpolicastro

from Bio import SeqIO

# Input and output file paths
input_fasta_file = "hard_masked_genome.fasta"
output_bed_file = "hard_masked_regions.bed"

# Function to write a BED record
def write_bed_record(file, chrom, start, end, name=".", score="."):
    file.write(f"{chrom}\t{start}\t{end}\t{name}\t{score}\n")

# Open the input FASTA file and output BED file
with open(output_bed_file, "w") as bed_file:

    for record in SeqIO.parse(input_fasta_file, "fasta"):
        chrom = record.id  # Chromosome or contig name
        seq = str(record.seq)


        mask_start = None
        mask_end = None

        for i, base in enumerate(seq):
            if base == "N":
                # "N" indicates masked regions
                if mask_start is None:
                    mask_start = i
                mask_end = i
            elif mask_start is not None:

                write_bed_record(bed_file, chrom, mask_start, mask_end + 1)
                mask_start = None
                mask_end = None

        # Check if a masked region ends at the end of the sequence
        if mask_start is not None:
            write_bed_record(bed_file, chrom, mask_start, mask_end + 1)
ADD COMMENT

Login before adding your answer.

Traffic: 1939 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6