Question

Concatenating all unmapped or alternative contigs in a reference genome

0

7 months ago

manikin_python9f • 0

Hi, I want to use ConTExt to analyse TEs in Drosophila. To use the tool, I need to concatenate all the unmapped and alternative contigs in the reference genome to reduce the number of output files. How should I do this? Thank you

contigs fasta reference • 1.1k views

updated 7 months ago by Semir ▴ 60 • written 7 months ago by manikin_python9f • 0

0

cat contig1.fa contig2.fa ... contigN.fa > one_file.fa will concatenate the files into one.

7 months ago by GenoMax 152k

0

Or, if the files share a common pattern (here assuming it's *.fa), you can automate it: find . -maxdepth 1 -name '*.fa' ! -name 'one_file.fa' | xargs cat > one_file.fa

7 months ago by ATpoint 88k

0

Thank you. I don’t want to cat some files to one file. I want to put all the unmapped and alternative contigs together, under one header, and add it to the reference genome.

7 months ago by manikin_python9f • 0

score 1 · Answer 1 · 2024-12-06

1

7 months ago

GenoMax 152k

Then use one of the solutions here: How to merge multifasta sequence into a single sequence having only one header?
https://stackoverflow.com/questions/69471751/how-to-concatenate-sequences-in-the-same-multifasta-files-and-then-print-result

Once you have this file, you can cat it onto the end of the original reference genome.
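
If the unplaced and alternative contigs are still inside the reference FASTA rather than in separate files, the split and the merge can be done in one pass. The following is a minimal sketch, not a tested ConTExt recipe: the function name merge_minor_contigs, the output header combined_contigs, and the list of main chromosome names are all assumptions to adapt to your assembly's naming.

```shell
#!/usr/bin/env bash
# Sketch: keep the main chromosomes as separate records and merge every other
# record (unplaced, alternative, etc.) into one sequence under a single header.
# merge_minor_contigs, "combined_contigs" and the chromosome list are
# hypothetical names chosen for illustration.
merge_minor_contigs() {
  local ref="$1"      # input reference FASTA
  local main="$2"     # comma-separated names of records to keep as they are
  awk -v main="$main" '
    BEGIN {
      n = split(main, names, ",")
      for (i = 1; i <= n; i++) keep[names[i]] = 1
    }
    /^>/ {
      # The record name is the header up to the first whitespace, minus ">"
      name = substr($1, 2)
      is_main = (name in keep)
      if (is_main) print
      next
    }
    is_main { print; next }   # main chromosome sequence lines pass through
    { merged = merged $0 }    # everything else is appended to one string
    END {
      if (merged != "") {
        print ">combined_contigs"
        print merged
      }
    }
  ' "$ref"
}

# Example call, assuming the main records are named 2L,2R,3L,3R,4,X,Y:
# merge_minor_contigs dmel.fa "2L,2R,3L,3R,4,X,Y" > dmel_merged.fa
```

The merged record is written as one long line at the end of the output, and the contigs are joined directly with nothing between them. If your downstream tool expects wrapped lines or a spacer between contigs, that would need to be added separately.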

7 months ago by GenoMax 152k

0

Very helpful. Thank you

7 months ago by manikin_python9f • 0

score 0 · Answer 2 · 2024-12-07

I'd also like to share an alternative answer that might be less cryptic than the sed command:

# Placeholders: set fasta_dir (directory of contig FASTA files) and ref_genome first

# Write one header, then all contig sequences joined into a single unwrapped line
# (seqkit seq -s prints sequences only; -w 0 disables line wrapping)
{
  echo ">combined_sequence"
  cat "${fasta_dir}"/*.fa | seqkit seq -s -w 0 | tr -d '\n'
  echo
} > combined.fa

# Combine with the reference genome
cat "${ref_genome}" combined.fa > final_combined.fa

# Clean up the temporary file
rm combined.fa
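
One way to check that no sequence was lost in the merge is to compare the total number of sequence characters before and after. This is only a sketch: count_bases is a hypothetical helper name, and it assumes plain, uncompressed FASTA files with Unix line endings.

```shell
# Sketch: count sequence characters (everything that is not a header line)
# in one or more FASTA files. count_bases is a hypothetical helper name.
count_bases() {
  cat "$@" | awk '!/^>/ { total += length($0) } END { print total + 0 }'
}

# Example: the two numbers should be equal if no sequence was dropped
# count_bases "${ref_genome}" "${fasta_dir}"/*.fa
# count_bases final_combined.fa
```

Counting headers with grep -c '^>' on the final file is a complementary check: it should report one more record than the original reference.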

I generated this answer using amplicon.ai, a tool I've been building to make it easier to write and run pipelines iteratively. Feel free to try it out.
