Hi, I want to use ConTExt to analyse TEs in Drosophila. To use the tool, I need to concatenate all the unmapped and alternative contigs in the reference genome to reduce the number of output files. How should I do this? Thank you
Then use one of the solutions here: How to merge multifasta sequence into a single sequence having only one header?
https://stackoverflow.com/questions/69471751/how-to-concatenate-sequences-in-the-same-multifasta-files-and-then-print-result
Once you have this file, you can cat it onto the end of the original reference genome.
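For readers who just want the idea behind those linked solutions, here is a minimal sketch using only standard shell tools. The function name merge_fasta, the default header combined_contigs, and the file names in the usage comment are all placeholders I made up, not anything ConTExt requires. It simply drops every header line, joins the remaining sequence lines, and writes them under one new header.

```shell
# merge_fasta [HEADER]: read a multi-FASTA on stdin and write a single record.
# HEADER defaults to "combined_contigs" (a hypothetical name).
merge_fasta() {
  header="${1:-combined_contigs}"
  printf '>%s\n' "$header"
  # Drop the original header lines and remove all line breaks,
  # so the sequences are joined end to end on one line.
  grep -v '^>' | tr -d '\n\r'
  printf '\n'
}

# Example usage (file names are placeholders):
#   merge_fasta unplaced_and_alt < contigs.fa > combined.fa
#   cat reference.fa combined.fa > reference_plus_combined.fa
```

Note that this joins the contigs directly, with no spacer between them, so reads mapping across a junction will not reflect any real sequence.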
I'd also like to share an alternative answer that might be less cryptic than the sed command:
# Assumes fasta_dir and ref_genome are set as shell variables, and that
# fasta_dir contains only the unmapped/alternative contig files
# Strip the headers and join every contig sequence into a single record
{
  echo ">combined_sequence"
  cat "${fasta_dir}"/*.fa | seqkit seq -s -w 0 | tr -d '\n'
  echo
} > combined.fa
# Combine with reference genome
cat "${ref_genome}" combined.fa > final_combined.fa
# Clean up temporary files
rm combined.fa
I generated this answer using amplicon.ai, a tool I've been building to make it easier to write and execute pipelines iteratively. Feel free to try it out.
cat contig1.fa contig2.fa .. contigN.fa > one_file.fa
will concatenate the files into one. Or, if you can find any sort of common identifier (here assuming it's *.fa), you can automate it:
find . -maxdepth 1 -name '*fa' | xargs cat > one_file.fa
Thank you. I don’t want to cat some files into one file. I want to put all the unmapped and alternative contigs together, under one header, and add it to the reference genome.
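If the unmapped and alternative contigs are already inside the reference FASTA, one possible approach is to do it in a single pass with awk: records whose header matches a pattern are merged under one new header at the end of the file, and everything else passes through unchanged. This is only a sketch. The function name collapse_contigs, the header name, and the pattern are assumptions on my part, so check the contig names in your own assembly before relying on any pattern.

```shell
# collapse_contigs REGEX NAME < genome.fa > genome_collapsed.fa
# Records whose header line matches REGEX are merged under ">NAME"
# (written last); all other records are passed through unchanged.
collapse_contigs() {
  awk -v pat="$1" -v name="$2" '
    /^>/  { merge = ($0 ~ pat); if (!merge) print; next }
    merge { seq = seq $0; next }   # buffer sequence lines of matching records
          { print }                # sequence lines of records that are kept
    END   { if (seq != "") { print ">" name; print seq } }
  '
}

# Example usage (pattern and file names are assumptions, adjust to your headers):
#   collapse_contigs 'chrUn|_random' unplaced_and_alt < genome.fa > genome_collapsed.fa
```

The merged sequence is held in memory and written as a single unwrapped line, which should be fine for a few megabases of contigs but may be slow for very large inputs.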