Concatenating all unmapped or alternative contigs in a reference genome
15 days ago

Hi, I want to use ConTExt to analyse transposable elements (TEs) in Drosophila. To use the tool, I need to concatenate all the unmapped and alternative contigs in the reference genome into a single sequence, which reduces the number of output files. How should I do this? Thank you.

contigs fasta reference • 539 views

cat contig1.fa contig2.fa ... contigN.fa > one_file.fa will concatenate the files into one.


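Or, if the files share a common naming pattern (here assuming it's *.fa), you can automate this with find . -maxdepth 1 -name '*.fa' ! -name 'one_file.fa' | xargs cat > one_file.fa. The ! -name exclusion keeps the output file, which lives in the same directory, from being picked up as an input.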


Thank you, but I don't want to concatenate several files into one file. I want to put all the unmapped and alternative contigs together under one header, as a single sequence, and add that to the reference genome.


Very helpful. Thank you

14 days ago
Semir ▴ 50

I'd also like to share an alternative answer that might be less cryptic than the sed command:

# Assumes fasta_dir, ref_genome and cpus are already set as shell variables,
# and that the contigs to merge are the .fa files in fasta_dir.

# Write a single header, then the sequences of all .fa files joined into one line
# (seqkit seq -s prints sequences only; -w 0 disables line wrapping)
{
  echo ">combined_sequence"
  seqkit seq -j "${cpus}" -s -w 0 "${fasta_dir}"/*.fa | tr -d '\n'
  echo
} > combined.fa

# Append the combined record to the reference genome
cat "${ref_genome}" combined.fa > final_combined.fa

# Clean up the temporary file
rm combined.fa
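
If the unmapped and alternative contigs are records inside the reference FASTA itself rather than separate files, a plain awk pass may be enough. The following is only a minimal sketch: the function name merge_contigs, the header pattern, and the merged header name are all hypothetical, and contig naming differs between assemblies, so check your own headers first with grep '^>'.

```shell
#!/usr/bin/env bash
# Sketch: merge selected records of one FASTA file into a single record.
# merge_contigs and its three arguments are illustrative, not part of any tool.
merge_contigs() {
  local fasta="$1"     # input reference FASTA
  local pattern="$2"   # regex matched against header lines (an assumption)
  local name="$3"      # header to use for the merged record

  awk -v pat="$pattern" -v name="$name" '
    /^>/ {                         # header line: does this record get merged?
      merge = ($0 ~ pat)
      if (!merge) print            # headers of kept records pass through
      next
    }
    merge { lines[++n] = $0; next }  # store sequence lines of merged contigs
    { print }                        # sequence lines of kept records
    END {
      if (n > 0) {
        print ">" name
        for (i = 1; i <= n; i++) printf "%s", lines[i]
        print ""                     # end the single-line merged sequence
      }
    }
  ' "$fasta"
}
```

A call might look like merge_contigs genome.fa '^>(chrUn_|.*_random)' unmapped_combined > genome.merged.fa, where genome.fa and the pattern are placeholders. Unmatched records are written unchanged, and the merged record is appended at the end as one unwrapped line; if a downstream tool needs wrapped lines, something like seqkit seq -w 60 can re-wrap the result.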

I generated this answer using amplicon.ai, a tool I've been building to make it easier to write and run pipelines iteratively. Feel free to try it out.
