Collapse multifasta file by specific chromosome names
2.0 years ago
YOUSEUFS ▴ 30

I have a multifasta file with unique identifiers ('SUBJECT.1', 'SUBJECT.2', etc.) like this:

>SUBJECT.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA

How would I extract every sequence line associated with each unique identifier and concatenate them to form an output file that looks like this:

>SUBJECT.1
AAATTTCCCGGG
>SUBJECT.2
GGGCCCTTTAAA

Maybe something along these lines?:

1) Simplify headers:

cut -d':' -f1 input.fa > output.fa

2) Concatenate entries that share an ID using seqkit, specifically seqkit concat.

I'd make 100% sure that the entries in the FASTA file are ordered properly before merging, and that you don't have duplicated IDs.
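One way to do that ordering check, as a rough sketch: the function below lists every ID whose fragments do not appear in ascending start order. The name check_fragment_order is made up for illustration, and it assumes headers of the form >ID:START-END(STRAND) as in the question, with no ':' inside the ID itself.

```shell
# Hypothetical helper (not part of any tool): print IDs whose fragments
# are NOT in ascending start-coordinate order within the file.
# Assumption: headers look like >ID:START-END(STRAND), no ":" inside the ID.
check_fragment_order() {
    awk '
        /^>/ {
            split($0, parts, ":")        # parts[1] = ">ID", parts[2] = "START-END(STRAND)"
            id = substr(parts[1], 2)     # drop the leading ">"
            split(parts[2], range, "-")  # range[1] = START
            start = range[1] + 0         # force numeric comparison
            if ((id in last) && start < last[id]) bad[id] = 1
            last[id] = start
        }
        END { for (id in bad) print id }
    ' "$1" | sort
}
```

Run it as check_fragment_order input.fa. No output means every ID's fragments are already in ascending order; any ID it prints would need its records sorted before collapsing.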


seqkit concat only works when merging two or more files; this is a single file.

2.0 years ago
iraun 6.2k

Then use awk:

cut -d':' -f1 input.fa > output.fa

awk '/^>/ { id = $0; if (!seen[id]++) order[++n] = id; next } { seq[id] = seq[id] $0 } END { for (i = 1; i <= n; i++) print order[i] "\n" seq[order[i]] }' output.fa > output_collapsed.fa

In this example I have assumed that the IDs you want to collapse are the part of each header before the ':'; please adapt the code to your own IDs as needed. And as I said, mind the order of the sequences: fragments are concatenated in the order they appear in the file.
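If you would rather skip the intermediate file, the two steps can probably be folded into a single awk pass. This is only a sketch under the same assumption (the ID is everything before the first ':'); collapse_fasta is a made-up function name. It prints IDs in order of first appearance rather than relying on awk's unspecified array traversal order.

```shell
# Hypothetical helper: trim each header at the first ":" and collapse all
# sequence lines per ID in one pass, keeping IDs in first-seen order.
# Assumption: the ID to collapse on is the header text before the first ":".
collapse_fasta() {
    awk '
        /^>/ {
            split($0, parts, ":")              # parts[1] = ">ID"
            id = parts[1]
            if (!seen[id]++) order[++n] = id   # remember first appearance
            next
        }
        { seq[id] = seq[id] $0 }               # append sequence lines as they come
        END {
            for (i = 1; i <= n; i++) print order[i] "\n" seq[order[i]]
        }
    ' "$1"
}
```

Usage would be collapse_fasta input.fa > output_collapsed.fa. Fragments are still joined in file order, so this does not sort them by coordinate.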


Thank you! This works perfectly, and everything is ordered correctly.
