2.0 years ago
YOUSEUFS
I have a multi-FASTA file with unique identifiers ('SUBJECT.1', 'SUBJECT.2', etc.) like this:
>SUBJECT.1.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA
How would I extract every sequence associated with each unique identifier and concatenate them to form an output file that looks like this?
>SUBJECT.1
AAATTTCCCGGG
>SUBJECT.2
GGGCCCTTTAAA
Maybe something along these lines?
1) Simplify the headers (e.g. reduce '>SUBJECT.1:354-700(+)' to '>SUBJECT.1').
2) Concatenate entries with the same ID using seqkit, specifically
seqkit concat
I'd make 100% sure that the entries in the FASTA file are ordered properly before merging, and that you don't have duplicated IDs.
seqkit concat only works when merging two or more files; this is a single file.
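Since everything here lives in one file, a small script may be simpler than splitting the file up for a merge tool. Below is a minimal Python sketch, not a tested pipeline. It assumes every ID has the form NAME.NUMBER at the start of the header and that anything after that (extra suffixes, coordinates, strand) can be dropped, so the first header in the example, '>SUBJECT.1.1:1203-2742(+)', is grouped under 'SUBJECT.1'. Sequences are joined in the order they appear in the file, not by coordinate. The function names `concat_by_id` and `format_fasta` are made up for this illustration.

```python
import re

# Assumption: an ID is "NAME.NUMBER" at the start of the header;
# everything after it (coordinates, strand, extra suffixes) is ignored.
ID_PATTERN = re.compile(r"^>([^.:\s]+\.\d+)")


def concat_by_id(lines):
    """Group sequence lines by ID and join them in file order."""
    merged = {}  # dicts keep insertion order, so IDs stay in first-seen order
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            match = ID_PATTERN.match(line)
            if match is None:
                raise ValueError(f"unexpected header format: {line}")
            current = match.group(1)
            merged.setdefault(current, [])
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            merged[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in merged.items()}


def format_fasta(merged):
    """Render the merged records as FASTA text, one sequence line per ID."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in merged.items())


# The example records from the question.
sample = """\
>SUBJECT.1.1:1203-2742(+)
AAATTT
>SUBJECT.1:354-700(+)
CCCGGG
>SUBJECT.2:789-2000(+)
GGGCCC
>SUBJECT.2:2012-2742(+)
TTTAAA
"""

print(format_fasta(concat_by_id(sample.splitlines())), end="")
```

On the example records this prints the two merged entries shown in the question. For a real file, pass an open file handle to `concat_by_id` instead of `sample.splitlines()`. If the pieces need to be joined in coordinate order rather than file order, sort the input first, as suggested above.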