How to extract and concatenate fasta lines that match substring?
2.1 years ago
YOUSEUFS ▴ 30

I have a list of unique identifiers

identifiers = ['subject_1', 'subject_2'] 

and a multi-fasta file containing

>CDS::subject_1::123
AAATTT
>CDS::subject_1::354
CCCGGG
>CDS::subject_2::789
GGGCCC
>CDS::subject_2::765
TTTAAA

How would I extract every sequence associated with each unique identifier and concatenate them into an output file that looks like

>subject_1
AAATTTCCCGGG
>subject_2
GGGCCCTTTAAA
fasta python • 707 views
2.1 years ago

seqkit and csvtk answer

# Lazy .+? and [^:\s]+ keep a trailing ":" out of the captured ID
seqkit replace -p "^.+?::([^:\s]+):.+$" -r "\$1" test.fasta |
  seqkit fx2tab |
  csvtk fold -tH -f1 -v2 -s"," |
  sed 's/,//g' |
  seqkit tab2fx
2.1 years ago
cat input.fa  | paste - - | sed 's/>CDS:://;s/::[^\t]*//' | awk '{seq[$1]=sprintf("%s%s",seq[$1],$2);} END{for(n in seq) printf(">%s\n%s\n",n,seq[n]);}'

>subject_1
AAATTTCCCGGG
>subject_2
GGGCCCTTTAAA

I'm having trouble getting this to work. Perhaps I should have stated this more clearly: the FASTA file looks more like

>CDS::NC_005291.1:5877-7537(-)
AAATTT
>CDS::NC_005291.1:7650-7800(-)
CCCGGG
>CDS::NC_007641.1:5877-7537(-)
AAATTT
>CDS::NC_007641.1:7650-7800(-)
CCCGGG

Which I'm trying to turn into

>NC_005291.1
AAATTTCCCGGG
>NC_007641.1
AAATTTCCCGGG

Change the sed expression so that it strips the >CDS:: prefix and everything from the next colon onward.
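
Since the question is tagged python, here is a minimal Python sketch of the same idea: pull the ID out of each header with a regular expression, then concatenate the sequences that share it. It assumes every header looks like ">CDS::<id>:<rest>", which covers both header styles posted in this thread. The names merge_by_id, to_fasta and the wanted argument are made up for illustration and do not come from any library.

```python
import re

# Assumed header layout: ">CDS::<id>:<anything>", which covers both
# ">CDS::subject_1::123" and ">CDS::NC_005291.1:5877-7537(-)".
HEADER_RE = re.compile(r"^>CDS::([^:\s]+):")


def merge_by_id(lines, wanted=None):
    """Concatenate sequences that share an ID, keeping first-seen order.

    `wanted` is an optional collection of IDs to keep (hypothetical
    argument); when it is None, every ID is kept.
    """
    merged = {}
    key = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            match = HEADER_RE.match(line)
            key = match.group(1) if match else None
            if key is not None and wanted is not None and key not in wanted:
                key = None  # skip records whose ID was not requested
            if key is not None:
                merged.setdefault(key, [])
        elif key is not None and line:
            # Sequence line (possibly one of several) for the current ID
            merged[key].append(line)
    return {name: "".join(parts) for name, parts in merged.items()}


def to_fasta(merged):
    """Format the merged records as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in merged.items())
```

With the file from the answer above, something like print(to_fasta(merge_by_id(open("input.fa"))), end="") should print the merged records; passing wanted=identifiers restricts the output to the IDs in the list from the question. Headers that do not match the assumed layout are skipped together with their sequences.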
