Question

find difference sequences with different header

0

Entering edit mode

6.7 years ago

Jason ▴ 10

I have two fasta file with 200 sequences. I want to use shell commands to find difference sequences with different headers between theses two fasta files and save that in new file with heder from file as output.

this command find match sequences even if header is not same. it look at match sequences and skip header.

seqkit common --by-seq --ignore-case file1.fasta file2.fasta file3.fasta > out.fasta

I want to find difference sequences from file 1 not find in file 2 and save that in file 3. we want to compare by sequences not header becasue both file have different header but some sequences same and other are different.

file1:
>NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF


>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL


file2:
>sp|Q8R519|ACMSD_MOUSE 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Mus musculus GN=Acmsd PE=1 SV=2
MKIDIHTHILPKEWPDLEKRFGYGGWVQLQQQGKGEAKMIKDGKLFRVIQQNCWDPEVRI
REMNQKGVTVQALSTVPVMFSYWAKPKDTLELCQFLNNDLAATVARYPRRFVGLGTLPMQ
APELAVEEMERCVKALGFPGIQIGSHINTWDLNDPELFPIYAAAERLNCSLFVHPWDMQM
DGRMAKYWLPWLVGMPSETTMAICSMIMGGVFEKFPKLKVCFAHGGGAFPFTIGRIAHGF
NMRPDLCAQDNPSDPRKYLGSFYTDSLVHDPLSLKLLTDVIGKDKVMLGTDYPFPLGEQE
PGKLIESMAEFDEETKDKLTAGNALAFLGLERKLFE

>sp|P35738|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF

>sp|P26149|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4
MPGWSCLVTGAGGFLGQRIIQLLVQEEDLEEIRVLDKVFRPETRKEFFNLETSIKVTVLE
GDILDTQYLRRACQGISVVIHTAAIIDVTGVIPRQTILDVNLKGTQNLLEACIQASVPAF
IFSSSVDVAGPNSYKEIVLNGHEEECHESTWSDPYPYSKKMAEKAVLAANGSMLKNGGTL
QTCALRPMCIYGERSPLISNIIIMALKHKGILRSFGKFNTANPVYVGNVAWAHILAARGL
RDPKKSPNIQGEFYYISDDTPHQSFDDISYTLSKEWGFCLDSSWSLPVPLLYWLAFLLET
VSFLLSPIYRYIPPFNRHLVTLSGSTFTFSYKKAQRDLGYEPLVSWEEAKQKTSEWIGTL
VEQHRETLDTKSQ




result file 3

>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL

sequence • 2.6k views

ADD COMMENT • link updated 6.7 years ago by cpad0112 21k • written 6.7 years ago by Jason ▴ 10

0

Entering edit mode

6.7 years ago

Pierre Lindenbaum 164k

linearize both files, sort and comm on sequence:

comm -3 \
      <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' f1.fasta | cut -f 2 | sort)  \
      <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' f2.fasta | cut -f 2 | sort)

ADD COMMENT • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hey, This code did not do what I expect it.

thanks

ADD REPLY • link 6.7 years ago by Jason ▴ 10

0

Entering edit mode

6.7 years ago

cpad0112 21k

output:

   $ seqkit grep -svf <(seqkit seq -s -w 0 test2.fa) <(seqkit seq -w 0 test1.fa)

  >NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
    MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLS
    LLPATKLGSIAIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCT
    TINKVCASGMKAIMMASQSLMCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIV
    KDGLTDVYNKIHMGSCAENTAKKLNIARNEQDAYAINSYTRSKAAWEAGKFGNEVIPVTV
    TVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTAANASTLNDGAAALVLMTADA
    AKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMWEVNEAFSLVV
    LANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
    AMLIQKL

input (not pasting the sequences):

$ seqkit seq -n test2.fa 
sp|Q8R519|ACMSD_MOUSE 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Mus musculus GN=Acmsd PE=1 SV=2
sp|P35738|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
sp|P26149|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4

$ seqkit seq -n test1.fa 
NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]

ADD COMMENT • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Hi, This code will do same mistake because some sequences in file 2 have different id but same identical sequences in file 1.

We get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa, but the problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result3. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLY • link 6.7 years ago by Jason ▴ 10

0

Entering edit mode

Well, I think this reply meant for finswimmer post above.

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Hi, I also tested your code. It gave me some sequences find in both file 1 and file 2. it will not save sequences in file1 that not find in file 2.

The problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLY • link 6.7 years ago by Jason ▴ 10

GenoMax · Accepted Answer · 2018-04-06

3

Entering edit mode

6.7 years ago

finswimmer 16k

Hello,

what about:

cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fa

fin swimmer

EDIT:

Another way is to first find all sequences they have in common, extract there IDs and than look in your first file for all sequences that do not match these IDs:

Find common sequences between file1 and file2 and extract ther IDs:

seqkit common -s file1.fa file2.fa|grep '>'|cut -c2- > common_ids

Get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa:

seqkit grep file1.fa -v -n -f common_ids -o file3.fa

ADD COMMENT • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hey,

When I used this command it gave me this error and i'm sure about the name of files: cat file1.fasta file2.fasta|seqkit rmdup seqkit rmdup -s -o file3.fa

[ERRO] fastx: open seqkit: no such file or directory

How can I resolve this problem?

If I change the command to this command cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fasta [INFO] 3 duplicated records removed

It will remove duplicate files and will not save the difference in file 3

thanks

ADD REPLY • link 6.7 years ago by Jason ▴ 10

0

Entering edit mode

Oops, I copied the command twice. See me updated one.

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hey, I applied this code

cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fa

the result saved in file3 as :

>NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens] MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAESKSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGETVAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAGTMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSALKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQELYRNFKSISKALVERGGVVTSNPLGF

This is not what I expected because this sequence found in both file 1 and file 2. I want to save sequence find in file1 but not find in file 2

the expect result should be like this in file 3

>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens] MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSLMCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTAANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGASAMLIQKL

thanks

ADD REPLY • link updated 6.7 years ago by GenoMax 148k • written 6.7 years ago by Jason ▴ 10

0

Entering edit mode

Hm, strange. You're right. I think this shouldn't happen.

[I moved my answer as an edit in the first post.]

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Thank you so much for your hep

ADD REPLY • link 6.7 years ago by Jason ▴ 10

0

Entering edit mode

Fine if I could help you.

I moved my answers as an edit to my first post. So if you think this is your solution you can mark this post as accepted.

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hello,

After I made some analysis for many sequences I found this code will hold some common sequences in both file 1 and file 2 and save that in file 3

Find common sequences between file1 and file2 and extract ther IDs: seqkit common -s file1.fa file2.fa|grep '>'|cut -c2- > common_ids

Get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa: seqkit grep file1.fa -v -n -f common_ids -o file3.fa

We get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa, but the problem here some sequences in file 2 have different index but same identical sequences. That mean some sequences from file 2 will be in result3. I want to save in file 3 all sequences from file 1 not find in file 2.

ADD REPLY • link 6.7 years ago by Jason ▴ 10

0

Entering edit mode

Hello Jason,

please use the code tag to mark code within your post. It is much more readable.

I'm confused with your answer. seqkit common -s file1.fa file2.fa is doing a comparison by sequence (ignoring differences in the ID) and give you all sequences back that are in both files. These sequences get the ID from the first file. With grep and cut we extract these IDs to remove those sequences from file1 in the next step leaving behind only sequences that are in file1 but not in file2.

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k