I have two FASTA files with 200 sequences. I want to use shell commands to find the sequences that differ between these two files, where the headers are also different, and save them in a new file, keeping the header from the input file.
This command finds matching sequences even if the headers are not the same. It compares the sequences and ignores the headers.
seqkit common --by-seq --ignore-case file1.fasta file2.fasta file3.fasta > out.fasta
I want to find the sequences from file 1 that are not found in file 2 and save them in file 3. The comparison should be by sequence, not by header, because the two files have different headers, but some sequences are the same and others are different.
file1:
>NP_000009.1 very long-chain specific acyl-CoA dehydrogenase, mitochondrial isoform 1 precursor [Homo sapiens]
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF
>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL
file2:
>sp|Q8R519|ACMSD_MOUSE 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Mus musculus GN=Acmsd PE=1 SV=2
MKIDIHTHILPKEWPDLEKRFGYGGWVQLQQQGKGEAKMIKDGKLFRVIQQNCWDPEVRI
REMNQKGVTVQALSTVPVMFSYWAKPKDTLELCQFLNNDLAATVARYPRRFVGLGTLPMQ
APELAVEEMERCVKALGFPGIQIGSHINTWDLNDPELFPIYAAAERLNCSLFVHPWDMQM
DGRMAKYWLPWLVGMPSETTMAICSMIMGGVFEKFPKLKVCFAHGGGAFPFTIGRIAHGF
NMRPDLCAQDNPSDPRKYLGSFYTDSLVHDPLSLKLLTDVIGKDKVMLGTDYPFPLGEQE
PGKLIESMAEFDEETKDKLTAGNALAFLGLERKLFE
>sp|P35738|ODBB_RAT 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial OS=Rattus norvegicus GN=Bckdhb PE=1 SV=3
MQAARMAASLGRQLLRLGGGSSRLTALLGQPRPGPARRPYAGGAAQLALDKSDSHPSDALTRKKPAKAES
KSFAVGMFKGQLTTDQVFPYPSVLNEEQTQFLKELVEPVSRFFEEVNDPAKNDALEMVEETTWQGLKELG
AFGLQVPSELGGVGLCNTQYARLVEIVGMHDLGVGITLGAHQSIGFKGILLFGTKAQKEKYLPKLASGET
VAAFCLTEPSSGSDAASIRTSAVPSPCGKYYTLNGSKLWISNGGLADIFTVFAKTPVTDPATGAVKEKIT
AFVVERGFGGITHGPPEKKMGIKASNTAEVFFDGVRVPSENVLGEVGSGFKVAMHILNNGRFGMAAALAG
TMRGIIAKAVDHATNRTQFGEKIHNFGLIQEKLARMVMLQYVTESMAYMVSANMDQGATDFQIEAAISKI
FGSEAAWKVTDECIQIMGGMGFMKEPGVERVLRDLRIFRIFEGTNDILRLFVALQGCMDKGKELSGLGSA
LKNPFGNAGLLLGEAGKQLRRRAGLGSGLSLSGLVHPELSRSGELAVRALEQFATVVEAKLIKHKKGIVN
EQFLLQRLADGAIDLYAMVVVLSRASRSLSEGHPTAQHEKMLCDTWCIEAAARIREGMAALQSDPWQQEL
YRNFKSISKALVERGGVVTSNPLGF
>sp|P26149|3BHS2_MOUSE 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Mus musculus GN=Hsd3b2 PE=1 SV=4
MPGWSCLVTGAGGFLGQRIIQLLVQEEDLEEIRVLDKVFRPETRKEFFNLETSIKVTVLE
GDILDTQYLRRACQGISVVIHTAAIIDVTGVIPRQTILDVNLKGTQNLLEACIQASVPAF
IFSSSVDVAGPNSYKEIVLNGHEEECHESTWSDPYPYSKKMAEKAVLAANGSMLKNGGTL
QTCALRPMCIYGERSPLISNIIIMALKHKGILRSFGKFNTANPVYVGNVAWAHILAARGL
RDPKKSPNIQGEFYYISDDTPHQSFDDISYTLSKEWGFCLDSSWSLPVPLLYWLAFLLET
VSFLLSPIYRYIPPFNRHLVTLSGSTFTFSYKKAQRDLGYEPLVSWEEAKQKTSEWIGTL
VEQHRETLDTKSQ
Expected result, file 3:
>NP_000010.1 acetyl-CoA acetyltransferase, mitochondrial precursor [Homo sapiens]
MAVLAALLRSGARSRSPLLRRLVQEIRYVERSYVSKPTLKEVVIVSATRTPIGSFLGSLSLLPATKLGSI
AIQGAIEKAGIPKEEVKEAYMGNVLQGGEGQAPTRQAVLGAGLPISTPCTTINKVCASGMKAIMMASQSL
MCGHQDVMVAGGMESMSNVPYVMNRGSTPYGGVKLEDLIVKDGLTDVYNKIHMGSCAENTAKKLNIARNE
QDAYAINSYTRSKAAWEAGKFGNEVIPVTVTVKGQPDVVVKEDEEYKRVDFSKVPKLKTVFQKENGTVTA
ANASTLNDGAAALVLMTADAAKRLNVTPLARIVAFADAAVEPIDFPIAPVYAASMVLKDVGLKKEDIAMW
EVNEAFSLVVLANIKMLEIDPQKVNINGGAVSLGHPIGMSGARIVGHLTHALKQGEYGLASICNGGGGAS
AMLIQKL
Hey,
When I used this command it gave me this error, and I'm sure about the file names: cat file1.fasta file2.fasta|seqkit rmdup seqkit rmdup -s -o file3.fa
[ERRO] fastx: open seqkit: no such file or directory
How can I resolve this problem?
If I change the command to this:
cat file1.fasta file2.fasta|seqkit rmdup -s -o file3.fasta
[INFO] 3 duplicated records removed
It removes the duplicate sequences, but it does not save the difference in file 3.
thanks
Oops, I copied the command twice. See my updated one.
Hey, I applied this code.
The result was saved in file 3 as:
This is not what I expected, because this sequence is found in both file 1 and file 2. I want to save the sequences found in file 1 but not found in file 2.
The expected result in file 3 should be like this:
thanks
Hm, strange. You're right. I think this shouldn't happen.
[I moved my answer as an edit in the first post.]
fin swimmer
Thank you so much for your help.
Glad I could help.
I moved my answers as an edit to my first post. So if you think this is your solution you can mark this post as accepted.
fin swimmer
Hello,
After analysing many sequences, I found that this code still keeps some sequences that are common to both file 1 and file 2 and saves them in file 3.
Find common sequences between file1 and file2 and extract their IDs: seqkit common -s file1.fa file2.fa|grep '>'|cut -c2- > common_ids
Get all sequences from file1 that do not match the IDs in common_ids and store the result in file3.fa: seqkit grep file1.fa -v -n -f common_ids -o file3.fa
The problem is that some sequences in file 2 have a different ID but an identical sequence. That means some sequences that also occur in file 2 end up in file 3. I want to save in file 3 all sequences from file 1 that are not found in file 2.
Hello Jason,
please use the code tag to mark code within your post. It is much more readable.
I'm confused by your answer.
seqkit common -s file1.fa file2.fa
is doing a comparison by sequence (ignoring differences in the ID) and gives you back all sequences that are in both files. These sequences get the ID from the first file. With grep and cut we extract these IDs to remove those sequences from file1 in the next step, leaving behind only sequences that are in file1 but not in file2.
fin swimmer
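For cross-checking the result of the two seqkit steps, here is a minimal sketch in plain POSIX shell and awk that compares by sequence directly, so no ID list is needed. The function name fasta_only_in_first is made up for this example and is not part of seqkit. It assumes both files fit in memory, ignores upper/lower case when comparing, and writes each kept sequence on a single line with its original header from file1.

```shell
# Hypothetical helper (not a seqkit command): print the records of the
# first FASTA file whose sequence does not occur in the second file.
# Headers are ignored for the comparison; case is ignored as well.
fasta_only_in_first() {
    # awk reads file2 first (to collect its sequences), then file1.
    awk '
        # Handle the record that has just been read completely.
        function finish_record(   key) {
            if (hdr == "") return
            key = toupper(seq)
            if (infile == 2) seen[key] = 1            # remember file2 sequence
            else if (!(key in seen)) {                # file1 record not in file2
                print hdr
                print seq
            }
        }
        # First line of each input file: close the previous record, switch file.
        FNR == 1 {
            finish_record(); hdr = ""; seq = ""
            infile = (FILENAME == ARGV[1]) ? 2 : 1
        }
        # Header line starts a new record.
        /^>/ { finish_record(); hdr = $0; seq = ""; next }
        # Sequence line: strip whitespace and append (joins wrapped lines).
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END { finish_record() }
    ' "$2" "$1"
}
```

Usage would look like: fasta_only_in_first file1.fa file2.fa > file3.fa. If file3.fa from the seqkit steps and the output of this sketch contain different headers, that points to the records worth inspecting by hand.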