Sequence extraction from a multi-FASTA file using IDs in another text file
9.3 years ago
tcf.hcdg ▴ 70

Hello

I have a large multi-FASTA file and want to extract some of its sequences. I tried grep on Linux, and it does produce output:

HP-Pavilion-dv6-Notebook-PC ~/bioinformatics/tu/fastaretrival $ grep -A1 -wFf id.txt input.fasta > result.fa

But the result file contains unexpected lines consisting of -- after almost every sequence.

The ID file is as follows:

>C_c04661_3
>C_i00001_3
>C_i00001_6
>C_i00002_4
>C_i00007_4
>C_i00008_3
>C_i00009_4
>C_i00011_2
>C_i00012_5
>C_i00013_6

The input file is as follows

>C_c04661_2
KKETFHSIQNVWIDNNHLTETVKFYENAVHEKRIRIFLELISHTIRLACSILVQMAFLPHHSSFLELQSLWSLTCVIQISHIPLCSSSGKDILHSLPQCFHPSHSSHCGLLLGDLKRSNQHASMDLFPVVGSPQHPESHLLPFFSSLPMLVQPQQVCVVPYHCHTNTVLLPVDYPEHPNPSVSPFLLTFPRLLWQDPLHSQKEQKFQIHLHLCTRFWGSILSHFPTLGSLQNISCSNHLSLSVEELEQQEGPEELKEISHHECAMRHLHLITLSCAPQCEPNPYPCIQHPPSIYDHQPWSQSHQHHPSHLIQEFQFHWQARQHQQQVSQGKKCCLFPGSVHTCVRHQKDSRSPYNQSPLSLCTEACSTQQMAPSFHHQHGNHKEQSFAQIPQNNIYELPPSYQPVQGGGLIGQVAINPLSFSIQKVSTVKDCAVLYPWLFRKMQIEPKRPYPTKKEIGFYIYVCVYASARHREIMYPNMCTAVEKKIIIIKK
>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
>C_c04661_4
FFIIIIFFSTAVHIFGYIISLQCLAEAYTHTYIKPISFLVGYGLLGSQICIFLKSQGYSTAQSLTVDTFCIENERGFMATWPINHPPPCTGEGGSSILFCGIAQKDCSLFPCCWWKDGAICWVEHAYSVQRLNHGLYCQLGERESFCLTHVWTEPQGKRQHFFPETCYCCCLACQWNWNSYCIYKEGWCYLCDHQGYSYILGGQCWMQGGFGSHGAHESVQIRWRCLMAHSEISFSSSGPSCCSNSSTLNERFEQDIFWRLPSVGNQWLKMLPHQNRVHRRWINFCSFECSGSCQSNLGKVRRKGETDGFGCSGSTGSRTVFVWQYGTTHHTCGWTNMGYREEKKGRRWLSGCGDPQTTGNRSIDQACWFDLFRSPSKRPQDEEGKHWGSHEWRISLPEEEHSYGICEIWITQVRDQSDCSSRKELWGYHKNAICTYKILQANLIVWLISSRNILIRFSWTAFSNLTVSVKCFQLSIYHTFMENVSFF
>C_c04661_5
FFYYYYFFFYCSTHIWIHYLSSVPCRGIYTHIYIKANLFLGGIWSFGLSDLHFSEKPRIQYCTILNSRYFLYKEGVYGNLANQSPAALHRLIRRRLVDIILRNLSSKRLFFMISMLLMVERWCHLLGGACLFSTKTQSWTLLSIVGRTGVFLMSHTCVDASREKTAFLSLRNLLLLMLPRLPMELELLLYLQMRRVVLLMTLSPRLLMIIYTWWSMLDAGIRIWLTLRSTKCSNLKVEMSHGTFMMRDLFLFRALLLLQLFNTKMIARYILETSECWKSMAQNASSPEPGTQVKMDLKFLFLLRMQWILPKQSWKSQKERDRVWVLGIVLDWKQDCVCMAMIWNNTSHLLRLDHGLGREEGQKVAFWVLRSSNNWKQVHRSGVLVSLQVPQQEATVRLRMRRVKTLGKSRVEDLVLARRTLWDMNLDHTSQGPKLRLQFEERTMMGLSQKCHLYLQNTTSQPDLGVAYQFQKYSNSFLMDSIFIKLDCFSQMLISIVYLPHILNGMKCFFL
>C_c04661_6
FLLLLFFFLLQYTYLDTLSLFSALQRHIHTHIYKSQSLSWWDMVFWALRFAFFKAKDTVLHNPQSILFVLKMRGGLWQLGQSITRRLAQADKKAVARRYYSAESELKKTVLYDFHVANGGKMVPFAGWSMPIQYKDSIMDSTVNCRENGSLFDVSHMCGLSLKGKDSISFLEKLVIADVASLANGTGTLTVFTNEKGGAIDDSVITKVTDDHIYLVVNAGCRDKDLAHIEEHMKVFKSKGGDVSWHIHDERSLLALQGPLAAPTLQHLMKDDLSKIYFGDFRVLEINGSKCFLTRTGYTGEDGFEISVPSENAVDLAKAILEKSEGKVRLTGLGARDSLRLEAGLCLYGNDMEQHITPVEAGLTWAIGKRRRAEGGFLGAEVILKQLETGPSIRRVGLISSGPPARGHSEIKNEKGENIGEVTSGGFSPCLKKNIAMGYVKSGSHKSGTKVKIAVRGKNYDGAITKMPFVPTKYYKPTFRCGLSVPEIFFVSHGQHFHKTLFQSNANFNCLFTTHSEWNEMFLSS
>C_i00001_1
TKIYDSLIKQKIKISVKMNNQERMKTLTHFPCSHPLCVLSKWFKVVKQICNIHLLVLRPEIYKFKNLSCSKSKLNNALIALTELALMSGLSTKLTFNIRQVKEGKLLQRIIYIEALRHSTLRKITHSLLHSQDTFLVCIFVCLGDKRKFFWSLHCSFSDLCAHRLLVPRCYLYFAFTVTIKLCCSKFFLHSYCFNLDNHLCLLHSLIKILHVSKNSCLCRGFFTLIIIVLLSSETINTKRPTVLLKTSISRPTNLLTNYICSIYITQHNGAISCPLSDWWVGNRRSHFQSSDEERERERERERERPWLVYRDLHRLEDKARQVWYGTIGSWESYINVKKKNKKENKPKKKLTEEQEVIRRSTLTHSNPLNQSIPLIEADPTEDEATEQARLRRSNRLRQKYPPVVYAIHLANPLRIRVIGERRQLLLVLRVSADRGNRWWEFVPAKIWNILGYVLYYWRLILWIGFDDAVFVFIGLGICYLEKLLLGFLLWWLNSVQCSSCFCFSIRSNRDGLGWGFHVITCVNLLLFLGFFLKKNCNLFVIIIKKIL
>C_i00001_2
PNKSMIHLLNDKNKKSNQSKTTRRGRHHIFPVHIHYSVFFLSGSKNKYATFIYYDQRYISSKTSPVQRVNTMLNLSPLQSHLCLVVQNHLILDKKRGNNYFSASFTLRPGIPKLCARSLILSFIAKTLFWCASLYASVTRRESFSGPNSIAASVISVRTGCWFPDSVTFISSPSQPSNSAAASSFCTRTASTWTTICAFFTVSRFCMFPKIPAFAGASSPLSLLSFFPPNRPTQRDLRFCLKLRYPDQQICLPTTFAVEFTLHSDIEMEPFLVHSVTGGFDRGIVEAIFRVRIEKRERERERERERGHGWFTEICNIVKTRLVRYGMGRVLGRATSTESKRRTRKKTSPRKNSRRSRRSDGRRPTQIPTNQYRKPIQRRTRLPNRQGSSGDRTAFAKSIRLWFMQSIWQIHEESENEDNYYYEAQIAVTGGGSLYRRRFGIFLGMFCIIGDFNYGSDLMMQSSSSDVNEFVINKNYYVFFYGGIVFNVHHVFVSQLDLIEMVWAGGFMLHVICYSSVFFKKTVIFLYKLLLRRY
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
>C_i00001_4

The result file, with the unexpected -- lines:

>C_c04661_3
RKKHFIPFRMCGKTIEISILKQSSFMKMLSMRNELEYFWNATPKSGWLVVFCRYKWHFCDSPIIVLSSNCNLNFGPLVSRFHISHSYVLLQARTKSSTRDFPNVFTLLILNLTVASCWGTRDQTNTPDRWTCFQLFEDHLSTQKATFCPSSLPYSPCSSLNRCDVLFHIIAIQTQSCFQSKTIPSTQTRQSHLSFLFQDCFGKIHCILRRNRNFKSIFTCVPGSGEEAFAIDFQHSEVSKIYLAQIIFHVLKSWSSKRALKSKRSLIMNVPDISTFRFEHFHVLLNVSQILIPASSIDHQVYMIISNLGDHRVINSTTLLICKYSKSSSSIGKRGNISNNKFLKERNAVFSLEAQSTHVDIKKTPVLPTIDSRVHDVFVLNRHAPPSKWHHLSTISNMEIIKNSLFELRFRRIISTSYRLLISLCKAAGDLAKLPTPSHFQYKKYRLLRIVQYCILGFSEKCKSESPKDHIPPRKRLAFIYMCVYMPLQGTEERCIQICVLQKKKK
--
>C_i00001_3
QINLFTYMTKINKNLISQNEQPGEDEDIDTFSLFTSTIVCSFVVQSSETNMQHSFISIETRDIVQKPLLFKEIKQCLTYRPYRVSTYVWFEYKIDIYTSKRGEIITSAHHLHGLEAFLNSAQDHSFSPSPRHFSGVHLCMPRLEEKVFLVLTPLQLQSLCAQVAGSQIVLPLLVRLHSNHQTLLQQVLSALVLLQLGQPFVPSSQSHKDSACFQKFLPLQGLLHPYHYCPSFLRIDHKHKETYGFANFDIQTNKSAYQLHLQLNLHYTVTLKWSHFLSTQLVGLIGESSKPFSEFGLRREREREREREREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDT
--
>C_i00001_6
SIFLIIIYTKRLQFFFKKKPKKSNKFTHVITNPQPKPSLLDLIEKQKHDEHTLFNHHKRKPNNSFSNQIHPNPMKTKTASSNPIHNINRQYKTYPRIFQIFAGTNSHHRLPRSALTLSTSSSCLRSPITLILSGFAKWIATTGGYFWRRRFDRRSYLACSVASSSVGSASINGIDWFKGFEWVNVDRLITSCSSVSFFLGLFSFLFFFLTQLMLSQEPIVPYHTRALSSKRCYRSLTSHGLSLSLSLSLSLSSQSELKWLRRFPYQTHQSLSGQEMAPFQCHCVMIQLQMLVSRFVGLDIEVLSKTVGLFVFMVYSEERRTIMIRVKKPLQRQEFLETCRIFMRLRRHKWLSKLKQYECRKNLLQQSLMVTVKANRHYLGTSNLCAQRSLKLQWSDQKNFLFSPRHTKMHTRKVSWLRREVILRRVECLKASMMMRSNYFPSFTCLILNVNFVLKPDISANSVRAISALFNLLFEQERFLNLYISGLNTNKMLHICFTTLNHLERTHYSGCEQGKCVNVFILSWLFILTDIFIYFCHLISESIYLV
--
>C_i00002_4
VSSFIQKDYSFFLKKNLRRVTNLHMSHETPSPNHLYILRNKNMMNIEHYSTTIKENLIIVFLINNKFINLILRRRLHHQIRSIIKLIANNTKHTQEYSKSSPVQTPTTGYRDLRLLLVLVVVVFVLRLLLFLVDLPNGLHKPQADTFGEGGSIAGATLPVRPRPPLDRLLSTVLIGLRDLSGSTSTVSPPAPPVFSWACFLSCSSFLSCSSPKNLSSHTIPDEPCLLNDVTDLCKPAMASLSLSLSLSLSLSLSLLVHRQNYIYDKPQLYQFSFYSYRARQNMNNNYHFVLCTHVLSIPIPPSPTLLLQFNKRKALHIFSVLIMIIVSFIILVLPLDLLVPGLLIHSLVWILLIFIKIRDIEVRRNLLHLYTRSMLNVTKILQHLHFDCTKIRFRVCIIYYMPMWNLQIFWPKIFNVIVVGDLVWKLCVKNSSFNSPTPSNILFCVSATSSNQGQVEFLHKLNTLSMTINGKIEAAAISSICSTLEDYDTWS
--
>C_i00007_4
LSLSLSLSLFSIRTLKMASTIPLSNPPVTEWRRNGSISMSLCNVNSTANVVGKQISWSGYRSFKQNRRSLCVYGLFGGKKDNNDKGEEAPAKAGIFGNMQNLYETVKKAQMVVQVEAVRVQKELAAAEFDGYCEGELIKVTLSGNQQPVRTEITEAAMELGPEKLSLLVTEAYKDAHQKSVLAMKERMSDLAQSLGMPQGLNVNDALKLFPLFYLSNIKCQFCTQTRHKCLCKGDKLSIVFTLTGEVFELIYLWSQYMNVAYLFHYFEPLRKNTLWMTGKMCQCLHPLLVVHFDLDFYLFLSFNKIIDLFG
--
>C_i00008_3
EREAMAGLQRSVTSFRRQGSSGMVWDDRFLGELHQLSQKEEQERKQAQEKTHGGAGGDQTVDVDPLKSLKPINTVDRSRSNGGRGYRTGKVAPAIEPPSPKVSACGLCNPFGKSTKNKSNRRTKTTTTSTKSKRRSRPVVGVCTGEDLEYSWVCFVLLAINLIMDRICSLRLHRIRLMNLLLIRKTIIRFSFMVVECSMFIMFLFLNIRWFGLGVSCDYMCKFVTLLRFFFKKKLSFCINYYEDTLERERERERERERDFFQILAMEGFDGYKPAMAMVGLQCIYTGLALFTRAA
--
>C_i00009_4
VRTYVRTYVRTYVRTTTTTTTTTTLSLSLSLSLSLSLSLSLSLYLSLSLSHNFLVTLLSVLLLTTTSSSDLAKLYILIMKQVVLKLDFHDDRTKKKIMKTVSGHSGIDSISMDSKDMKLTVTGDIDPVSLVSKLRKLCNAEILSVGPPKAPEKKKEEAKKEEPKKQEPKKDELTELQKIWIAHQNAQMVSRPQPQYFVRSVEEDPNACVICAFIDCCDLPSRDVNFFNVGLGELMEGRLICFILFYFINSFELIIIVCLIFIYNSLFP
--
>C_i00011_2
IHSNKNYHDVRTYFVDLNNLHLNLYRLSNVKFIEYFTKNKRKRIEKIQPISTNPILKQITYFYKNQNPKKRKFKDLNSGFFGRFWFGCRFLRFSRLSLLRQNWVNVGKNTTAGDCNTVKQFPQFLIVPHSQLNVSRVDSSLLVVPGSISGQFQNFSGEVFKNGSVDGSTGTSTLGVSSLLEESSDTTHGKLKSSLDGLSDRLLPVSAFPSSGSLGSSLGFCSFHCNEIWKLFSETIRFEFWREQKFVEFVDLV
--
>C_i00012_5
SQISSNSKLNKLLFPPKLESNRFREKLPNFVSAMETTKSTKGGAKGAGGRKGGDRKKSVTKSVKAGLQFPVGRIARFLKKGRYAQRTGTGAPVYLAAVLEYLAAEVLELAGNAARDNKKTRINPRHVQLAVRNDEELGKLLHGVTIASGGVLPNINPVLLPKKTKSAESEKPATKSPKSPKKAVVFKFPFFWVLVLVEICNLFKNGICTNRLDLFNPFSFVLGKIFNEFYLILSLPVQSIMQIILQIYKICSHIMIIFVGMN
--
>C_i00013_6
NQQRITCFPSISNFFKNSNLFTENLLRSFSNGNYKSNQGRSQGSRRKERRRQEEVGDVRQGWTSVPRGSYRSIPQEGKIRSTYWYRCSRLPCCCSIPRRRGFGVGRKCCSQQEDNQPTRSIGCEERGIREVASRCYNRQRWCSSQHPSFATKEDQVCIETCNQITQISQKSLSLGLISFFLGFGSCRNMFVEWDLYVGSFQSFFFCSWNIQILPNFIMIITCTVNLDANYFTDLQNMFSHHDNFCWNE

Can anybody tell me what I am doing wrong with the syntax, the input file, or the ID file?

Thanks

fasta grep • 3.0k views
9.3 years ago

It's because you used -A 1: when grep prints context lines, it puts a -- separator between non-adjacent groups of matches. See http://stackoverflow.com/questions/2168065


Yes, you can use --no-group-separator to remove these extra lines. Contrary to what is mentioned in the thread http://stackoverflow.com/questions/2168065/how-do-i-get-rid-of-line-separator-when-using-grep-with-context-lines, this is not an undocumented option. It is not described in man grep, but it is described in info grep.
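As a rough illustration of the difference, here is a small sketch. It assumes GNU grep (other grep implementations may not accept --no-group-separator), and the sample sequences and file names are made up for the demonstration:

```shell
# Work in a scratch directory with small, made-up sample files
cd "$(mktemp -d)"
printf '>seq1\nAAAA\n>seq2\nCCCC\n>seq3\nGGGG\n' > input.fasta
printf '>seq1\n>seq3\n' > id.txt

# Default behaviour: non-adjacent groups of matches are split by a "--" line
grep -A1 -wFf id.txt input.fasta > with_separator.fa

# GNU grep only (assumption): suppress the separator at the source
grep --no-group-separator -A1 -wFf id.txt input.fasta > result.fa
```

Note that -A1 only captures the whole record when every sequence sits on a single line, as in the question's input.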


Thanks, it works.

grep -v "^--" result.fa > finalresult.fa

My IDs have a space character at the end, and I want to get rid of these spaces. I used this sed command:

sed -i 's/ *$//' id.txt

but it's not working. Any suggestions?


"It's not working" is not that helpful. Perhaps give a description of what behavior you expect and what your command is actually doing. From what I can guess, your issue is that "*$" removes every line in your input file and leaves you with a blank output. You're also using the -i flag which overwrites your existing file. It also appears you have a backtick (`) instead of a single quote starting your expression. Try matching a space before the end of the line explicitly:

sed 's/ $//' < id.txt > id_nospace.txt

As you said, the issue was *$. I tried the code without the backtick (`) and now it gives me the desired output.

sed s/" "// < id.txt > idfinal.txt

As I expected, the code removes the blank spaces from the id.txt file and stores the result in a new file, idfinal.txt.

Thanks for the help.
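For anyone comparing the variants in this exchange: s/ $// removes a single trailing space, while s/" "// removes the first space on each line wherever it occurs. A minimal sketch that strips any run of trailing spaces or tabs, using a made-up ID list and hypothetical file names, could look like this:

```shell
# Work in a scratch directory with a made-up ID list
cd "$(mktemp -d)"
printf '>seq1 \n>seq3  \n' > id.txt   # one and two trailing spaces

# Remove any run of blanks (spaces or tabs) at the end of each line,
# writing to a new file so the original is left untouched
sed 's/[[:blank:]]*$//' id.txt > id_nospace.txt
```

Writing to a new file rather than using -i keeps the original around in case the expression does something unexpected.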

9.3 years ago

You can use a tool like samtools faidx or pyfaidx to do this:

$ (sudo) pip install pyfaidx
$ xargs faidx input.fasta < ids.txt > output.fasta
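One caveat: the IDs in the question's id.txt begin with ">", which is FASTA header syntax rather than part of the sequence name, so they would probably need that prefix stripped before being handed to faidx. If installing a tool is not an option, a plain awk approach is another possibility. The following is only a sketch with made-up data and hypothetical file names; unlike grep -A1, it also copes with sequences wrapped over several lines:

```shell
# Work in a scratch directory with small, made-up sample files
cd "$(mktemp -d)"
printf '>seq1\nAAAA\nAAAA\n>seq2\nCCCC\n>seq3\nGGGG\nGGGG\n' > input.fasta
printf '>seq1\n>seq3 \n' > id.txt     # note the trailing space on the second ID

awk '
    NR == FNR {                        # first file: the ID list
        sub(/^>/, "")                  # drop a leading ">"
        sub(/[ \t\r]+$/, "")           # drop trailing whitespace
        if ($0 != "") wanted[$0] = 1
        next
    }
    /^>/ {                             # FASTA header line
        name = substr($1, 2)           # first word without the ">"
        keep = (name in wanted)
    }
    keep                               # print header and sequence lines of wanted records
' id.txt input.fasta > output.fasta
```

The ID list is read first into a lookup table, so each header in the FASTA file is checked with a single exact-match lookup rather than a substring search.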