Question

Extract certain FASTA sequences from a multi-FASTA file

2.3 years ago

luffy ▴ 130

Hello Experts,

I have a multi-FASTA file with long headers, and I need to extract certain sequences from it.

The following are the first two sequences, as an example:

>nxp:NX_A0A024RBG1-1 \PName=Diphosphoinositol polyphosphate phosphohydrolase NUDT4B isoform Iso 1 \GName=NUDT4B \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=181 \SV=1 \EV=58 \PE=1 \Processed=(1|181|PEFF:0001020|mature protein)
MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQWIVPGGGMEPEEEPG
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
FKVEDAIKVLQCHKPVHAEYLEKLKLGCSPANGNSTVPSLPDNNALFVTAAQTSGLPSSV
R

>nxp:NX_A0A075B6H7-1 \PName=Probable non-functional immunoglobulin kappa variable 3-7 isoform Iso 1 \GName=IGKV3-7 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=116 \SV=1 \EV=45 \PE=1 \ModResPsi=(43|MOD:00798|half cystine)(109|MOD:00798|half cystine) \VariantSimple=(2|Q)(4|L)(5|T)(5|V)(6|H)(6|P)(7|F)(11|M)(11|P)(13|I)(14|*)(15|F)(16|R)(17|N)(18|A)(18|I)(18|N)(18|P)(19|A)(19|I)(19|S)(20|G)(20|K)(21|Q)(22|M)(22|T)(23|A)(23|I)(24|I)(24|L)(25|A)(26|E)(26|K)(26|L)(26|P)(27|F)(29|A)(29|L)(29|S)(30|N)(30|P)(31|Q)(31|V)(32|F)(33|W)(34|F)(34|P)(34|Y)(36|E)(37|*)(37|A)(37|K)(38|S)(39|A)(40|A)(40|N)(41|F)(42|F)(43|*)(43|F)(43|R)(43|W)(43|Y)(44|G)(44|K)(44|M)(45|G)(45|V)(46|C)(46|R)(46|T)(47|H)(48|N)(48|R)(49|I)(49|L)(50|G)(50|N)(51|N)(51|R)(52|G)(52|I)(52|N)(52|R)(52|T)(53|C)(53|S)(54|*)(54|S)(55|I)(55|S)(56|*)(56|C)(56|L)(56|R)(57|C)(57|F)(57|S)(59|H)(59|R)(60|Q)(60|T)(61|L)(61|R)(61|S)(62|D)(62|S)(62|V)(63|P)(64|T)(64|V)(65|T)(66|W)(67|F)(67|I)(67|P)(69|M)(69|T)(70|C)(71|D)(71|V)(72|G)(72|V)(73|F)(73|T)(74|N)(74|S)(75|S)(76|D)(76|T)(77|N)(77|S)(78|C)(78|G)(78|I)(78|N)(79|N)(80|L)(80|Q)(80|S)(81|D)(83|I)(83|S)(84|G)(84|N)(85|A)(85|D)(85|R)(85|V)(86|C)(86|R)(87|A)(87|R)(88|A)(89|E)(89|R)(90|A)(90|I)(90|K)(90|R)(91|E)(91|G)(92|I)(92|L)(93|P)(96|F)(96|S)(99|Q)(100|E)(101|S)(102|*)(102|Q)(103|Y)(104|V)(105|V)(106|L)(108|N)(109|F)(110|*)(110|K)(110|R)(111|R)(113|*)(113|C)(114|I)(116|A)(116|H) \Processed=(1|21|PEFF:0001021|signal peptide)(22|116|PEFF:0001020|mature protein)
MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRASQSVSSSYLTWYQQK
PGQAPRLLIYGASTRATSIPARFSGSGSGTDFTLTISSLQPEDFAVYYCQQDYNLP

I want to extract only those sequences that have SV=2 in the header.

Could anyone help me extract the data?

Any help would be much appreciated.

Thank you for your time

python regex bash fasta • 1.4k views

updated 2.3 years ago by Antonio R. Franco ★ 5.2k • written 2.3 years ago by luffy ▴ 130

First make a list of sequence headers for the sequences you want to extract, e.g. using grep, and then follow the post "Extract fasta sequences from a file using a list in another file".

Please try at least something and then provide code and errors/shortcomings if you get stuck. Biostars is generally not a code-writing service.

2.3 years ago by ATpoint 85k

I have a Python script, get_seq_from_multiFASTA_with_match_in_description.py, that gets the first sequence with a match in the description line. The page here describes that script and points to a Jupyter notebook containing a demo of it. You can run the demo by going there, pressing the launch binder button, and then selecting 'Demo of script to get sequence from multiFASTA file when description contains matching text' from the list of available notebooks. The particular notebook can also be viewed statically, nicely displayed here.

That script could probably be altered to continue on and collect all the instances with a match in the description. As a basis for extending it, you can probably use another of my scripts, remove_seq_from_multiFASTA_with_match_in_description.py, described on the page here and available here.

Also of note: because your current pattern contains an equals sign, I think it won't work on the command line, so you'll want to use the method outlined under 'Use script in a Jupyter notebook to collect sequences from a series of PacBio-sequenced genomes' in the demo page. That way it will easily take the string SV=2 for the search. I haven't built in support for regular expressions yet; however, I left guidance, meant for myself, on how that extension could probably be implemented.
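For anyone who just wants the general idea of "collect every record whose description contains some text", here is a minimal standalone sketch. It is not the script mentioned above; the function names read_fasta and records_matching and the demo records are made up for illustration. Because the search text is passed as an ordinary function argument, the equals sign needs no special handling. Note that a plain substring test for SV=2 would also hit a header containing SV=20.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Sequence lines are concatenated; blank lines are ignored.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def records_matching(lines: Iterable[str], text: str) -> List[Tuple[str, str]]:
    """Collect ALL records whose description contains the literal text."""
    return [(h, s) for h, s in read_fasta(lines) if text in h]


# Hypothetical demo records (IDs and sequences are invented).
DEMO = [
    ">nxp:NX_TEST1-1 \\PName=Example one \\SV=1 \\EV=58",
    "MMKFKPNQ",
    "TRTYDREG",
    "",
    ">nxp:NX_TEST2-1 \\PName=Example two \\SV=2 \\EV=45",
    "MEAPAQLL",
]

for header, seq in records_matching(DEMO, "SV=2"):
    print(f">{header}\n{seq}")
```

For a real file, you would pass an open file handle instead of DEMO, e.g. records_matching(open("your_file.fasta"), "SV=2"), where the file name is a placeholder.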

2.3 years ago by Wayne ★ 2.1k

You may also have a look at SEDA (https://www.sing-group.org/seda/), which includes a "Pattern filtering" operation to filter sequences based on regular expression patterns (https://www.sing-group.org/seda/manual/operations.html#pattern-filtering).
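To illustrate what filtering headers by a regular expression looks like in general, here is a rough Python sketch. It is not SEDA's implementation, and the names SV2 and filter_fasta are hypothetical. The pattern assumes the key appears in the header as a literal backslash followed by SV=2, as in the example records in the question, and uses a negative lookahead so that SV=20 or SV=21 are not picked up.

```python
import re

# A literal backslash, then 'SV=2', then NOT another digit.
SV2 = re.compile(r"\\SV=2(?!\d)")


def filter_fasta(text: str, pattern: "re.Pattern[str]") -> str:
    """Return only the FASTA records in `text` whose header matches `pattern`."""
    kept = []
    keep = False
    for line in text.splitlines():
        if line.startswith(">"):
            # Decide once per record, based on the header line.
            keep = pattern.search(line) is not None
        if keep and line.strip():
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical demo input (IDs and sequences are invented).
demo_text = (
    ">a \\SV=2 \\EV=1\nAAA\n\n"
    ">b \\SV=20 \\EV=1\nCCC\n"
    ">c \\SV=1\nGGG\n"
)
print(filter_fasta(demo_text, SV2), end="")
```

Only record a survives here: b is rejected by the lookahead and c has a different value.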

2.3 years ago by Hugo ▴ 380

score 0 · Answer 1 · 2022-08-02

With cat name_fasta | grep "SV=2" | cut -f1 > names_of_files_having_SV=2.txt you get a file containing the names of the sequences you want.

If cut -f1 does not extract the name, check which delimiter the header line (the first line of each FASTA record) uses. cut splits on tabs by default, so for headers whose fields are separated by spaces use, e.g., cut -d " " -f1.

Then, once you have the file with the sequence names, use faSomeRecord to extract those records.
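If you would rather stay in Python, the same two steps (collect the names, then pull the records by name) can be sketched as follows. This is only an analogue of the grep/cut and faSomeRecord steps, not those tools themselves; the function names ids_with_tag and extract_by_id and the demo records are invented. It assumes the header fields are separated by spaces and that the ID is the first field after the >.

```python
def ids_with_tag(fasta_lines, tag="\\SV=2"):
    """Step 1: IDs of records whose header carries `tag` as a whole field."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()  # split the header on whitespace
            if fields and tag in fields[1:]:
                ids.append(fields[0])  # first field, without the '>'
    return ids


def extract_by_id(fasta_lines, wanted):
    """Step 2: keep header and sequence lines of records whose ID is wanted."""
    out, keep = [], False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep and line:
            out.append(line)
    return out


# Hypothetical demo records (IDs and sequences are invented).
demo_lines = [
    ">id1 \\PName=Some protein \\SV=2 \\EV=1",
    "AAAA",
    "CC",
    "",
    ">id2 \\PName=Other protein \\SV=20",
    "GGGG",
    ">id3 \\PName=Third protein \\SV=1",
    "TTTT",
]

names = ids_with_tag(demo_lines)
print("\n".join(extract_by_id(demo_lines, set(names))))
```

Comparing whole fields rather than substrings means a header with SV=20 is not mistaken for SV=2.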