Hello Experts,
I have a multi-FASTA file with very long headers. I need to extract certain sequences from the file.
The following are the first two sequences as an example:
>nxp:NX_A0A024RBG1-1 \PName=Diphosphoinositol polyphosphate phosphohydrolase NUDT4B isoform Iso 1 \GName=NUDT4B \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=181 \SV=1 \EV=58 \PE=1 \Processed=(1|181|PEFF:0001020|mature protein)
MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQWIVPGGGMEPEEEPG
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
FKVEDAIKVLQCHKPVHAEYLEKLKLGCSPANGNSTVPSLPDNNALFVTAAQTSGLPSSV
R
>nxp:NX_A0A075B6H7-1 \PName=Probable non-functional immunoglobulin kappa variable 3-7 isoform Iso 1 \GName=IGKV3-7 \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=116 \SV=1 \EV=45 \PE=1 \ModResPsi=(43|MOD:00798|half cystine)(109|MOD:00798|half cystine) \VariantSimple=(2|Q)(4|L)(5|T)(5|V)(6|H)(6|P)(7|F)(11|M)(11|P)(13|I)(14|*)(15|F)(16|R)(17|N)(18|A)(18|I)(18|N)(18|P)(19|A)(19|I)(19|S)(20|G)(20|K)(21|Q)(22|M)(22|T)(23|A)(23|I)(24|I)(24|L)(25|A)(26|E)(26|K)(26|L)(26|P)(27|F)(29|A)(29|L)(29|S)(30|N)(30|P)(31|Q)(31|V)(32|F)(33|W)(34|F)(34|P)(34|Y)(36|E)(37|*)(37|A)(37|K)(38|S)(39|A)(40|A)(40|N)(41|F)(42|F)(43|*)(43|F)(43|R)(43|W)(43|Y)(44|G)(44|K)(44|M)(45|G)(45|V)(46|C)(46|R)(46|T)(47|H)(48|N)(48|R)(49|I)(49|L)(50|G)(50|N)(51|N)(51|R)(52|G)(52|I)(52|N)(52|R)(52|T)(53|C)(53|S)(54|*)(54|S)(55|I)(55|S)(56|*)(56|C)(56|L)(56|R)(57|C)(57|F)(57|S)(59|H)(59|R)(60|Q)(60|T)(61|L)(61|R)(61|S)(62|D)(62|S)(62|V)(63|P)(64|T)(64|V)(65|T)(66|W)(67|F)(67|I)(67|P)(69|M)(69|T)(70|C)(71|D)(71|V)(72|G)(72|V)(73|F)(73|T)(74|N)(74|S)(75|S)(76|D)(76|T)(77|N)(77|S)(78|C)(78|G)(78|I)(78|N)(79|N)(80|L)(80|Q)(80|S)(81|D)(83|I)(83|S)(84|G)(84|N)(85|A)(85|D)(85|R)(85|V)(86|C)(86|R)(87|A)(87|R)(88|A)(89|E)(89|R)(90|A)(90|I)(90|K)(90|R)(91|E)(91|G)(92|I)(92|L)(93|P)(96|F)(96|S)(99|Q)(100|E)(101|S)(102|*)(102|Q)(103|Y)(104|V)(105|V)(106|L)(108|N)(109|F)(110|*)(110|K)(110|R)(111|R)(113|*)(113|C)(114|I)(116|A)(116|H) \Processed=(1|21|PEFF:0001021|signal peptide)(22|116|PEFF:0001020|mature protein)
MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRASQSVSSSYLTWYQQK
PGQAPRLLIYGASTRATSIPARFSGSGSGTDFTLTISSLQPEDFAVYYCQQDYNLP
I want to extract only those sequences that have SV=2 in the header.
Could anyone help me extract the data?
Any help would be much appreciated.
Thank you for your time
First make a list of the headers of the sequences you want to extract, e.g. using grep, and then follow the approach described in "Extract fasta sequences from a file using a list in another file".
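A minimal Python sketch of that two-step idea, in case grep is not at hand: first collect the IDs of the headers that contain the search text, then pull out the records with those IDs. The function names and the tiny demo records are made up for illustration. The search text `\SV=2 ` (with a trailing space, so that something like `SV=20` is not picked up) assumes the `\SV=` field is never the last one on the header line, as in the two example headers above.

```python
import io


def header_ids_matching(handle, needle):
    """Step 1: collect the ID (first word after '>') of each header containing needle."""
    ids = []
    for line in handle:
        if line.startswith(">") and needle in line:
            fields = line[1:].split()
            if fields:  # skip a bare '>' line with no ID
                ids.append(fields[0])
    return ids


def extract_by_ids(handle, wanted_ids):
    """Step 2: yield every line of each record whose ID is in wanted_ids."""
    wanted = set(wanted_ids)
    keep = False
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


# Hypothetical two-record input, shortened from the style of the headers above.
demo = (
    ">nxp:NX_TEST1-1 \\PName=Example one \\SV=1 \\EV=58\n"
    "MMKFKPNQTR\n"
    ">nxp:NX_TEST2-1 \\PName=Example two \\SV=2 \\EV=45\n"
    "MEAPAQLLFL\n"
    "LLLWLPDTTR\n"
)

ids = header_ids_matching(io.StringIO(demo), "\\SV=2 ")
selected = "".join(extract_by_ids(io.StringIO(demo), ids))
print(selected, end="")
```

For a real file, replace the `io.StringIO(demo)` handles with `open(...)` on your own FASTA file and write `selected` to an output file.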
Please try at least something and then provide code and errors/shortcomings if you get stuck. Biostars is generally not a code-writing service.
I have a Python script,
get_seq_from_multiFASTA_with_match_in_description.py
that gets the first sequence whose description line contains a match. The page here describes that script and points to a Jupyter notebook containing a demo of it. You can run the demo by going there and pressing the launch binder
button and then selecting 'Demo of script to get sequence from multiFASTA file when description contains matching text' from the list of available notebooks. The particular notebook can be viewed statically, nicely displayed here. That script could probably be altered to continue on and collect all the instances with a match in the description. You can probably use as a basis for extending the script another of my scripts,
remove_seq_from_multiFASTA_with_match_in_description.py
described on the page here and available here. Also of note: because your current pattern contains an equals sign, I think it won't work on the command line, and you'll want to use the method outlined under 'Use script in a Jupyter notebook to collect sequences from a series of PacBio-sequenced genomes' in the demo page. That way it will easily take the string
SV=2
for the search. I didn't build in allowing regular expressions yet; however, I left guidance, meant for myself, on how that extension could probably be implemented.
You may also have a look at SEDA (https://www.sing-group.org/seda/), which includes a "Pattern filtering" operation to filter sequences based on regular expression patterns (https://www.sing-group.org/seda/manual/operations.html#pattern-filtering).
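For the regular-expression route, here is a rough standalone sketch in plain Python. It is not SEDA and not either of the scripts mentioned above; `filter_fasta` and the demo records are hypothetical. It makes one pass over the file and collects every record whose header matches, not only the first one. The pattern `\\SV=2(?!\d)` is an assumption about what "SV=2" should mean: a literal backslash, then `SV=2`, not followed by another digit, so `\SV=21` is skipped while `\SV=2` at the very end of a header still counts.

```python
import io
import re


def filter_fasta(handle, pattern):
    """Yield (header, sequence) for every record whose header matches pattern."""
    regex = re.compile(pattern)
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if its header matched.
            if header is not None and regex.search(header):
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the last record in the file, if it matched.
    if header is not None and regex.search(header):
        yield header, "".join(chunks)


# Literal backslash, 'SV=2', and no further digit directly after it.
SV2 = r"\\SV=2(?!\d)"

# Hypothetical three-record input for illustration only.
demo = io.StringIO(
    ">a \\GName=X \\SV=2 \\EV=1\nMKV\nLLA\n"
    ">b \\GName=Y \\SV=21 \\EV=1\nMEA\n"
    ">c \\GName=Z \\SV=2\nMTT\n"
)

hits = list(filter_fasta(demo, SV2))
for header, seq in hits:
    print(header)
    print(seq)
```

Note that this sketch joins each sequence onto a single line; if you need the original line wrapping, keep the lines in a list and write them back out unchanged.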