I have a proteomics data output file with a Fasta headers column. Each cell contains multiple FASTA IDs.
How can I extract information from the Fasta headers column and make new columns for Gene IDs and Gene names?
Here is an example of the rows:
dput(FASTA)
structure(list(`Fasta headers` = c("tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1",
"tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS=",
"sp|P05310|PSAA_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum OX=3888 GN=psaA PE=1 SV=2;tr|A0A385JEC4|A0A385JEC4_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum subsp. sativum OX=208194 GN=psaA PE=3 SV=1;tr|A0A2S1CE",
"tr|D5MAI6|D5MAI6_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum OX=3888 GN=ndhH PE=3 SV=1;tr|A0A385JEU5|A0A385JEU5_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum subsp. sativum OX=208194 GN=ndhH P",
"sp|P08214|ATPF_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|D5MAK1|D5MAK1_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|A0A8A4JI82|A0A8A4JI82_PEA ATP synthase subunit b, c",
"tr|A0A8A4JH06|A0A8A4JH06_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum subsp. elatius OX=47742 GN=rps18 PE=3 SV=1;sp|P49169|RR18_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum OX=3888 GN=rps18 PE=3 SV=1;tr|A0A385JEK5|A0A385JE"
)), row.names = c(NA, -6L), spec = structure(list(cols = list(
`Fasta headers` = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001ec36e3bca0>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
For example, for the second row, the Gene ID column should contain P94096; P46258; A0A0A1E7Y8; Q7DMB2
and the Gene names column should contain P94096_PEA Actin OS=Pisum sativum; ACT3_PEA Actin-3; A0A0A1E7Y8_PEA Actin isoform 3-1; Q7DMB2_PEA Actin
The data in each row varies. Each gene's information is separated by a semicolon, and the Gene ID sits between the | separators.
Any idea how I can make a new column for the Gene names (separated by semicolons)?
What was your approach? Have you looked into regex?
I am trying, but my script is not giving what I expect.
For example, for row 2 I need something like this
You can try extracting the IDs with a regex, e.g.
str_extract_all(headers, "(?<=[tr|sp]\\|).+?(?=\\|)")
explanation of the regex: https://regexr.com/76i24
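To get both columns at once, one option is to split each cell on the semicolons first and then pull the ID and the name out of every entry. The following is only a sketch under some assumptions: the function name extract_ids_and_names and the output column names gene_id and gene_name are made up for illustration, the name is taken to be everything after the second | up to " OS=", and entries that were cut off before the second | are silently dropped. Note that [tr|sp] in the pattern above is a character class (any one of t, r, |, s, p), so the sketch uses the group (?:tr|sp) instead.

```r
library(stringr)

# Two of the headers from the question (the second one is truncated in the export)
headers <- c(
  "tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1",
  "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS="
)

# Hypothetical helper: returns one row per input cell with the IDs and names
# of all entries in that cell, joined by "; "
extract_ids_and_names <- function(x) {
  # Split every cell into its individual FASTA entries
  entries <- str_split(x, ";")

  # Group 1: text between the two pipes (the ID)
  # Group 2: text after the second pipe, up to " OS=" if it is present
  pattern <- "^(?:tr|sp)\\|([^|]+)\\|(.*?)(?: OS=.*)?$"
  parsed <- lapply(entries, function(e) str_match(e, pattern))

  # Collapse the non-missing matches of one capture group back into one string
  collapse_col <- function(m, col) paste(na.omit(m[, col]), collapse = "; ")

  data.frame(
    gene_id   = vapply(parsed, collapse_col, character(1), col = 2),
    gene_name = vapply(parsed, collapse_col, character(1), col = 3),
    stringsAsFactors = FALSE
  )
}

result <- extract_ids_and_names(headers)
print(result$gene_id)

# With the tibble from the question, the new columns could be attached like this:
# FASTA <- cbind(FASTA, extract_ids_and_names(FASTA[["Fasta headers"]]))
```

If you want to keep the organism in the name (as in the first entry of your expected output), the pattern would need adjusting; the sketch cuts every name at " OS=" for consistency.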