Question

Creating Gene ID and Gene name columns from FASTA IDs

0

2.1 years ago

WUSCHEL ▴ 850

I have a proteomics data output file with a Fasta headers column. Each cell has multiple FASTA IDs.

How can I extract the information from the Fasta headers column and make new columns for Gene IDs and Gene names?

Here is an example of the rows:

dput(FASTA)

structure(list(`Fasta headers` = c("tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1", 
"tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS=", 
"sp|P05310|PSAA_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum OX=3888 GN=psaA PE=1 SV=2;tr|A0A385JEC4|A0A385JEC4_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum subsp. sativum OX=208194 GN=psaA PE=3 SV=1;tr|A0A2S1CE", 
"tr|D5MAI6|D5MAI6_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum OX=3888 GN=ndhH PE=3 SV=1;tr|A0A385JEU5|A0A385JEU5_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum subsp. sativum OX=208194 GN=ndhH P", 
"sp|P08214|ATPF_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|D5MAK1|D5MAK1_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|A0A8A4JI82|A0A8A4JI82_PEA ATP synthase subunit b, c", 
"tr|A0A8A4JH06|A0A8A4JH06_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum subsp. elatius OX=47742 GN=rps18 PE=3 SV=1;sp|P49169|RR18_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum OX=3888 GN=rps18 PE=3 SV=1;tr|A0A385JEK5|A0A385JE"
)), row.names = c(NA, -6L), spec = structure(list(cols = list(
    `Fasta headers` = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

For example, for the second row, the Gene ID column should contain: P94096; P46258; A0A0A1E7Y8; Q7DMB2

The Gene names column should contain: P94096_PEA Actin OS=Pisum sativum; ACT3_PEA Actin-3; A0A0A1E7Y8_PEA Actin isoform 3-1; Q7DMB2_PEA Actin

The data in each row varies. Each gene's information is separated by a semicolon, and the Gene ID sits between the | separators.

Any idea how I can make a new column for Gene names (separated by semicolons)?

programming r data FASTA • 1.2k views

ADD COMMENT • link updated 2.1 years ago by barslmn ★ 2.3k • written 2.1 years ago by WUSCHEL ▴ 850

1

What was your approach? Have you looked into regex?

ADD REPLY • link 2.1 years ago by barslmn ★ 2.3k

0

I am trying, but my script is not giving what I expect.

library(tidyverse)  # provides read_csv(), %>%, separate_rows() and separate()

fasta <- read_csv("Data/FASTA headers.csv", col_names = c("X1"))[-1, ]
fasta <- fasta %>%
  separate_rows(X1, sep = ";") %>%
  # Separate by |
  separate(col = X1, into = c("X2", "X3", "X4"), sep = "\\|")

For example, for row 2 I need something like this (image not available).
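One way this approach could be finished, as a minimal sketch rather than a definitive solution: tag each original row before splitting, split on ";" and "|", then collapse the pieces back to one row per original cell. The column names `row_id`, `db`, `gene_id` and `rest` are made up for illustration, and the sketch assumes the gene name is everything before " OS=" (the desired output in the question is not fully consistent on that point).

```r
library(dplyr)
library(tidyr)

# Rows 1 and 2 from the dput() output in the question (truncation kept as-is)
fasta <- tibble(
  `Fasta headers` = c(
    "tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1",
    "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS="
  )
)

result <- fasta %>%
  # Remember which original row each header came from
  mutate(row_id = row_number()) %>%
  # One header per row
  separate_rows(`Fasta headers`, sep = ";") %>%
  # Split "db|accession|rest"; fill = "right" pads truncated headers with NA
  separate(`Fasta headers`, into = c("db", "gene_id", "rest"),
           sep = "\\|", extra = "merge", fill = "right") %>%
  # Assumption: the gene name is everything before " OS="
  mutate(gene_name = sub(" OS=.*$", "", rest)) %>%
  # Collapse back to one row per original cell
  group_by(row_id) %>%
  summarise(
    `Gene ID`    = paste(gene_id, collapse = "; "),
    `Gene names` = paste(gene_name, collapse = "; ")
  )

result$`Gene ID`[2]
# "P94096; P46258; A0A0A1E7Y8; Q7DMB2"
result$`Gene names`[2]
# "P94096_PEA Actin; ACT3_PEA Actin-3; A0A0A1E7Y8_PEA Actin isoform 3-1; Q7DMB2_PEA Actin (Fragment)"
```

Headers that are cut off before the second "|" (for example "tr|A0A2S1CE" in row 3) would get NA in `rest`, so they may need filtering before the `summarise()` step.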

ADD REPLY • link 2.1 years ago by WUSCHEL ▴ 850

2

You can try extracting with a regex, e.g. str_extract_all(headers, "(?<=[tr|sp]\\|).+?(?=\\|)")

explanation of the regex: https://regexr.com/76i24
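Note that `[tr|sp]` is a character class (any one of t, r, |, s, p), not the alternation "tr or sp". It still works on these headers because the lookbehind only needs the last character before the pipe. Here is a rough sketch of a variant with an explicit alternation and a capture group, collapsed back to one string per cell. The `headers` vector holds rows 1 and 2 from the question, and `gene_ids` is just an illustrative name.

```r
library(stringr)

# Rows 1 and 2 from the dput() output in the question (truncation kept as-is)
headers <- c(
  "tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1",
  "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS="
)

# "tr|" or "sp|", then capture everything up to the next "|"
matches <- str_match_all(headers, "(?:tr|sp)\\|([^|;]+)\\|")

# Column 2 of each match matrix holds the captured accession;
# paste them into one "; "-separated string per original cell
gene_ids <- vapply(matches, function(m) paste(m[, 2], collapse = "; "),
                   character(1))

gene_ids
# "A0A068LJH6"
# "P94096; P46258; A0A0A1E7Y8; Q7DMB2"
```

Because the pattern requires a closing "|", accessions in headers truncated before the second pipe are skipped rather than returned as partial IDs.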

ADD REPLY • link 2.1 years ago by barslmn ★ 2.3k