Creating Gene ID and Gene name columns from FASTA IDs
22 months ago
WUSCHEL ▴ 810

I have a proteomics data output file with a Fasta headers column. Each cell contains multiple FASTA headers.

How can I extract information from the Fasta headers column to create new columns for Gene IDs and Gene names?

Here is an example of the rows:

dput(FASTA)

structure(list(`Fasta headers` = c("tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1", 
"tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|A0A0A1E7Y8|A0A0A1E7Y8_PEA Actin isoform 3-1 OS=Pisum sativum OX=3888 PE=2 SV=1;tr|Q7DMB2|Q7DMB2_PEA Actin (Fragment) OS=", 
"sp|P05310|PSAA_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum OX=3888 GN=psaA PE=1 SV=2;tr|A0A385JEC4|A0A385JEC4_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum subsp. sativum OX=208194 GN=psaA PE=3 SV=1;tr|A0A2S1CE", 
"tr|D5MAI6|D5MAI6_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum OX=3888 GN=ndhH PE=3 SV=1;tr|A0A385JEU5|A0A385JEU5_PEA NAD(P)H-quinone oxidoreductase subunit H, chloroplastic OS=Pisum sativum subsp. sativum OX=208194 GN=ndhH P", 
"sp|P08214|ATPF_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|D5MAK1|D5MAK1_PEA ATP synthase subunit b, chloroplastic OS=Pisum sativum OX=3888 GN=atpF PE=3 SV=1;tr|A0A8A4JI82|A0A8A4JI82_PEA ATP synthase subunit b, c", 
"tr|A0A8A4JH06|A0A8A4JH06_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum subsp. elatius OX=47742 GN=rps18 PE=3 SV=1;sp|P49169|RR18_PEA 30S ribosomal protein S18, chloroplastic OS=Pisum sativum OX=3888 GN=rps18 PE=3 SV=1;tr|A0A385JEK5|A0A385JE"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

For example, for the second row, the Gene ID column should contain P94096; P46258; A0A0A1E7Y8; Q7DMB2

and the Gene names column should contain P94096_PEA Actin OS=Pisum sativum; ACT3_PEA Actin-3; A0A0A1E7Y8_PEA Actin isoform 3-1; Q7DMB2_PEA Actin

The data in each row varies. Each gene's information is separated by a semicolon, and the Gene ID sits between the | separators.

Any idea how I can make these new columns, with the values in each cell separated by semicolons?
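The structure described in the question (entries separated by `;`, with the ID between the first and second `|`) can be illustrated with a small base R sketch. This is only a sketch: `parse_cell` is a hypothetical helper name, and the sample string is a shortened version of the second row.

```r
# Hypothetical helper: split one "Fasta headers" cell into its entries and
# return the three "|"-separated fields of each entry.
parse_cell <- function(cell) {
  entries <- strsplit(cell, ";", fixed = TRUE)[[1]]
  parts <- strsplit(entries, "|", fixed = TRUE)
  # Keep only entries with all three fields; truncated trailing
  # entries may be missing some of them.
  parts <- parts[lengths(parts) >= 3]
  data.frame(
    db = vapply(parts, `[`, character(1), 1),
    accession = vapply(parts, `[`, character(1), 2),
    description = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Shortened from row 2 of the example data.
cell <- "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1"
parsed <- parse_cell(cell)
print(parsed$accession)
```

The `accession` field corresponds to what the question calls the Gene ID; the `description` field still carries the `OS=`, `OX=`, and later tags, which would need trimming to get the Gene names.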

programming r data FASTA • 1.1k views

What was your approach? Have you looked into regex?


I am trying; however, my script is not giving what I expect.

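library(tidyverse)  # provides read_csv(), %>%, separate_rows(), separate()

# Read the single-column file, naming the column X1 and dropping the header row
fasta <- read_csv("Data/FASTA headers.csv", col_names = "X1")[-1, ]
fasta <- fasta %>%
  # One FASTA header per row
  separate_rows(X1, sep = ";") %>%
  # Separate by |
  separate(col = X1, into = c("X2", "X3", "X4"), sep = "\\|")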

For example, for row 2 I need something like this: [image not preserved]
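One likely reason the attempt falls short is that, after `separate_rows()`, nothing records which original row each entry came from, so the pieces cannot be collapsed back into one cell per row. A possible fix is to add a row index before splitting and summarise by it afterwards. The sketch below assumes a tibble with a `Fasta headers` column as in the `dput` output; the names `row_id`, `db`, `gene_id`, and `gene_name` are illustrative, and it assumes that everything from ` OS=` onwards should be dropped from the gene name.

```r
library(dplyr)
library(tidyr)

# Two sample cells shortened from the question's data.
fasta <- tibble(`Fasta headers` = c(
  "tr|A0A068LJH6|A0A068LJH6_PEA 5-methyltetrahydropteroyltriglutamate--homocysteine S-methyltransferase OS=Pisum sativum OX=3888 PE=2 SV=1",
  "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1"
))

result <- fasta %>%
  mutate(row_id = row_number()) %>%              # remember the original row
  separate_rows(`Fasta headers`, sep = ";") %>%  # one header per row
  separate(`Fasta headers`, into = c("db", "gene_id", "gene_name"),
           sep = "\\|", extra = "merge", fill = "right") %>%
  # Assumption: drop " OS=..." and everything after it
  mutate(gene_name = sub(" OS=.*$", "", gene_name)) %>%
  group_by(row_id) %>%
  summarise(
    `Gene ID` = paste(gene_id, collapse = ";"),
    `Gene names` = paste(gene_name, collapse = ";")
  )

print(result)
```

Entries that are truncated mid-header, as at the end of some cells in the example data, would produce incomplete IDs or `NA` names here and may need to be filtered out first.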


You can try extracting with a regex, e.g. str_extract_all(headers, "(?<=(?:tr|sp)\\|).+?(?=\\|)")

explanation of the regex: https://regexr.com/76i24
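Building on the regex suggestion, both columns could be produced with `stringr` roughly as follows. This is a sketch under assumptions: the ID pattern uses a grouped alternation `(?:tr|sp)` in the lookbehind, the gene-name pattern is an addition that assumes the name runs from the second `|` up to ` OS=`, and the sample headers are shortened from the question's data.

```r
library(stringr)

# Sample cells shortened from rows 2 and 3 of the question's data.
headers <- c(
  "tr|P94096|P94096_PEA Actin OS=Pisum sativum OX=3888 GN=PEAc9 PE=2 SV=1;sp|P46258|ACT3_PEA Actin-3 OS=Pisum sativum OX=3888 PE=2 SV=1",
  "sp|P05310|PSAA_PEA Photosystem I P700 chlorophyll a apoprotein A1 OS=Pisum sativum OX=3888 GN=psaA PE=1 SV=2"
)

# Gene ID: text between "tr|" or "sp|" and the next "|".
ids <- str_extract_all(headers, "(?<=(?:tr|sp)\\|)[^|]+(?=\\|)")

# Gene name: text after the second "|" up to " OS=".
# (Assumption: everything from " OS=" onwards is not wanted.)
gene_names <- str_extract_all(headers, "(?<=\\|)[^|;]+?(?= OS=)")

# Collapse the matches of each cell back into one ";"-separated string.
out <- data.frame(
  gene_id = vapply(ids, paste, character(1), collapse = ";"),
  gene_name = vapply(gene_names, paste, character(1), collapse = ";"),
  stringsAsFactors = FALSE
)
print(out)
```

Because the two patterns are matched independently, a header that is cut off before ` OS=` would contribute an ID but no name, so the two columns can get out of step on truncated cells.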
