How to grep every row for the every value of a column?
1
0
Entering edit mode
3.7 years ago
markgodek ▴ 50

I'm trying to find matching Ensembl IDs between my germline and somatic datasets. I would intersect on chromosome and position, but I want to know if there are somatic AND germline variants in the same gene for each patient.

Germline data is like:

       CHROM      POS REF ALT    AF GNOMAD_AF    GENE          FEATURE  DP AC AN     sample genotype_final                                                    GeneID patient
1:     1   860778   A   G 0.900     0.816  SAMD11 FIVE_PRIME_FLANK  76  2  2 Sample1            1/1 NA, NA, ENSG00000223764, ENSG00000187634, ENSG00000268179  Patient14
2:     1   874327   G   C 0.031 2.462e-05  SAMD11           INTRON 606  1 32 Sample7            0/1                                       NA, ENSG00000187634  Patient18
3:     1   877831   T   C 1.000         1  SAMD11         MISSENSE 336 14 14 Sample3            1/1                  NA, NA, ENSG00000188976, ENSG00000187634  Patient9

Somatic data is like:

CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample1 Sample2 GeneID
3   52440897    .   TC  AA  .   PASS    LongAnnotation  GT:AD:AF:DP:F1R2:F2R1:SB    0/1:74,4:0.068:78:23,2:50,2:48,26,3,1   0/0:94,0:0.012:94:41,0:52,0:52,42,0,0   ENSG00000187634

For every row in the somatic data, I'd like to take the final column (GeneID) and use it to grep the GeneID in the germline data. This solution is working for smaller somatic variant sets, but breaks on variants where there was no germline sample for filtering the somatic variants. The string construction is too large and yields the invalid regular expression error further below.

patient.matcher <- function(germline.dt,somatic.dt, patientID){return(germline.dt[grepl(paste(somatic.dt$GeneID, collapse = '|'), GeneID) & patient==patientID])}

patient.matcher(patients.w.geneIDs.germline, Patient9.somatic.dt, 'Patient9')

Error: invalid regular expression 'ENSG00000188976|ENSG00000187583|ENSG00000188290|E.............

Any help is appreciated, I can't figure out which permutation of apply() and grep() to use to make this work on an arbitrarily large dataset.

RGeneID • 1.2k views
ADD COMMENT
1
Entering edit mode

Is the gene_id always a single value in somatic data?

ADD REPLY
0
Entering edit mode

Adding on this, can you please provide example data and an expected output, the former via dput()?

ADD REPLY
0
Entering edit mode

Thank you for asking this. It got me to think WHY the somatic have a single GeneID when the germline have several - when the GeneID should come from the Combined Annotation Dependent Depletion (CADD) annotation file.

For the somatic variants, I was grepping for a single string that began with "ENSG", when I should be using the same CADD annotations for the germline variants based on CHROM, POS, REF, ALT.

i.e. if I properly annotated them, the GeneID columns between somatic and germline will be identical and I can just intersect on CHROM, POS, REF, ALT, and GeneID.

Thanks again. Just took a single question to get me to walk through it stepwise.

Edit: Actually, thinking about it again - we didn't want to match them on CHROM and POS exactly. We wanted to find the genes which were mutated in somatic AND germline so I might have to think more on this.

ADD REPLY
1
Entering edit mode
3.7 years ago
Shred ★ 1.5k

I would do it in Python3. Build a dictionary indexed by gene_id value (do it only if keys are unique!!) where value is the whole line of the somatic file, then parse the germline file to retrieve output. Maybe some tricking would be required to delete extraspace or comma. Take this mainly as an approach and not like a production-ready solution.

gene_id_somatic_list={}
with open('somatic_data', 'r') as nk:
    for line in nk:
        gene_id=line.split('\t')[11]
        gene_id_somatic_list[gene_id]=line
with open('germline_data', 'r') as lk:
    for line in lk:
        gene_list=[]
        gene_list.append(line.split('\t')[11])
        for k in gene_list:
            try:
                print(gene_id_somatic_list[k])
            except KeyError:
                continue
ADD COMMENT

Login before adding your answer.

Traffic: 1880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6