Find a Term in a PubMed Abstract Using Python
9.5 years ago
priyag8179 • 0

I have multiple abstracts in text file format, and I want to find disease terms, for example "lung cancer". I need to check whether the text file contains this term or not, and I want to reuse the same code for other terms as well. Please help.

python genome • 4.3k views

You may want to check this Python notebook, which illustrates how to use Biopython to access Entrez: https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Accessing_Databases.ipynb

9.5 years ago
nterhoeven ▴ 120

I am not familiar with Python, but if you are on a Unix computer, the easiest way would be this:

grep 'lung cancer' filename

This will give you all lines of the file containing "lung cancer". You can check

grep --help

for more info.

The --file=FILE option may be helpful for you. With it you can supply a file containing multiple terms (one per line) and search for all of them at once.

Edit:

Use the -i option to make the search case-insensitive.
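Since the question asked for Python, here is a rough equivalent of the grep approach above (several terms, case-insensitive). It is only a sketch: the function name grep_terms and the sample lines are made up for illustration, and it does plain substring matching rather than grep's regular expressions.

```python
def grep_terms(lines, terms, ignore_case=True):
    """Return the lines that contain at least one of the given terms."""
    if ignore_case:
        terms = [t.lower() for t in terms]
    matches = []
    for line in lines:
        haystack = line.lower() if ignore_case else line
        if any(term in haystack for term in terms):
            matches.append(line.rstrip("\n"))
    return matches


# Example with in-memory lines (hypothetical data). With a real file you could
# pass the open file object instead, and read the terms from a one-per-line file.
sample = ["Lung cancer is common.\n", "IDH1/2 mutations in AML.\n", "No match here.\n"]
for hit in grep_terms(sample, ["lung cancer", "aml"]):
    print(hit)
```

Like grep, this works line by line, so a phrase that is wrapped across two lines will not be found.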

9.5 years ago
13en ▴ 90

Is each abstract a separate text file? If so, you can easily find which files contain your keyword using grep, no Python needed.

Something like:

grep -l "lung cancer" *.txt

should give you a list of all the filenames containing the phrase, provided the abstracts are all in separate text files and those are the only text files in the folder.
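If you would rather stay in Python, the same one-file-per-abstract idea could look roughly like this. Treat it as a sketch: files_containing is a hypothetical helper name, and unlike the grep command it ignores case and also matches a phrase that has been wrapped across a line break.

```python
from pathlib import Path


def files_containing(term, folder=".", pattern="*.txt"):
    """Return sorted names of files in `folder` whose text contains `term`."""
    term = term.lower()
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Collapse whitespace so a phrase wrapped across two lines still matches
        if term in " ".join(text.lower().split()):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    for name in files_containing("lung cancer"):
        print(name)
```

The folder and pattern defaults mirror the *.txt glob in the grep command; change them if your abstracts live elsewhere.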


I have a text file that contains the abstracts. If the term exists, I need to write the matches to an output file; if it does not exist, I need to display a message. I also want to make my code more automated: the user just types keywords, the code searches the text, and the output is displayed based on the result.


Should probably clear up some details:

Is there one file per abstract, or multiple abstracts in a single file?

If there is more than one abstract in a file, how are they separated?

Are you using Linux/Unix or Windows?

(assuming unix, since you didn't just come back with "what's grep?") If you need the grep output in a file, you can just redirect the output like this:

grep -l "lung cancer" *.txt > files_with_term.txt

If you want it to be reusable, I would suggest you take a bit of time to look into shell scripting. It should be a pretty straightforward task that you can pick up with a little googling.


There are multiple abstracts in a single text file.

e.g., inputfile=

1. Asian Pac J Cancer Prev. 2015;16(9):4095-101.

Mutation Analysis of IDH1/2 Genes in Unselected De novo Acute Myeloid Leukaemia
Patients in India - Identification of A Novel IDH2 Mutation.

Raveendran S(1), Sarojam S, Vijay S, Geetha AC, Sreedharan J, Narayanan G,
Sreedharan H.

Author information:
(1)Division of Cancer Research, Regional Cancer Centre, Thiruvananthapuram, India
E-mail : drshariharan@gmail.com.

IDH1/2 mutations which result in alternation in DNA methylation pattern are one
of the most common methylation associated mutations in Acute myeloid leukaemia.
IDH1/2 mutations frequently associated with higher platelet level, normal
cytogentics and NPM1 mutations. Here we analyzed IDH1/2 mutations in 200 newly
diagnosed unselected Indian adult AML patients and investigated their correlation
with clinical, cytogenetic parameters along with cooperating NPM1 mutation. We
detected 5.5% and 4% mutations in IDH1/2 genes, respectively. Except IDH2
c.515_516GG>AA mutation, all the other identified mutations were reported
mutations. Similar to reported c.515G>A mutation, the novel c.515_516GG>AA
mutation replaces 172nd arginine to lysine in the active site of the enzyme. Even
though there was a preponderance of IDH1/2 mutations in NK-AML, cytogenetically
abnormal patients also harboured IDH1/2 mutations. IDH1 mutations showed
significant higher platelet count and NPM1 mutations. IDH2 mutated patients
displayed infrequent NPM1 mutations and lower WBC count. All the NPM1 mutations
in the IDH1/2 mutated cases showed type A mutation. The present data suggest that
IDH1/2 mutations are associated with normal cytogenetics and type A NPM1
mutations in adult Indian AML patients.

PMID: 25987093  [PubMed - in process]
9.5 years ago
13en ▴ 90

Well that makes things a bit more complex. I guess you'd want something along the lines of this (note: this is assuming that, as above, the PMID ends each abstract):

filename = "abstract_filename_goes_here"  # look at argparse to pass the filename from the command line
outputfile = "output_filename.txt"
search_term = "lung cancer"  # again, look into argparse
results = []
abstract_lines = []  # lines of the abstract currently being read

with open(filename, "r") as f:
    for line in f:
        abstract_lines.append(line.strip())
        if line.startswith("PMID"):
            # Join the wrapped lines so a term split across a line break still matches,
            # and lowercase both sides so "Lung cancer" is found as well
            text = " ".join(abstract_lines).lower()
            if search_term.lower() in text:
                results.append(line.rstrip("\n"))  # record the PMID line for abstracts containing the search term
            abstract_lines = []  # reset for the next abstract

with open(outputfile, "w") as out:
    for pmid in results:
        out.write(pmid + "\n")

Please bear in mind that I haven't tested this, it's not been run at all and I'm far from an expert. You should expect that it will probably not work exactly as is, and you may need to change a few things!
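For the "user just types the keyword" part, something along these lines might work with argparse. Again, this is only a sketch under the same assumption that a line starting with PMID ends each abstract; find_pmids, main, and the argument names are all made up for illustration.

```python
import argparse


def find_pmids(lines, term):
    """Return the PMID lines of abstracts whose text contains `term`.

    Assumes a line starting with "PMID" ends each abstract.
    """
    term = term.lower()
    hits, current = [], []
    for line in lines:
        current.append(line.strip())
        if line.startswith("PMID"):
            # Join wrapped lines so a phrase split across a line break still matches
            if term in " ".join(current).lower():
                hits.append(line.strip())
            current = []  # start collecting the next abstract
    return hits


def main(argv=None):
    parser = argparse.ArgumentParser(description="Search PubMed abstracts for a term.")
    parser.add_argument("abstracts", help="text file containing the abstracts")
    parser.add_argument("term", help='search term, e.g. "lung cancer"')
    parser.add_argument("-o", "--output", default="output_filename.txt",
                        help="file to write the matching PMID lines to")
    args = parser.parse_args(argv)

    with open(args.abstracts, "r") as f:
        hits = find_pmids(f, args.term)

    if hits:
        with open(args.output, "w") as out:
            out.write("\n".join(hits) + "\n")
        print(f"{len(hits)} abstract(s) mention '{args.term}'; PMID lines written to {args.output}")
    else:
        print(f"'{args.term}' was not found in {args.abstracts}")
    return hits

# To use this as a script, call main() at the bottom of the file
# (typically under an `if __name__ == "__main__":` guard).
```

Saved under a hypothetical name such as find_term.py with that main() call added, it could be run as: python find_term.py abstracts.txt "lung cancer" -o hits.txt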

edit: I'd meant to add this as a comment, but clearly clicked the wrong button! Since it kind of is a new answer I guess it can stay here, and I've reached the post limit so I can't move it myself. If any mods feel I've made a mistake though, feel free to sort it out however you see fit.
