compare two NCBI ftp tables
0
0
Entering edit mode
3.6 years ago
Debut ▴ 20

Hello everyone, I am a beginner in bioinformatics and I am a bit stuck. I downloaded the FTP summary GenBank table with the "wget" command a week ago. I've also downloaded the same table today, I'd like to compare the rows of the two tables to see if there is a new annotation. I don't want to compare the columns of the table but the rows. it's to see if there are rows that have been added. it's like an update system.

table NCBI • 2.3k views
ADD COMMENT
0
Entering edit mode

We had recommend genome_updater (LINK) tool to you in a previous thread. One benefit of using a tool like this is that it will keep track of changes. You could use

-k: dry-run - do not perform any download or update, but shows number of files to be downloaded or updated

to see the changes or use following options to see changes

-u: Added/Removed assembly accessions
-r: Added/Removed sequence accessions
ADD REPLY
0
Entering edit mode

thank you for your answer , I have to put this command line "-k: dry-run - do not perform any download or update, but shows number of files to be downloaded or updated"? but how will the system know that I'm talking about the two tables whose change I want to see? the name of the file is "assembly_summary_genbank.txt". it is to compare two tables downloaded from the same place but not at the same time. I would like to compare line by line if it is possible and see the lines that have been added.

ADD REPLY
0
Entering edit mode

This was not an answer for your specific question. It was a recommendation that if you want to do this regularly then you should use a tool meant for downloading genome data from NCBI. It would make your life simpler. I moved my answer to a comment.

You could try unix utility diff (google for how to use it) to compare the files. You may not be able to find a pre-written program but one can write something to this comparison using python or suitable language.

ADD REPLY
0
Entering edit mode

Thank you for your answer, I hope to find a solution to this problem. I will see with python.

ADD REPLY
0
Entering edit mode

See this thread: https://unix.stackexchange.com/questions/428419/how-to-write-the-difference-between-two-files-into-a-file

comm -1 -3 file1.txt file2.txt > new.txt

should give you new lines in file2.txt. This is assuming NCBI appends new lines at end of these summary files. If not the input files will need to be sorted.

comm -1 -3 <(sort file1.txt) <(sort file2.txt) > new.txt
ADD REPLY
0
Entering edit mode

Thank you very much for your answers but as I am not sure that NCBI adds the new lines at the end, I tried your second code but I have a file not with the new lines but with all the lines I think.

ADD REPLY
0
Entering edit mode

i have a file not with the new lines but with all the lines I think.

To convince you that should not be happening I tried a couple of things with some files I had sitting around that were downloaded at different times. While the lines are not precisely adding up I am not getting the same number of lines in my final files.

$ diff --suppress-common-lines <(sort assembly_summary_genbank.txt.2 | uniq) <(sort assembly_summary_genbank.txt.3 | uniq) > new.txt

$ comm -1 -3 <(sort assembly_summary_genbank.txt.2 | uniq) <(sort assembly_summary_genbank.txt.3 | uniq) > new1.txt

$ wc -l assembly_summary_genbank.txt* new*.txt 

    941245 assembly_summary_genbank.txt.2
   1017115 assembly_summary_genbank.txt.3
    114763 new1.txt
    167302 new.txt

As I said before writing a proper parsing program is the surefire way to make this work.

Disclaimer: If you are not downloading the report files directly from NCBI FTP site with wget/curl (i.e. if you are saving them from a web browser or something else) then above may or may not work.

ADD REPLY
0
Entering edit mode

I can't find a command that can compare each line of the second file with all lines of the first file. If there are lines that do not exist in the first file, put them in a file that can be called new.txt

ADD REPLY
0
Entering edit mode

Save the following code in a file (say ncbi.py).

import sys

data1 = []
data2 = []
check = {}

file1 = sys.argv[1]
datafile = sys.argv[2]

# create a dictionary for ID's from old file

for line in open(file1):
    if "#" in line:
        continue
    else:
        data1 = line.strip().split(".")
        check[data1[0]] = 1

# check for new id's in new file

i=0 
for line1 in open(datafile):
    if "#" in line1:
        continue
    else:
        data2 = line1.strip().split("\t")
        acc = data2[0].split(".")
        if acc[0] in check:
            #print(f"Already there")
            continue
        else:
            print(line1.strip())
            i += 1

print(f"Above {i} accessions are new")

Then run it as follows

$ python3 ncbi.py previous_assembly_file new_assembly_file

You should see output like this

GCA_000001215.4 PRJNA13812      SAMN02803731            reference genome        7227    7227    Drosophila melanogaster                 latest  Chromosome      Major   Full    2014/08/01      Release 6 plus ISO1 MT  The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics     GCF_000001215.4 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT
GCA_000001405.28        PRJNA31257                      reference genome        9606    9606    Homo sapiens                    latest  Chromosome      Patch   Full    2019/02/28      GRCh38.p13      Genome Reference Consortium   GCF_000001405.39        different       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13
GCA_000001545.3 PRJNA20869      SAMN02981238    ABGA00000000.1  na      9601    9601    Pongo abelii            ISIS 71 latest  Chromosome      Major   Full    2008/11/13      P_pygmaeus_2.0.2        Orangutan Genome Sequencing Consortium        na      na      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/545/GCA_000001545.3_P_pygmaeus_2.0.2
GCA_000001635.9 PRJNA20689                      reference genome        10090   10090   Mus musculus                    latest  Chromosome      Major   Full    2020/06/24      GRCm39  Genome Reference Consortium     GCF_000001635.27      identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.9_GRCm39
GCA_907173305.1 PRJEB40874      SAMEA8703779    CAJSFZ000000000.1       na      35818   35818   Helicobacter pullorum           ERR1543774_bin.1_metaWRAP_v1.1_MAG      latest  Contig  Major   Full    2021/05/16      ERR1543774_bin.1_metaWRAP_v1.1_MAG    EMG     na      na      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/173/305/GCA_907173305.1_ERR1543774_bin.1_metaWRAP_v1.1_MAG       derived from metagenome; genome length too small
GCA_907173315.1 PRJEB40874      SAMEA8703778    CAJSGD000000000.1       na      172733  172733  uncultured Eubacteriales bacterium              ERR1543778_bin.2_metaWRAP_v1.1_MAG      latest  Contig  Major   Full    2021/05/16    ERR1543778_bin.2_metaWRAP_v1.1_MAG      EMG     na      na      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/907/173/315/GCA_907173315.1_ERR1543778_bin.2_metaWRAP_v1.1_MAG       derived from environmental source; derived from metagenome
Above 6 accessions are new

This should print the lines that are different in new file. You can either simply save them to a new file or edit the code above to write to a file.

ADD REPLY
0
Entering edit mode

Thank you very much for the time you took so I could have an answer especially in python. but unfortunately I ran your program and I got several lines (I think all the lines) and at the end I got this message "Above 903832 accessions are new".

ADD REPLY
0
Entering edit mode

Files I have were directly downloaded from NCBI FTP site and that is what my code is based/tested on. You also can't compare RefSeq genomes to GenBank genomes. Their ID's are different. Not sure if you are trying to do that.

ADD REPLY
0
Entering edit mode

Hello, I have this error message please. I would just like to understand the f comes From where please. File "ncbi.py", line 35 print(f "Above{i} accessions are new!") ^ SyntaxError: invalid syntax

ADD REPLY
0
Entering edit mode

Sounds like you are using python v.2.x. f-strings were introduced in python 3.x. They are a way of formatting printed output.

You could replace that last line with print("Above", i ,"accessions are new") to remove that error.

ADD REPLY
0
Entering edit mode

The BIRCH bioinformatics system has an easy to use set of point and click tools for managing a local copy of BLAST databases, including updates. You can see a demonstration in the video Installing BLAST databases on your own computer. Aside from the step by step demo, this video also covers most of the salient points you need to know about what is involved with having local BLAST databases. You can also see the BIRCH documentation pages on Local BLAST Databases.

ADD REPLY
0
Entering edit mode

This question is not about blast. OP is comparing a summary report of sequenced genomes over a time period in Genbank Genomes FTP site.

ADD REPLY

Login before adding your answer.

Traffic: 1451 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6