3.5 years ago
Debut
Hello everyone, I am a beginner in bioinformatics and I am a bit stuck. A week ago I downloaded the GenBank assembly summary table from the FTP site with the "wget" command. I downloaded the same table again today, and I'd like to compare the rows of the two tables to see whether there are new annotations. I don't want to compare the columns of the table, only the rows, to find rows that have been added. It's like an update system.
We had recommended the
genome_updater
(LINK) tool to you in a previous thread. One benefit of using a tool like this is that it will keep track of changes. You could use it to see the changes, or use the following options to see changes
Thank you for your answer. Do I have to use this option: "-k: dry-run - do not perform any download or update, but shows number of files to be downloaded or updated"? But how will the system know which two tables I want to compare? The name of the file is "assembly_summary_genbank.txt". I want to compare two tables downloaded from the same place but not at the same time. I would like to compare them line by line, if possible, and see the lines that have been added.
This was not an answer for your specific question. It was a recommendation that if you want to do this regularly then you should use a tool meant for downloading genome data from NCBI. It would make your life simpler. I moved my answer to a comment.
You could try the Unix utility
diff
(google for how to use it) to compare the files. You may not be able to find a pre-written program, but one can write something to do this comparison using Python or another suitable language.
Thank you for your answer, I hope to find a solution to this problem. I will look into Python.
See this thread: https://unix.stackexchange.com/questions/428419/how-to-write-the-difference-between-two-files-into-a-file
should give you new lines in
file2.txt
. This is assuming NCBI appends new lines at the end of these summary files. If not, the input files will need to be sorted.
Thank you very much for your answers. As I am not sure that NCBI adds the new lines at the end, I tried your second code, but I think the file I get contains all the lines rather than only the new ones.
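If the row order cannot be relied on, one way to make the comparison independent of order is to collect the lines of the old file in a set and keep only those lines of the new file that are not in it. The following is a minimal sketch in Python 3, not the commands discussed above; the function name `new_lines` and the file names in the usage note are hypothetical.

```python
def new_lines(old_path, new_path):
    """Return the lines of new_path that do not occur anywhere in old_path.

    Lines are compared as whole strings, so a row that changed in any
    column is reported as new. The order of lines in either file does
    not matter, so no sorting is needed.
    """
    # Collect every line of the older file for fast membership tests.
    with open(old_path, encoding="utf-8") as old:
        seen = {line.rstrip("\n") for line in old}

    # Keep the lines of the newer file that were never seen before,
    # preserving their order in the newer file.
    result = []
    with open(new_path, encoding="utf-8") as new:
        for line in new:
            line = line.rstrip("\n")
            if line not in seen:
                result.append(line)
    return result
```

For example, `new_lines("summary_last_week.txt", "summary_today.txt")` would return the rows present only in the second file. Because whole lines are compared, a row whose content was updated in place also shows up in the result, which can make the output much larger than the number of truly new entries.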
To convince you that this should not be happening, I tried a couple of things with some files I had sitting around that were downloaded at different times. While the line counts do not add up precisely, I am not getting the same number of lines in my final files.
As I said before, writing a proper parsing program is the surefire way to make this work.
Disclaimer: If you are not downloading the report files directly from the NCBI FTP site with
wget/curl
(i.e. if you are saving them from a web browser or something else) then the above may or may not work.
I can't find a command that compares each line of the second file with all lines of the first file. If there are lines that do not exist in the first file, I would like to put them in a file that could be called new.txt.
Save the following code in a file (say
ncbi.py
). Then run it as follows
You should see output like this
This should print the lines that are different in the new file. You can either simply save them to a new file or edit the code above to write to a file.
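The idea of comparing by accession and writing the result to a file can be sketched as follows. This is not the script referred to above, only a minimal illustration in Python 3 under two assumptions: the accession is the first tab-separated column of each row, and header lines start with `#`. The function names and the file names in the usage note are hypothetical.

```python
def read_accessions(path):
    """Map each accession (assumed: first tab-separated column) to its row."""
    rows = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Skip header/comment lines (assumed to start with '#') and blanks.
            if line.startswith("#") or not line.strip():
                continue
            accession = line.split("\t", 1)[0].strip()
            rows[accession] = line.rstrip("\n")
    return rows


def write_new_rows(old_path, new_path, out_path):
    """Write rows whose accession is absent from old_path; return their count."""
    old_rows = read_accessions(old_path)
    new_rows = read_accessions(new_path)

    # A row counts as new only if its accession never appeared before,
    # regardless of where it sits in either file.
    added = [row for acc, row in new_rows.items() if acc not in old_rows]

    with open(out_path, "w", encoding="utf-8") as out:
        for row in added:
            out.write(row + "\n")
    return len(added)
```

A call such as `write_new_rows("assembly_summary_old.txt", "assembly_summary_genbank.txt", "new.txt")` would write the added rows to `new.txt` and return how many there are. Keying on the accession rather than the whole line means that rows whose other columns were merely updated are not reported as new.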
Thank you very much for the time you took to give me an answer, especially in Python. Unfortunately, when I ran your program I got many lines (I think all the lines), and at the end I got this message: "Above 903832 accessions are new".
The files I have were downloaded directly from the NCBI FTP site, and that is what my code is based on and tested with. You also can't compare RefSeq genomes to GenBank genomes. Their IDs are different. Not sure if you are trying to do that.
Hello, I get this error message. I would just like to understand where the f comes from, please. File "ncbi.py", line 35 print(f "Above{i} accessions are new!") ^ SyntaxError: invalid syntax
Sounds like you are using
python v.2.x
. f-strings
were introduced in Python 3.6
. They are a way of formatting printed output. You could replace that last line with
print("Above", i ,"accessions are new")
to remove that error.
The BIRCH bioinformatics system has an easy-to-use set of point-and-click tools for managing a local copy of BLAST databases, including updates. You can see a demonstration in the video Installing BLAST databases on your own computer. Aside from the step-by-step demo, this video also covers most of the salient points you need to know about what is involved in having local BLAST databases. You can also see the BIRCH documentation pages on Local BLAST Databases.
This question is not about BLAST. The OP is comparing summary reports of sequenced genomes from the GenBank Genomes FTP site that were downloaded at different times.