19 months ago
Francois Piumi
Hi, if I understand correctly, a Kraken library only contains human, bacterial, and viral taxa.
I noticed that it was possible to add another genome, as follows:
kraken-build --add-to-library chr1.fa --db $DBNAME
So I downloaded a genome and ran the following command:
kraken2-build --add-to-library Culicoides_sonorensis.Cson1.dna_rm.toplevel.fa --db Kraken2_Standard_Fev2019
Here is the error message:
scan_fasta_file.pl: unable to determine taxonomy ID for sequence scaffold40
Indeed, there isn't any taxonomy information in the FASTA file (header example:
>scaffold40 dna:supercontig supercontig:Cson1:scaffold40:1:766034:1 REF)
So how does Kraken retrieve taxonomy information from a FASTA file? Is there a specific FASTA format I should download?
I rewrote all my sequence IDs according to the manual. kraken2-build accepted them ("Culicoides_sonorensis.fa" was added to the Kraken library "Kraken2_Standard_Fev2019").
But there is no trace of "Culicoides_sonorensis" in the report after analysing a FASTQ file of Culicoides RNA-Seq reads.
It is not clear from the manual whether a description must follow the "sequence16|kraken:taxid|32630" part of the header.
It is also not clear whether sequences must be added one file at a time, as in the manual (chr1.fa, chr2.fa).
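To make the header question concrete, here is a minimal sketch of the kind of rewrite I mean. It appends "|kraken:taxid|<taxid>" to the first word of each FASTA header and leaves the rest of the header in place as a description (I am assuming a trailing description is tolerated). The function name add_kraken_taxid is hypothetical and not part of Kraken, and 32630 is only the example taxid from the manual; the real NCBI taxonomy ID of the organism would go there instead.

```shell
# Append "|kraken:taxid|<TAXID>" to the sequence ID of every FASTA header.
# add_kraken_taxid is a hypothetical helper name, not a Kraken tool.
# Reads FASTA on stdin, writes the rewritten FASTA to stdout.
add_kraken_taxid() {
    local taxid="$1"   # NCBI taxonomy ID to embed in each header
    awk -v taxid="$taxid" '
        /^>/ {
            id = $1                             # ">scaffold40"
            rest = substr($0, length($1) + 1)   # " dna:supercontig ..." (may be empty)
            print id "|kraken:taxid|" taxid rest
            next
        }
        { print }                               # sequence lines pass through unchanged
    '
}

# Example usage (32630 is the placeholder taxid from the manual, file names are illustrative):
# add_kraken_taxid 32630 < input.fa > output.fa
```

With this, the header ">scaffold40 dna:supercontig ... REF" would become ">scaffold40|kraken:taxid|32630 dna:supercontig ... REF", and every scaffold in the multi-FASTA file gets the same taxid.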