Hello everyone,
My question relates to shotgun metagenomics.
I would like to calculate relative abundance based on reads mapped to contigs. Typically, I deduplicate all contigs using cd-hit and treat them as a reference, then map my FASTQ files to these references to create feature counts. However, with nearly 100 samples, my deduplicated references are becoming too large (almost 50GB), which is causing performance issues.
I noticed that the CAT-BAT-RAT suite provides an intelligent set of tools within CAT_pack. However, I’m encountering some issues and would appreciate your assistance.
When running RAT, I get an error related to the reads, stating that the FASTQ headers must match those in the contigs. My contigs were generated by MEGAHIT and are named in the format >k127_10000, without any reference to FASTQ headers. How can I resolve this mismatch?
/CAT_pack reads --mode cr -c contigs.fa -1 R_1.fastq -2 R_2.fastq -d /path/to/20240422_CAT_nr/db/ -t /path/to/200422_CAT_nr/tax/ -o lala -p out.CAT.predicted_proteins.faa -a out.CAT.alignment.diamond --c2c out.CAT.contig2classification.txt <frozen genericpath>:39: RuntimeWarning: bool is used as a file descriptor # CAT_pack v6.0. [2024-10-11 09:41:47] samtools found: samtools 1.21. [2024-10-11 09:41:47] bwa found: Version: 0.7.18-r1243-dirty. RAT is running. Mapping reads against assembly with bwa mem. [2024-10-11 09:41:47] Running bwa mem for read mapping. File fastq.bwamem.sorted will be generated.Do not forget to cite bwa mem and samtools when using RAT in your publication! [2024-10-11 09:41:47] Contigs fasta is already indexed. [2024-10-11 09:41:47] Running bwa mem... ... [main] Version: 0.7.18-r1243-dirty [main] CMD: bwa mem -t 24 [main] Real time: 118.429 sec; CPU: 2701.163 sec [2024-10-11 09:43:52] Sorting bam file... [bam_sort_core] merging from 0 files and 24 in-memory blocks... [2024-10-11 09:44:00] Read mapping done! [2024-10-11 09:44:00] contig2classification file supplied. Processing contig classifications. [2024-10-11 09:44:00] Loading file 20240422_CAT_nr/tax/nodes.dmp. [2024-10-11 09:44:03] Processing mapping file(s). [2024-10-11 09:44:32] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied. [2024-10-11 09:44:32] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences... [2024-10-11 09:44:33] Contigs written! Appending forward reads... Traceback (most recent call last): File "CAT_pack/CAT_pack/./CAT_pack", line 101, in <module> main() ~~~~^^ File "CAT_pack/CAT_pack/./CAT_pack", line 85, in main reads.run() ~~~~~~~~~^^ File "CAT_pack/CAT_pack/reads.py", line 435, in run make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'], ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ uncl_unm_fasta, 'fastq', 'a','_1') ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "CAT_pack/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta fasta_dict[seq])) ~~~~~~~~~~^^^^^ KeyError: 'FT100012261L1C013R00202177739'
It seems like RAT requires CAT.contig2classification.txt to be run first, although the tutorial doesn’t mention this. Is running CAT mandatory before RAT?
- Finally, does the RAT method (mapping reads to contigs) require a re-run of the deduplication step to accurately calculate relative abundance, or can I use the original deduplicated contigs?
Thank you for your help!
Best regards, Nhu