I'm trying to run velocyto with the run-smartseq2 command on about 1500 bam files from alignment with subread-align.
The problem is that the job takes forever and then fails. It ran for 14 hours before, I think, it exhausted the memory on the compute node (I'm using an HPC; the node has 128 GB RAM).
2021-08-03 08:40:40,913 - DEBUG - Reading /endosome/work/InternalMedicine/s184335/genome.med.nyu.edu/results/external/parklab/2018-05-09-WCMC/allfastq_files/MDS_trimmed_fqfiles/subread_aligned/C2-G12_S322_L006.bam
2021-08-03 08:40:40,999 - DEBUG - Read first 0 million reads
2021-08-03 08:45:20,476 - DEBUG - Counting for batch 69, containing 1 cells and 8673372 reads
2021-08-03 08:52:02,537 - DEBUG - 1110320 reads in repeat masked regions
2021-08-03 08:52:02,538 - DEBUG - 4299289 reads overlapping with features on plus strand
2021-08-03 08:52:02,538 - DEBUG - 4150071 reads overlapping with features on minus strand
2021-08-03 08:52:02,538 - DEBUG - 984169 reads overlapping with features on both strands
2021-08-03 08:54:12,717 - WARNING - The barcode selection mode is off, no cell events will be identified by <80 counts
2021-08-03 08:54:12,718 - WARNING - 0 of the barcodes where without cell
2021-08-03 08:54:15,469 - DEBUG - Reading /endosome/work/InternalMedicine/s184335/genome.med.nyu.edu/results/external/parklab/2018-05-09-WCMC/allfastq_files/MDS_trimmed_fqfiles/subread_aligned/C2-H02_S321_L006.bam
2021-08-03 08:54:15,572 - DEBUG - Read first 0 million reads
slurmstepd: error: get_exit_code task 0 died by signal
The velocyto manual says that running "a typical sample" should take about 6 hours.
I've tried re-running with a subset of about 40 bam files (each bam file is a cell), but the pace seems to be about the same; it hasn't completed at the time of writing and has been running for over 3 hours.
Looking at the log file, the vast majority of the time seems to be taken up by counting the reads in the bam files (above output).
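To put a number on this, here is a minimal sketch that measures the time between consecutive "Reading" lines in the log. It assumes the timestamp format shown in the output above; the function name `seconds_per_bam` is made up for illustration, and the bam paths in the example are shortened to file names.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2021-08-03 08:40:40,913 - DEBUG - Reading /path/to/cell.bam
READING = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - DEBUG - Reading (\S+)"
)


def seconds_per_bam(log_lines):
    """Return (bam_path, seconds until the next 'Reading' line) pairs."""
    starts = []
    for line in log_lines:
        match = READING.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            starts.append((match.group(2), stamp))
    # The last file has no following 'Reading' line, so it is skipped.
    return [
        (path, (next_stamp - stamp).total_seconds())
        for (path, stamp), (_, next_stamp) in zip(starts, starts[1:])
    ]


# Lines taken from the log above, with the paths shortened.
example = [
    "2021-08-03 08:40:40,913 - DEBUG - Reading C2-G12_S322_L006.bam",
    "2021-08-03 08:45:20,476 - DEBUG - Counting for batch 69, containing 1 cells and 8673372 reads",
    "2021-08-03 08:54:15,469 - DEBUG - Reading C2-H02_S321_L006.bam",
]
durations = seconds_per_bam(example)
for path, seconds in durations:
    print(f"{path}: {seconds / 60:.1f} min")
```

For the two timestamps above this comes to roughly 13.5 minutes for a single cell, which is consistent with only reaching batch 69 after 14 hours.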
The command I've run is this:
velocyto run-smartseq2 -o test_MDS_RNAvelocity -m ../hg38_rmsk.gtf -e MDS_HSC_RNAvelocity *.bam /endosome/work/InternalMedicine/s184335/genome_folder/alias/hg38/ensembl_gtf/default/hg38.gtf
Has anyone used velocyto for smart-seq2 data and experienced this sort of problem?
Is this amount of time and resources used by velocyto normal? Surely 1589 cells shouldn't take this long to process?
Would there be any way to make it more efficient?
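One workaround I'm considering is to split the bam files into smaller chunks and submit each chunk as its own job. The sketch below only builds the commands, using the same flags as my command above (-o, -m, -e). The chunk size, the `velocyto_chunk_NNN` output folder names and the per-chunk sample IDs are hypothetical, and whether this actually reduces memory use or run time is an assumption I have not verified.

```python
import glob


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_commands(bam_files, gtf, repeat_mask, chunk_size=40,
                   prefix="MDS_HSC_RNAvelocity"):
    """Build one velocyto run-smartseq2 command (as an argument list) per chunk."""
    commands = []
    for n, chunk in enumerate(chunked(sorted(bam_files), chunk_size)):
        commands.append([
            "velocyto", "run-smartseq2",
            "-o", f"velocyto_chunk_{n:03d}",   # hypothetical output folder name
            "-m", repeat_mask,
            "-e", f"{prefix}_chunk_{n:03d}",   # hypothetical per-chunk sample ID
            *chunk,
            gtf,
        ])
    return commands


if __name__ == "__main__":
    gtf_path = ("/endosome/work/InternalMedicine/s184335/genome_folder/"
                "alias/hg38/ensembl_gtf/default/hg38.gtf")
    # Print the commands so they can be pasted into separate job scripts.
    for cmd in build_commands(glob.glob("*.bam"), gtf_path, "../hg38_rmsk.gtf"):
        print(" ".join(cmd))
```

Each chunk would produce its own loom file, so the results would still need to be merged afterwards (for example with loompy), which I have not tried.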
Edit:
The log says that the program will determine the cell barcodes while reading the bam file, which might be the problem, but this is smart-seq2 data and the run-smartseq2 command does not have an option for specifying a barcode set.
2021-08-02 22:28:20,899 - WARNING - Each bam file will be interpreted as a DIFFERENT cell
2021-08-02 22:28:20,900 - DEBUG - Using logic: SmartSeq2
2021-08-02 22:28:20,900 - DEBUG - Cell barcodes will be determined while reading the .bam file
Also later on, the program says that the barcode selection mode is off.
Is there something wrong with the command or an option I'm forgetting to pass?