I am trying to process single-cell data from a public dataset consisting of over 6000 single-end 51 bp fastq files. Each fastq file represents a single cell, and each read contains a 9-11 bp UMI, leaving ~40 bp for mapping.
I have used UMI-tools to extract the UMI sequence from every read in every file.
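For reference, the extraction step looks something like the command below. This is a minimal sketch: the file name is hypothetical, and the regex assumes the UMI sits at the 5' end of the read and is followed by a fixed anchor sequence (here `GGG` is just a placeholder; the real pattern depends on the library design).

```
# Hypothetical input/output names; 'GGG' is a placeholder anchor after the
# variable-length (9-11 bp) UMI -- adjust the regex to your library structure.
umi_tools extract --extract-method=regex \
    --bc-pattern='(?P<umi_1>.{9,11})(?P<discard_1>GGG)' \
    --stdin=cell_0001.fastq.gz \
    --stdout=cell_0001.extracted.fastq.gz
```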
Is there an efficient way to handle such a large number of files with STAR or kallisto? It feels as though this process would be faster if I had fewer, larger, barcoded fastq files.
What size are the individual files? If you keep the genome index in memory and use multiple threads (see the sketch below), you may be able to wade through these quicker than you imagine.
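A minimal sketch of that approach with STAR's shared-memory genome loading; the index and directory names are assumptions, and the thread count should be adjusted to your machine:

```
# Load the genome into shared memory once, so it is not re-read per cell
STAR --genomeLoad LoadAndExit --genomeDir star_index

# Map each cell's fastq against the shared in-memory index
for fq in cells/*.fastq.gz; do
    STAR --runThreadN 8 \
         --genomeDir star_index \
         --genomeLoad LoadAndKeep \
         --readFilesIn "$fq" \
         --readFilesCommand zcat \
         --outFileNamePrefix "aln/$(basename "$fq" .fastq.gz)."
done

# Release the shared-memory copy when finished
STAR --genomeLoad Remove --genomeDir star_index
```

Runs over different files can also be launched in parallel (e.g. with GNU parallel), since they all share the one loaded index.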
One could always fire up 6000 VMs in the cloud and be done. That is, if money is no object :-)
Thanks for everyone's help! In the end I tried both approaches. Running STAR mapping jobs in parallel worked and was very fast. However, I am now going down the route of merging the fastq files together with artificial barcodes inserted, because I would like to try the zUMIs package, which only handles 4 fastq files at a time. I used the Illumina list of barcodes to insert a unique barcode at the start of every read in each file; a sketch of that step is below.
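A minimal sketch of the barcode-insertion and merging step, assuming a tab-separated file mapping each per-cell fastq path to its assigned Illumina barcode (file names and format are hypothetical, not part of zUMIs itself):

```python
import gzip
import sys

# Hypothetical inputs: a tab-separated map of "fastq_path<TAB>barcode"
# (one line per cell) and an output path for the merged fastq.
map_path, out_path = sys.argv[1], sys.argv[2]

with gzip.open(out_path, "wt") as out:
    for line in open(map_path):
        fq_path, barcode = line.rstrip("\n").split("\t")
        bc_qual = "I" * len(barcode)  # dummy high-quality scores for inserted bases
        with gzip.open(fq_path, "rt") as fq:
            # Walk the fastq four lines at a time, prepending the cell barcode
            # to the sequence and matching dummy qualities to the quality line.
            while True:
                header = fq.readline()
                if not header:
                    break
                seq = fq.readline().rstrip("\n")
                plus = fq.readline()
                qual = fq.readline().rstrip("\n")
                out.write(header)
                out.write(barcode + seq + "\n")
                out.write(plus)
                out.write(bc_qual + qual + "\n")
```

Usage would be along the lines of `python merge_barcodes.py barcode_map.tsv merged.fastq.gz`, after which the barcode positions can be declared in the zUMIs configuration.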