In my quest to build my own mapping from the FASTQ files (reads 1 and 2) given to me by the sequencing intermediary, I have:
checked the quality with FastQC: the individual reads were of very good quality, but I am still worried about the low diversity of the quality scores within the files, and FastQC does not seem to report such a figure (I have not seen it in the report, nor have I found a FastQC option to calculate it)
used a reference genome downloaded from NCBI (GRCh38.p13) to construct a new genome index directory. I was not aware of the GTF/GFF annotation files, which are apparently available as a direct download and might have saved me time (is there a place with more information on them, notably for human genome mapping?).
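On the quality-score diversity mentioned in the first step: one way to get that figure outside of FastQC is to tally the Phred scores directly from the FASTQ stream. The following is a minimal sketch, assuming Phred+33 encoding and strict four-line FASTQ records; the function name qual_hist is only illustrative and not part of any tool.

```shell
# Tally how often each Phred quality score occurs in a FASTQ stream on stdin.
# Assumptions: Phred+33 encoding, exactly four lines per record.
qual_hist() {
  awk '
    BEGIN {
      # Build a character -> ASCII code lookup for the printable range.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {
      # Every fourth line is the quality string.
      n = length($0)
      for (j = 1; j <= n; j++) count[ord[substr($0, j, 1)] - 33]++
    }
    END {
      # Print "score<TAB>occurrences" in ascending score order.
      for (q = 0; q <= 93; q++) if (q in count) print q "\t" count[q]
    }
  '
}

# Usage (not run here):
#   zcat ~/r1.fastq.gz | qual_hist
```

A short output with only a handful of distinct scores is not necessarily a problem in itself, since some recent Illumina instruments report binned quality scores; whether that applies to these files is something to confirm with the sequencing provider.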
I then successfully launched the mapping with STAR, adding the option "--twopassMode Basic", on a machine with 16 CPU threads and 60 GB of RAM. The first pass completed after a couple of hours, but the second pass ran into a memory error; here is the exit message:
Aug 15 04:35:56 ..... started sorting BAM
Max memory needed for sorting = 4537708471
EXITING because of FATAL ERROR: number of bytes expected from the BAM bin does not agree with the actual size on disk: Expected bin size=3846976240 ; size on disk=1328263168 ; bin number=47
Aug 15 04:37:12 ...... FATAL ERROR, exiting
I suppose there was not enough disk space. Once I have added space, is there a way to have STAR pick up the previously generated files and avoid a complete recalculation? I have not seen anything about this in the documentation.
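The gap between the expected bin size (3846976240 bytes) and the size on disk (1328263168 bytes) is consistent with a temporary file that was cut short, which a full disk would explain, although that remains an assumption. A quick way to test it before relaunching is to compare the free space in the output directory against the sizes from the log. This is a small sketch using standard df and awk; the helper name has_free_bytes is hypothetical.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 bytes free.
has_free_bytes() {
  dir=$1
  need=$2
  # df -P gives portable one-line-per-filesystem output; -k reports 1024-byte blocks.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  # Exit status 0 when the available bytes cover the requirement.
  awk -v a="$avail_kb" -v n="$need" 'BEGIN { exit !(a * 1024 >= n) }'
}

# Usage (not run here), with the expected size of bin 47 from the log:
#   has_free_bytes . 3846976240 && echo "enough room for that bin"
```

Note that this checks a single bin only; the sorting step writes many bins, so the total temporary space needed is larger than this one number.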
The command line used for mapping:
sudo nohup STAR \
--runThreadN 16 \
--readFilesIn ~/r1.fastq.gz ~/r2.fastq.gz \
--genomeDir ~/hg38_index \
--outFileNamePrefix polly \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard \
--twopassMode Basic
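For the question of reusing earlier results, one possible approach is sketched below. It assumes that the failed run left its first-pass junctions in polly_STARpass1/SJ.out.tab (the directory STAR normally creates for --twopassMode Basic with this prefix) and that samtools is installed. It re-runs the mapping as a single pass with those junctions supplied through --sjdbFileChrStartEnd, writes an unsorted BAM, and sorts afterwards with samtools. The output prefix polly_rerun_, the thread and memory values for sorting, and the function names are illustrative choices, not taken from the original run. This would save only the first pass; the alignment itself is still repeated.

```shell
# Print commands instead of running them when DRY_RUN is non-empty.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Single-pass remapping that reuses the junctions found in the first pass.
# Assumption: polly_STARpass1/SJ.out.tab exists in the working directory.
remap_with_pass1_junctions() {
  run STAR \
    --runThreadN 16 \
    --readFilesIn "$HOME/r1.fastq.gz" "$HOME/r2.fastq.gz" \
    --genomeDir "$HOME/hg38_index" \
    --outFileNamePrefix polly_rerun_ \
    --readFilesCommand zcat \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outSAMattributes Standard \
    --sjdbFileChrStartEnd polly_STARpass1/SJ.out.tab
  # Sort as a separate step so that a sorting failure does not cost the alignment.
  run samtools sort -@ 8 -m 2G -o polly_rerun_sorted.bam polly_rerun_Aligned.out.bam
}

# Usage (not run here):
#   DRY_RUN=1; remap_with_pass1_junctions   # only print the two commands
#   DRY_RUN=;  remap_with_pass1_junctions   # actually run them
```

Keeping the sort outside STAR means that if disk space runs out again, only the samtools step has to be repeated.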
Wishing you a nice weekend,