Hello!
I am trying to do a computational biology project for my school’s science fair and I want to download raw sequencing reads from the SRA database. However, these reads are a lot larger than I thought, and I’m worried about what will happen if my computer runs out of space. How many samples should I have in general for each independent variable? The study has 99 different samples. Should I just buy an external hard drive or something and download all of the files, or could I use only, perhaps, 50 of the 99 samples?
For context, I have a MacBook Pro with 1 TB of storage, and this is the data that I want to use for my project: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA879084
I’m new to computational biology, so any suggestions would be greatly appreciated!
Another problem, even if you buy a big external SSD (not a hard disk), is that your computer likely does not have enough RAM to align the sequences to the genome. From memory, an aligner like STAR can use over 32 GB of RAM; HISAT2 is likely more efficient. Another, more resource-efficient route would be to look at kallisto or Salmon (both on GitHub). But getting the count table, as ATpoint says, is likely the best option.
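If you do go the raw-reads route, it may help to do the disk arithmetic before downloading anything. Here is a rough Python sketch of that check. The per-sample size (5 GB) and the 2x overhead factor for intermediate files are placeholder assumptions of mine, not numbers from this study, so swap in the sizes shown in the SRA Run Selector for your runs. The function names are made up for illustration.

```python
import shutil


def required_gb(n_samples, gb_per_sample, overhead=2.0):
    """Rough disk need in GB: download size times an assumed overhead
    factor for intermediate files (e.g. extracted FASTQ, temp files)."""
    return n_samples * gb_per_sample * overhead


def fits_on_disk(n_samples, gb_per_sample, free_gb, overhead=2.0):
    """True if the estimated need is within the free space given."""
    return required_gb(n_samples, gb_per_sample, overhead) <= free_gb


if __name__ == "__main__":
    # Free space on the volume holding "/" (point this at your external
    # drive's mount path instead if you buy one).
    free_gb = shutil.disk_usage("/").free / 1e9

    gb_per_sample = 5.0  # ASSUMPTION: placeholder, check the real run sizes
    for n in (50, 99):
        need = required_gb(n, gb_per_sample)
        verdict = "fits" if fits_on_disk(n, gb_per_sample, free_gb) else "does not fit"
        print(f"{n} samples: ~{need:.0f} GB needed, {free_gb:.0f} GB free -> {verdict}")
```

This only covers disk space. It says nothing about RAM, which is the bigger constraint for alignment, and none of it is needed if you start from the count table.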