6.2 years ago
c7750
I want to run continuous integration for my software which depends on genome data from the UCSC genome browser in 2bit format. The file it depends on is 800 MB, which is too large for GitHub. How can I split one of these files to have a manageable size for testing? Is there a way I can split by chromosome or genome position?
What does that mean? I'm asking as someone who is not a software developer.
If your software needs the files, can you not link to the 2bit files directly at UCSC and provide instructions on what people should do with the download?
It means a third party is running my tests whenever I push a change. I can't just link to the file, because a computer is running my program, not a person who could follow download instructions.
You probably want your tests to be over small examples, like a small chromosome, or even just a fragment of a chromosome.
You can split a FASTA file with samtools (among dozens of other options; see "How To Split A Multiple Fasta", which ironically doesn't include a samtools solution) and then convert the small FASTA to 2bit with faToTwoBit.
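If you would rather not install samtools on the CI machine, the splitting step can also be sketched in plain Python. This is only an illustration, not the samtools approach itself: the function name `extract_region` and its parameters are made up for this example, coordinates are assumed to be 0-based and half-open, and the whole selected sequence is held in memory, so it suits a small chromosome or fragment rather than a full genome.

```python
def extract_region(fasta_in, fasta_out, name, start=0, end=None, width=60):
    """Copy one sequence (or a 0-based, half-open slice of it) to a new FASTA file.

    Hypothetical helper for illustration; returns the number of bases written.
    """
    chunks = []
    keep = False   # True while reading lines of the wanted record
    found = False  # True once the wanted header has been seen
    with open(fasta_in) as src:
        for line in src:
            if line.startswith(">"):
                if found:
                    break  # reached the record after the wanted one
                # The sequence name is the first token after '>'
                fields = line[1:].split()
                keep = bool(fields) and fields[0] == name
                found = keep
            elif keep:
                chunks.append(line.strip())
    if not found:
        raise KeyError(f"sequence {name!r} not found in {fasta_in}")

    seq = "".join(chunks)[start:end]
    with open(fasta_out, "w") as dst:
        dst.write(f">{name}\n")
        # Wrap the sequence at a fixed line width
        for i in range(0, len(seq), width):
            dst.write(seq[i:i + width] + "\n")
    return len(seq)
```

For example, `extract_region("genome.fa", "small.fa", "chrM")` would write a single small chromosome to `small.fa` (the file names here are placeholders), which can then be converted with `faToTwoBit small.fa small.2bit` and committed as a test fixture.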