Hello Community,
I have ATAC-seq data from zebrafish. I have completed the quality control steps and now want to align my reads to the zebrafish reference genome. I am thinking of using the ENCODE pipeline, but it only has the human and mouse genomes built in. Its documentation gives the following instructions for building a genome database for your own genome:
1. You can build your own genome database if your reference genome has one of the following file types.
.fasta.gz
.fa.gz
.fasta.bz2
.fa.bz2
.2bit
2. Get a URL for your reference genome. You may need to upload it to somewhere on the internet.
3. Get a URL for a gzipped blacklist BED file for your genome. If you don't have one then skip this step. An example blacklist for hg38 is here.
4. Find the following lines in scripts/build_genome_data.sh and modify them as follows. Give a good name [YOUR_OWN_GENOME] for your genome. For MITO_CHR_NAME, use the correct mitochondrial chromosome name of your genome (e.g. chrM or MT). For REGEX_BFILT_PEAK_CHR_NAME, a Perl-style regular expression must be used to keep only regular chromosome names in the blacklist-filtered (.bfilt.) peak files. These .bfilt. peak files are considered the final peak output of the pipeline, and the peak BED files for genome browser tracks (.bigBed and .hammock.gz) are converted from them. Chromosome name filtering with REGEX_BFILT_PEAK_CHR_NAME is done even when no blacklist is provided.
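To illustrate what the chromosome name filtering in step 4 does, here is a minimal Python sketch. It is not the pipeline's own code. The pattern is an assumption that the genome uses UCSC-style zebrafish names (chr1 to chr25); with Ensembl-style names (1 to 25, MT) the "chr" prefix would have to be dropped. The function name filter_peaks and the example peaks are made up for illustration. For a pattern this simple, Python's re module behaves the same as a Perl regular expression.

```python
import re

# Hypothetical value: assumes UCSC-style zebrafish names chr1..chr25.
# It deliberately excludes chrM and unplaced scaffolds.
REGEX_BFILT_PEAK_CHR_NAME = r"^chr([1-9]|1[0-9]|2[0-5])$"


def filter_peaks(bed_lines, pattern=REGEX_BFILT_PEAK_CHR_NAME):
    """Keep only BED lines whose first column (chromosome) matches the pattern."""
    chrom_re = re.compile(pattern)
    kept = []
    for line in bed_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        if chrom_re.match(chrom):
            kept.append(line)
    return kept


# Made-up peaks, for illustration only.
example_peaks = [
    "chr1\t100\t200\tpeak1",
    "chr25\t300\t400\tpeak2",
    "chrM\t10\t50\tpeak3",
    "chrUn_scaffold_example\t5\t90\tpeak4",
]

for peak in filter_peaks(example_peaks):
    print(peak)
```

Only the two peaks on chr1 and chr25 are printed. How the pipeline applies the expression internally (for example, whether it anchors the match) should be checked in scripts/build_genome_data.sh and the pipeline documentation.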
I would like to know where I can find the zebrafish reference genome in one of the formats listed above. As I am new to this, I am also not sure how to create the index files.
I would be very thankful if someone could point me to resources and help me with this.
Thank you
It is very useful to work out which files the pipeline you are running uses, what those files contain, and what they are needed for. Blindly providing inputs makes debugging, and even the data analysis itself, very difficult.
You can start by looking on Ensembl for the genome FASTA.
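Once you have a gzipped FASTA, one way to see what it contains is to list its sequence names. That tells you how the mitochondrial chromosome is named and which names a chromosome filter has to match. The following is a minimal sketch using only the Python standard library. The function name fasta_sequence_names and the file name in the usage comment are hypothetical.

```python
import gzip


def fasta_sequence_names(path):
    """Return the sequence names (first word of each '>' header) in a gzipped FASTA."""
    names = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


# Usage (hypothetical file name):
# for name in fasta_sequence_names("zebrafish_genome.fa.gz"):
#     print(name)
```

If the output shows names such as 1, 2, ... and MT, the file follows Ensembl-style naming. Names such as chr1 and chrM indicate UCSC-style naming. The mitochondrial name and the chromosome regular expression you give the pipeline need to match whichever style your file uses.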