Hello,
I'm creating an alternate reference genome by feeding a VCF file and a reference genome into GATK's faster alternate reference maker. The whole process takes longer than I wanted, and I realized this is because it is also making changes in repeat regions and other places that are not transcripts. I then took the intervals from the GTF file and used the -L option to focus only on transcript regions. However, this outputs a file with only the transcripts, not the standard genome.fa layout where each chromosome appears once under its own header, ">10" for example. Now there are multiple ">10" records, and I think each one is a single transcript. This might be a trivial question; I'm new to this and wanted to ask for advice. I know I can post this on the GATK website, but I also wanted to see whether there are alternative tools or platforms that might be faster. Is there a way to make GATK's faster alternate reference maker run faster and still make the changes? Or is there another format I can use? Also, is there a way to tell it to focus only on transcripts but still output a normal whole-genome faster format file?
Thank you very much in advance!
Surely you mean GATK FastaAlternateReferenceMaker?
What is the exact command you're using right now? GATK lists its parallelization options on the common CommandLineTools page.
Thank you for responding!!
I'm using the following command:
I made the file VCF.intervals using a Python script I wrote that takes the coordinates from the GTF file and sorts them in ascending order. I also wrote the file in the format described on their website, linked here: GATK Interval File Format
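For readers who want to reproduce this step, here is a minimal sketch of such a script. It is not the poster's actual code: the function name `gtf_to_intervals` and the choice to merge overlapping or adjacent transcripts are assumptions made for illustration. It relies only on the GTF layout (tab-separated, feature type in column 3, 1-based inclusive start and end in columns 4 and 5) and writes `chrom:start-end` strings. Contigs are kept in the order they first appear in the GTF, which may need adjusting to match the reference.

```python
from collections import defaultdict


def gtf_to_intervals(gtf_lines, feature="transcript"):
    """Return sorted, merged 'chrom:start-end' strings for one GTF feature type."""
    by_chrom = defaultdict(list)
    for line in gtf_lines:
        # Skip blank lines and GTF comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue
        # GTF coordinates are 1-based and inclusive, so they are copied unchanged.
        by_chrom[fields[0]].append((int(fields[3]), int(fields[4])))

    intervals = []
    for chrom, spans in by_chrom.items():  # contigs in first-seen order
        merged = []
        for start, end in sorted(spans):
            # Merge intervals that overlap or touch the previous one.
            if merged and start <= merged[-1][1] + 1:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        intervals.extend(f"{chrom}:{s}-{e}" for s, e in merged)
    return intervals
```

Writing the result with `"\n".join(gtf_to_intervals(open_gtf_handle))` gives one interval per line. Merging first keeps the list short when many transcripts of the same gene overlap.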
I used the B format GATK-style .list or .intervals and here is a snippet of what my VCF.intervals file looks like:
By the way, in my file each interval is on its own line with no blank lines in between; I don't know why the forum displays them next to each other separated by a single space. I know I listed chromosome 10 multiple times, but I don't know of a way to name it once and give all of its interval ranges on one line without telling GATK to go through the whole genome again, which would take the same amount of time as before. The problem is that my output file is vastly different from the one I get without intervals: there are multiple chromosome 10 records, and I think those are just the transcript ranges rather than the full genome it should output.
Thank you and please help!!
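One possible workaround, offered as an untested idea rather than a GATK recommendation: restrict the VCF itself to the transcript intervals first, then run the tool on the whole genome without -L, so that every chromosome is still written as a single record but only variants inside transcripts are applied. Whether this is actually faster would need to be measured. The sketch below is plain Python with hypothetical helper names (`parse_intervals`, `filter_vcf`). It assumes an uncompressed VCF, intervals in `chrom:start-end` form that do not overlap within a contig, and it tests only each record's POS, not the full span of a deletion.

```python
import bisect


def parse_intervals(lines):
    """Parse 'chrom:start-end' lines into {chrom: sorted list of (start, end)}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, span = line.rsplit(":", 1)
        start, end = span.split("-")
        table.setdefault(chrom, []).append((int(start), int(end)))
    for spans in table.values():
        spans.sort()
    return table


def filter_vcf(vcf_lines, table):
    """Yield header lines plus records whose POS lies inside an interval.

    Assumes the intervals of each contig do not overlap one another.
    """
    starts = {chrom: [s for s, _ in spans] for chrom, spans in table.items()}
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all meta-information and the column header
            continue
        chrom, pos = line.split("\t", 2)[:2]
        spans = table.get(chrom)
        if not spans:
            continue
        pos = int(pos)
        # Find the last interval starting at or before POS, then check its end.
        i = bisect.bisect_right(starts[chrom], pos) - 1
        if i >= 0 and spans[i][1] >= pos:
            yield line
```

The filtered lines can be written to a new VCF file and passed to the tool in place of the original.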
Please use the code formatting option to present your post better. I've done it for you this time.
Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. Thank you!