Hello,
I am trying to align some .fastq files to both the hg19 human genome build AND a couple of plasmid sequences.
I am trying to figure out how I can include these plasmid sequences as "chromosomes" in the .gtf file.
The lines I tried to add are the last three below (output of tail myGTF.gtf; my plasmids are plas_hsk, plas_hul, and plas_shp):
chrY unknown stop_codon 59343078 59343080 . + . gene_id "IL9R"; gene_name "IL9R"; p_id "P21953"; transcript_id "NM_002186_1"; tss_id "TSS15302";
chrY unknown exon 59358329 59359508 . - . gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
chrY unknown exon 59360007 59360115 . - . gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
chrY unknown exon 59360501 59360854 . - . gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
plas_hsk AddedGenes exon 1 12915 . + 0 gene_id "plas_hsk"; gene_name "plas_hsk"; transcript_id "plas_hsk"; tss_id "plas_hsk";
plas_hul AddedGenes exon 1 12262 . + 0 gene_id "plas_hul"; gene_name "plas_hul"; transcript_id "plas_hul"; tss_id "plas_hul";
plas_shp AddedGenes exon 1 11886 . + 0 gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";[kmuench@smsx10srw-srcf-d15-37 20180223_alignToPlasmidOnly]
Unfortunately, when I run STAR, I get this error message.
Fatal INPUT FILE error, no valid exon lines in the GTF file: /path/to/my/gtf/myGTF.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
This suggests to me that something is wrong with how I formatted my GTF file, but I can't figure out what. It's just the Illumina GTF file plus a few lines I added to make "chromosomes" representing my plasmids.
BTW here is the STAR command I am using ($gtf points to the path to myGTF.gtf):
STAR --runMode genomeGenerate \
--genomeDir $myGenomeDir \
--genomeFastaFiles $hg19 $plasmidFasta \
--sjdbGTFfile $gtf \
--sjdbOverhang 100 \
--genomeSAindexNbases 5 \
--runThreadN ${SLURM_NPROCS:-1} \
--readFilesIn ${workingDir}/$Read1 ${workingDir}/$Read2 \
--outReadsUnmapped Fastx \
--scoreDelOpen -10000 --scoreInsOpen -10000 \
--outFileNamePrefix ${workingDir}/${outputFileLoc}/${sample}_
Thanks for your help!
Kristin
I think you're missing a new-line at the end of the GTF file
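The shell prompt glued onto the last plas_shp line in your tail output is consistent with a missing trailing newline. A minimal sketch for checking and fixing that, assuming a POSIX shell; the function names are made up for illustration and myGTF.gtf is the file name from your post:

```shell
# Succeed (exit 0) if the file is non-empty and its last byte is a newline.
# Command substitution strips trailing newlines, so "$(tail -c 1 file)" is
# empty exactly when the last byte is "\n".
ends_with_newline() {
    [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}

# Append a newline only when it is missing, so repeated runs are harmless.
ensure_trailing_newline() {
    ends_with_newline "$1" || printf '\n' >> "$1"
}

# Example call (hypothetical, uncomment to use):
# ensure_trailing_newline myGTF.gtf
```

If the newline is already there, the file is left untouched.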
In addition to this, you may want to be comprehensive and add a complete entry for your gene, which would include lines for gene, transcript, and then multiple exon entries (for multi-exon transcripts). Take a look at other full transcripts in your GTF to see how they're recorded.
Thanks! I'll make these changes and give it a try/report back.
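For the suggestion about complete gene/transcript/exon entries, here is a rough sketch of how such lines could be generated for a single-exon plasmid. The function name is hypothetical, the source column reuses "AddedGenes" from the question, and I am assuming the exon spans the whole plasmid on the + strand. The frame column is written as "." (the question used 0), on the assumption that frame only matters for CDS-type features. The tss_id attribute from the question is left out.

```shell
# Print a three-line, tab-delimited GTF record (gene, transcript, exon)
# for a feature covering positions 1..length of one sequence.
# Arguments: sequence name, sequence length.
# The body runs in a subshell "( ... )" so the variables stay local.
plasmid_gtf_record() (
    name=$1
    length=$2
    gene_attrs="gene_id \"$name\"; gene_name \"$name\";"
    tx_attrs="$gene_attrs transcript_id \"$name\";"
    printf '%s\tAddedGenes\tgene\t1\t%s\t.\t+\t.\t%s\n' "$name" "$length" "$gene_attrs"
    printf '%s\tAddedGenes\ttranscript\t1\t%s\t.\t+\t.\t%s\n' "$name" "$length" "$tx_attrs"
    printf '%s\tAddedGenes\texon\t1\t%s\t.\t+\t.\t%s\n' "$name" "$length" "$tx_attrs"
)

# Example call (hypothetical): append a record for one plasmid from the question.
# plasmid_gtf_record plas_hsk 12915 >> myGTF.gtf
```

Using printf with explicit \t keeps the columns tab-separated, which is easy to lose when lines are pasted in by hand.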
Did you check this "Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file."?
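One way to run that check is sketched below, assuming a POSIX shell with awk, grep and mktemp available. The function names are made up for illustration. It lists every sequence name that appears in column 1 of the GTF but not as a header in any of the FASTA files. The comparison is exact, so a name that differs only in case will be reported.

```shell
# Sequence names from FASTA headers: the text after ">" up to the first whitespace.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@" | sort -u
}

# Sequence names from column 1 of a tab-delimited GTF, skipping comments and blank lines.
gtf_names() {
    awk -F'\t' '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# Names used in the GTF that have no exact match among the FASTA headers.
# Arguments: the GTF file, then one or more FASTA files.
# The body runs in a subshell "( ... )" so the variables stay local.
gtf_names_missing_from_fasta() (
    gtf=$1
    shift
    known=$(mktemp)
    fasta_names "$@" > "$known"
    # -x whole-line match, -F fixed strings, -v invert: keep names not in $known.
    gtf_names "$gtf" | grep -v -x -F -f "$known" || true
    rm -f "$known"
)

# Example call (hypothetical, using the variables from the question):
# gtf_names_missing_from_fasta "$gtf" "$hg19" "$plasmidFasta"
```

Empty output means every GTF sequence name has a matching FASTA header.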
I thought I had, but it looks like the .fa file had the gene names in upper case. I'll fix it to make them match exactly and report back on the result.
Update: it now runs, but unfortunately the alignment is taking a lot of time (5 days on the step "sorting Suffix Array chunks and saving them to disk..."). Will keep working on it, thanks!
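A side note on the command above, which may or may not be related to the slow step: it sets --genomeSAindexNbases 5 while also including hg19. The STAR manual gives 14 as the default and suggests scaling it down for small genomes to min(14, log2(GenomeLength)/2 - 1). A small sketch of that formula follows. The function name is hypothetical, the result is truncated to an integer, and the genome length (total bases across all FASTA files) has to be supplied by you.

```shell
# Suggested --genomeSAindexNbases for a genome of the given total length,
# following min(14, log2(GenomeLength)/2 - 1) and truncating to an integer.
suggested_sa_index_nbases() {
    awk -v n="$1" 'BEGIN {
        v = int(log(n) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

# Example calls (hypothetical):
# suggested_sa_index_nbases 37063        # the three plasmids alone: 12915 + 12262 + 11886
# suggested_sa_index_nbases 3000000000   # a genome on the order of 3 Gb
```

For a human-sized genome the formula stays at the default of 14, so a value of 5 would only be expected for a plasmid-only index.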