Adding plasmid sequences to gtf file
6.7 years ago

Hello,

I am trying to align some .fastq files to both the hg19 human genome build AND a couple of plasmid sequences.

I am trying to figure out how I can include these plasmid sequences as "chromosomes" in the .gtf file.

The lines I tried to add look like this (this is the output of tail myGTF.gtf; my plasmids are plas_hsk, plas_hul, and plas_shp):

chrY    unknown stop_codon  59343078    59343080    .   +   .   gene_id "IL9R"; gene_name "IL9R"; p_id "P21953"; transcript_id "NM_002186_1"; tss_id "TSS15302";
chrY    unknown exon    59358329    59359508    .   -   .   gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
chrY    unknown exon    59360007    59360115    .   -   .   gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
chrY    unknown exon    59360501    59360854    .   -   .   gene_id "DDX11L16"; gene_name "DDX11L16"; transcript_id "NR_110561_1"; tss_id "TSS3419";
plas_hsk    AddedGenes  exon    1   12915   .   +   0   gene_id "plas_hsk"; gene_name "plas_hsk"; transcript_id "plas_hsk"; tss_id "plas_hsk";
plas_hul    AddedGenes  exon    1   12262   .   +   0   gene_id "plas_hul"; gene_name "plas_hul"; transcript_id "plas_hul"; tss_id "plas_hul";
plas_shp    AddedGenes  exon    1   11886   .   +   0   gene_id "plas_shp"; gene_name "plas_shp"; transcript_id "plas_shp"; tss_id "plas_shp";[kmuench@smsx10srw-srcf-d15-37 20180223_alignToPlasmidOnly]

Unfortunately, when I run STAR, I get this error message.

Fatal INPUT FILE error, no valid exon lines in the GTF file: /path/to/my/gtf/myGTF.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

This suggests to me that something is wrong with how I formatted my GTF file, but I can't figure out what, as it's just the Illumina GTF file plus a few lines I added to make "chromosomes" representing my plasmids.

BTW here is the STAR command I am using ($gtf points to the path to myGTF.gtf):

STAR --runMode genomeGenerate \
     --genomeDir $myGenomeDir \
     --genomeFastaFiles $hg19 $plasmidFasta \
     --sjdbGTFfile $gtf \
     --sjdbOverhang 100 \
     --genomeSAindexNbases 5 \
     --runThreadN ${SLURM_NPROCS:-1} \
     --readFilesIn ${workingDir}/$Read1 ${workingDir}/$Read2 \
     --outReadsUnmapped Fastx \
     --scoreDelOpen -10000 --scoreInsOpen -10000 \
     --outFileNamePrefix ${workingDir}/${outputFileLoc}/${sample}_

Thanks for your help!

Kristin

RNA-Seq • 4.3k views

I think you're missing a new-line at the end of the GTF file
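
The tail output in the question supports this: the shell prompt appears on the same line as the last plas_shp record, which is what you see when a file does not end in a newline. A minimal sketch for checking and fixing that, assuming the GTF is a plain (uncompressed) text file; the function names are made up for illustration:

```python
import os


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        if fh.tell() == 0:
            return True  # empty file: nothing to fix
        fh.seek(-1, os.SEEK_END)  # look only at the final byte
        return fh.read(1) == b"\n"


def ensure_trailing_newline(path):
    """Append a newline if the file lacks one. Return True if the file changed."""
    if ends_with_newline(path):
        return False
    with open(path, "ab") as fh:
        fh.write(b"\n")
    return True
```

Calling ensure_trailing_newline("myGTF.gtf") only reads the last byte, so it stays quick even on a full hg19 annotation. This fixes the missing newline only; it does not validate anything else about the file.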

In addition to this, you may want to be comprehensive and add a complete entry for your gene, which would include lines for the gene and the transcript, and then one exon entry per exon (several for multi-exon transcripts). Take a look at other full transcripts in your GTF to see how they're recorded.
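
A rough sketch of what generating such an entry could look like, treating each plasmid as a single-exon transcript spanning the whole sequence, as in the lines from the question. The function name is hypothetical. Two assumptions are mine rather than the thread's: the frame column is written as "." because, as I read the GTF convention, frame is only meaningful for CDS features, and the gene and transcript rows are worth including only if the rest of your GTF records them. Fields are joined with tabs, since GTF columns must be tab-separated.

```python
def plasmid_gtf_lines(name, length, source="AddedGenes", strand="+"):
    """Build gene, transcript and exon GTF lines covering a whole plasmid.

    Assumes one single-exon transcript spanning positions 1..length.
    """
    gene_attrs = f'gene_id "{name}"; gene_name "{name}";'
    tx_attrs = f'{gene_attrs} transcript_id "{name}";'
    rows = [
        ("gene", gene_attrs),
        ("transcript", tx_attrs),
        ("exon", tx_attrs),
    ]
    lines = []
    for feature, attrs in rows:
        # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
        fields = [name, source, feature, "1", str(length), ".", strand, ".", attrs]
        lines.append("\t".join(fields))
    return lines
```

For example, plasmid_gtf_lines("plas_hsk", 12915) returns three tab-separated lines; when appending them to the GTF, write each one followed by "\n" so the file still ends with a newline.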

Thanks! I'll make these changes and give it a try/report back.

Did you check this "Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file."?

I thought I had, but it looks like the .fa file had the gene names in upper case. I'll fix it to make them match exactly and report back on the result.
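
A mismatch like this can be caught before running STAR by comparing the sequence names in the FASTA headers with the first column of the GTF. The sketch below assumes uncompressed files and assumes the sequence name is the header text up to the first whitespace; the function names are hypothetical.

```python
def fasta_names(fasta_path):
    """Collect sequence names from FASTA header lines (text after '>' up to whitespace)."""
    names = set()
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_seqnames(gtf_path):
    """Collect the distinct values of the first (seqname) column of a GTF."""
    names = set()
    with open(gtf_path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def find_name_mismatches(fasta_path, gtf_path):
    """Return (GTF names absent from the FASTA, subset that differ only by case)."""
    fa = fasta_names(fasta_path)
    missing = sorted(gtf_seqnames(gtf_path) - fa)
    by_lower = {n.lower(): n for n in fa}
    case_only = {n: by_lower[n.lower()] for n in missing if n.lower() in by_lower}
    return missing, case_only
```

With a FASTA header such as >PLAS_HSK and a GTF seqname plas_hsk, the second return value maps "plas_hsk" to "PLAS_HSK", which points straight at the case difference.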

Update: it now runs, but unfortunately the alignment is taking a lot of time (5 days on the step "sorting Suffix Array chunks and saving them to disk..."). Will keep working on it, thanks!
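
One thing that may be worth checking here, though nobody in the thread confirmed it as the cause: the command sets --genomeSAindexNbases 5 while the genome includes hg19. The STAR manual gives a default of 14 and says that for small genomes the value should be scaled down to min(14, log2(GenomeLength)/2 - 1). A small sketch of that formula follows; the function name is made up, and rounding down to an integer and clamping to at least 1 are my own choices.

```python
import math


def suggested_sa_index_nbases(genome_length):
    """Apply the STAR manual's rule of thumb: min(14, log2(GenomeLength)/2 - 1)."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    value = int(math.log2(genome_length) / 2 - 1)  # round down to an integer
    return max(1, min(14, value))
```

For a genome of roughly 3.1 billion bases (an approximate figure for hg19) this gives 14, while for the three plasmids alone (12915 + 12262 + 11886 = 37063 bases) it gives 6.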
