I am very new to microanalysis. I have 6 fastq files of human microRNA. the first few lines of the fastq files which is of about 25 nucleotide length.I have trimmed the reads using bbduk. The trimmed reads have been aligned using custom bowtie scripts.
The human genome used to build the reference is a bowtie inbuilt reference H. sapiens, GRCh38 + major SNVs. The custom made scripts used for bowtie are
module load Bowtie2/2.4.2-GCCcore-10.2.0 ; bowtie2 -x grch38_1kgmaj -U 3kPa_HDF_miRNA_1bbduk.fastq -S 3kPa_HDF_miRNA_2bbduk.sam
for one representative FASTq sequence.The reads have been converted and sorted by position using scripts given in bowtie website
samtools view -bS 3kPa_HDF_miRNA_2bbduk.sam > 3kPa_HDF_miRNA_2bbduk.bam
then samtools sort 3kPa_HDF_miRNA_2bbduk.bam
to sort the bam file. The first few lines of the samfile looks like this
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
After alignment, I am using the miRNA part of the GR38.gtf file to annotate whose first few lines look like this.
chr1 . miRNA_primary_transcript 17369 17436 . - . ID "MI0022705"; Alias "MI0022705"; Name "hsa-mir-6859-1
chr1 . miRNA 17409 17431 . - . ID "MIMAT0027618"; Alias "MIMAT0027618"; Name "hsa-miR-6859-5p"; Derives_from "MI0022705
chr1 . miRNA 17369 17391 . - . ID "MIMAT0027619"; Alias "MIMAT0027619"; Name "hsa-miR-6859-3p"; Derives_from "MI0022705
chr1 . miRNA_primary_transcript 30366 30503 . + . ID "MI0006363"; Alias "MI0006363"; Name "hsa-mir-1302-2
chr1 . miRNA 30438 30458 . + . ID "MIMAT0005890"; Alias "MIMAT0005890"; Name "hsa-miR-1302"; Derives_from "MI0006363
chr1 . miRNA_primary_transcript 187891 187958 . - . ID "MI0026420"; Alias "MI0026420"; Name "hsa-mir-6859-2
chr1 . miRNA 187931 187953 . - . ID "MIMAT0027618_1"; Alias "MIMAT0027618"; Name "hsa-miR-6859-5p"; Derives_from "MI0026420
I am using htaseq to count the reds using scripts like
module load HTSeq; htseq-count -f bam -r pos /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/3kPa_HDF_miRNA_1bbduk.sorted.bam /gpfs/ycga/scratch60/nicoli/ag2646/RNAseqmicroRNAHuman/genes.gtf > /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/3kPa_HDF_miRNA_1count.txt.
However, I am getting the error called incorrect quote at line 14 in gtf file.
MY question is this the right annotation file? If not what annotation file I need to use for microRNA mapping so as to get microRNA counts? Any suggestion will be very helpful.
It would be good if you provide an error message of htseq-count and show line 14 of GTF file.
Line 14 of the gtf file looks like this
chr1 . miRNA_primary_transcript 17369 17436 . - . ID "MI0022705"; Alias "MI0022705"; Name "hsa-mir-6859-1.
The exact error looks like this Error processing GFF file (line 14 of file /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/Homo_sapiens.GRCh38.miRNA.gtf): unmatched quote [Exception type: ValueError, raised in _HTSeq.pyx:1608]
Literally, there is an omitted quote(") at the end of line. Try to download the GFF3 or GTF file again or fix the particular line ("hsa-mir-6859-1 to "hsa-mir-6859-1").
And, did you convert miRBase GFF3 file to GTF? I think something is wrong in the process of conversion.
The file head shows it to be GFF3. To be handled by HTseq should I convert GFF3 to GTF format? Is there a convenient way of doing so? the first few lines of my file look like this.
Although comment lines are saying that it is GFF3 format, I think it is actually GTF format judging from the attribute column format. The software tool you used for conversion of GFF to GTF might keep these comment lines in GTF file even after the conversion was finished. See https://genome.ucsc.edu/FAQ/FAQformat.html#format4 for more information.
And I think that you don't need to convert GFF to GTF format. I remember that htseq-count can accept GFF format as an input. I found the general usage of it from https://htseq.readthedocs.io/en/release_0.9.0/count.html .
Try it with the original GFF file from miRBase, ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3
What you show as a GTF file appears not to be a GTF file, but in fact to be a GFF3 file. I don't know if HTSeq can handle GFF3 files.