Question

Micro analysis pipeline

0

Entering edit mode

3.6 years ago

aranyak111 • 0

I am very new to microanalysis. I have 6 fastq files of human microRNA. the first few lines of the fastq files which is of about 25 nucleotide length.I have trimmed the reads using bbduk. The trimmed reads have been aligned using custom bowtie scripts.

The human genome used to build the reference is a bowtie inbuilt reference H. sapiens, GRCh38 + major SNVs. The custom made scripts used for bowtie are module load Bowtie2/2.4.2-GCCcore-10.2.0 ; bowtie2 -x grch38_1kgmaj -U 3kPa_HDF_miRNA_1bbduk.fastq -S 3kPa_HDF_miRNA_2bbduk.sam for one representative FASTq sequence.The reads have been converted and sorted by position using scripts given in bowtie website

samtools  view -bS 3kPa_HDF_miRNA_2bbduk.sam >  3kPa_HDF_miRNA_2bbduk.bam

then samtools sort 3kPa_HDF_miRNA_2bbduk.bam to sort the bam file. The first few lines of the samfile looks like this

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717

After alignment, I am using the miRNA part of the GR38.gtf file to annotate whose first few lines look like this.

chr1    .       miRNA_primary_transcript        17369   17436   .       -       .       ID "MI0022705"; Alias "MI0022705"; Name "hsa-mir-6859-1
chr1    .       miRNA   17409   17431   .       -       .       ID "MIMAT0027618"; Alias "MIMAT0027618"; Name "hsa-miR-6859-5p"; Derives_from "MI0022705
chr1    .       miRNA   17369   17391   .       -       .       ID "MIMAT0027619"; Alias "MIMAT0027619"; Name "hsa-miR-6859-3p"; Derives_from "MI0022705
chr1    .       miRNA_primary_transcript        30366   30503   .       +       .       ID "MI0006363"; Alias "MI0006363"; Name "hsa-mir-1302-2
chr1    .       miRNA   30438   30458   .       +       .       ID "MIMAT0005890"; Alias "MIMAT0005890"; Name "hsa-miR-1302"; Derives_from "MI0006363
chr1    .       miRNA_primary_transcript        187891  187958  .       -       .       ID "MI0026420"; Alias "MI0026420"; Name "hsa-mir-6859-2
chr1    .       miRNA   187931  187953  .       -       .       ID "MIMAT0027618_1"; Alias "MIMAT0027618"; Name "hsa-miR-6859-5p"; Derives_from "MI0026420

I am using htaseq to count the reds using scripts like

module load HTSeq; htseq-count -f bam -r pos /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/3kPa_HDF_miRNA_1bbduk.sorted.bam /gpfs/ycga/scratch60/nicoli/ag2646/RNAseqmicroRNAHuman/genes.gtf > /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/3kPa_HDF_miRNA_1count.txt.

However, I am getting the error called incorrect quote at line 14 in gtf file.

MY question is this the right annotation file? If not what annotation file I need to use for microRNA mapping so as to get microRNA counts? Any suggestion will be very helpful.

RNA-Seq • 1.5k views

ADD COMMENT • link updated 3.6 years ago by Sukjun Kim ▴ 90 • written 3.6 years ago by aranyak111 • 0

0

Entering edit mode

It would be good if you provide an error message of htseq-count and show line 14 of GTF file.

ADD REPLY • link 3.6 years ago by Sukjun Kim ▴ 90

0

Entering edit mode

Line 14 of the gtf file looks like this

chr1 . miRNA_primary_transcript 17369 17436 . - . ID "MI0022705"; Alias "MI0022705"; Name "hsa-mir-6859-1.

The exact error looks like this Error processing GFF file (line 14 of file /gpfs/ycga/scratch60/nicoli/ag2646/HumanmicroRNA_bowtie/Homo_sapiens.GRCh38.miRNA.gtf): unmatched quote [Exception type: ValueError, raised in _HTSeq.pyx:1608]

ADD REPLY • link 3.6 years ago by aranyak111 • 0

1

Entering edit mode

Literally, there is an omitted quote(") at the end of line. Try to download the GFF3 or GTF file again or fix the particular line ("hsa-mir-6859-1 to "hsa-mir-6859-1").

And, did you convert miRBase GFF3 file to GTF? I think something is wrong in the process of conversion.

ADD REPLY • link 3.6 years ago by Sukjun Kim ▴ 90

0

Entering edit mode

The file head shows it to be GFF3. To be handled by HTseq should I convert GFF3 to GTF format? Is there a convenient way of doing so? the first few lines of my file look like this.

##gff-version 3
##date 2014-6-22
#
# Chromosomal coordinates of Homo sapiens microRNAs
# microRNAs:               miRBase v21
# genome-build-id:         GRCh38
# genome-build-accession:  NCBI_Assembly:GCA_000001405.15
#
# Hairpin precursor sequences have type "miRNA_primary_transcript".
# Note, these sequences do not represent the full primary transcript,
# rather a predicted stem-loop portion that includes the precursor
# miRNA. Mature sequences have type "miRNA".
#
chr1    .       miRNA_primary_transcript        17369   17436   .       -       .       ID "MI0022705"; Alias "MI0022705"; Name "hsa-mir-6859-1

ADD REPLY • link updated 3.6 years ago by i.sudbery 20k • written 3.6 years ago by aranyak111 • 0

0

Entering edit mode

Although comment lines are saying that it is GFF3 format, I think it is actually GTF format judging from the attribute column format. The software tool you used for conversion of GFF to GTF might keep these comment lines in GTF file even after the conversion was finished. See https://genome.ucsc.edu/FAQ/FAQformat.html#format4 for more information.

And I think that you don't need to convert GFF to GTF format. I remember that htseq-count can accept GFF format as an input. I found the general usage of it from https://htseq.readthedocs.io/en/release_0.9.0/count.html .

htseq-count [options] <alignment_files> <gff_file>

Try it with the original GFF file from miRBase, ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3

ADD REPLY • link 3.6 years ago by Sukjun Kim ▴ 90

0

Entering edit mode

What you show as a GTF file appears not to be a GTF file, but in fact to be a GFF3 file. I don't know if HTSeq can handle GFF3 files.

ADD REPLY • link 3.6 years ago by i.sudbery 20k