Question

how can I align several fastq files with hisat2 and get htseq-count

2

Entering edit mode

6.8 years ago

Learner ▴ 280

Hello,

I have been strugelling to run Hisat2 code for alignment of several fastq files from human I have read as many post as I could online and I went through the protocol here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032908/

At the moment, i have the latest binary version of hisat2 in $HOME/bin When I do hisat2 in the terminal and shows that I don't import any input but it indicates that it is there and can be invoked

No index, query, or output file specified! HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo) Usage:

I have 10 fasta files on a folder named data on my desktop. I have put the three scripts attached to the protocl in that folder and the files are shown below

NIHMS816842-supplement-README.txt
RNAseq_shell_script.sh
SRR1_1.fastq.gz SRR9_1.fastq.gz SRR1_2.fastq.gz SRR9_2.fastq.gz SRR2_1.fastq.gz SRR10_1.fastq.gz SRR2_2.fastq.gz SRR10_2.fastq.gz SRR3_1.fastq.gz SRR3_2.fastq.gz SRR4_1.fastq.gz SRR4_2.fastq.gz SRR5_1.fastq.gz SRR5_2.fastq.gz SRR6_1.fastq.gz SRR6_2.fastq.gz SRR7_1.fastq.gz SRR7_2.fastq.gz rnaseq_ballgown.R SRR8_1.fastq.gz rnaseq_pipeline.config.sh SRR8_2.fastq.gz

Can you please let me know how I can run the code to align these fastq files with a Human gene reference ?

I did not change anything on any of the script which I believe that I must change the one in rnaseq_pipline.config.sh

At this moment, it is

NUMCPUS=8
HISAT2=$(which hisat2)
STRINGTIE=$(which stringtie)
SAMTOOLS=$(which samtools)
BASEDIR=$(pwd -P)/chrX_data
FASTQLOC="$BASEDIR/samples"
GENOMEIDX="$BASEDIR/indexes/chrX_tran"
GTFFILE="$BASEDIR/genes/chrX.gtf"
PHENODATA="$BASEDIR/geuvadis_phenodata.csv"
TEMPLOC="./tmp" #this will be relative to the output directory
reads1=(${FASTQLOC}/*_1.*)
reads1=("${reads1[@]##*/}")
reads2=("${reads1[@]/_1./_2.}")

All what I want is to get the htseq-count. txt files for each sample

based on this post Run mutltiple Fastq files on Hisat2 If I do the following

for f in `ls -1 *_1.fastq.gz | sed 's/_1.fastq.gz//' `
do
echo hisat2 -x mm10idx -1 ${f}_1.fastq.gz -2 ${f}_2.fastq.gz -S ${f}.bam
done

It will generate all the bam files but all empty. if I echo the do , it will be the output as an example

hisat2 -x mm10idx -1 SRR1_1.fastq.gz -2 SRR1_2.fastq.gz -S SRR1.bam
hisat2 -x mm10idx -1 SRR2_1.fastq.gz -2 SRR42_2.fastq.gz -S SRR2.bam
hisat2 -x mm10idx -1 SRR3_1.fastq.gz -2 SRR3_2.fastq.gz -S SRR3.bam
hisat2 -x mm10idx -1 SRR4_1.fastq.gz -2 SRR_2.fastq.gz -S SRR4.bam
hisat2 -x mm10idx -1 SRR5_1.fastq.gz -2 SRR5_2.fastq.gz -S SRR5.bam
hisat2 -x mm10idx -1 SRR6_1.fastq.gz -2 SRR6_2.fastq.gz -S SRR6.bam
hisat2 -x mm10idx -1 SRR7_1.fastq.gz -2 SRR7_2.fastq.gz -S SRR7.bam
hisat2 -x mm10idx -1 SRR8_1.fastq.gz -2 SRR8_2.fastq.gz -S SRR8.bam
hisat2 -x mm10idx -1 SRR9_1.fastq.gz -2 SRR9_2.fastq.gz -S SRR9.bam
hisat2 -x mm10idx -1 SRR10_1.fastq.gz -2 SRR10_2.fastq.gz -S SRR10.bam

On the other hand, where should I involve the reference genome? for example, I downloaded the hg19 from here https://ccb.jhu.edu/software/hisat2/index.shtml

RNA-Seq • 6.1k views

ADD COMMENT • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

Maybe DEWE (Differential Workflow Executor for RNA-Seq) can be helpful to you: http://www.sing-group.org/dewe It provides a user friendly GUI to run HISAT2, convert aligments to bam and run htseq-count. Regards.

ADD REPLY • link 6.8 years ago by Hugo ▴ 380

0

Entering edit mode

Better to use Galaxy, it's widely used and provides most of the common tools.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan can you please tell me how to use galaxy for it? I know how to upload the data and how to run the hisat but some parameters I don't know. For example, single end or paired reads? I guess pair read? or input read as individual or paird? and the rest default values. but the question is how then get the htseq-count? because when I try to use htseq-count function, it needs gff file which I don't have

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

You have paired-end data, so you should select paired end. Run featureCounts on the resulting BAM files. It's in Galaxy too, as is DESeq2. In fact, there's training material available online (it's what we made to use when we train people here in Freiburg).

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan thank you . actually many people think of htseq-count rather than feature counts. I do't know which one is the best but I will try to follow your treating protocol and contact you if needed. Thank you again I believe Galaxy is amazing and easier to work with

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

@Devon Ryan can you please tell me how to get this gff for human ? can I use http://genome.ucsc.edu/cgi-bin/hgTables and then send it to galaxy or do you know a better way?

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

Never use GFF files or annotations from UCSC. Download a GTF file from Gencode or Ensembl.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan can you please give me a link to the right gft? why would UCSC be a problem?

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

Here is the link to the Gencode, release 27 primary assembly GTF. UCSC annotations have historically both lagged behind those of Ensembl/Gencode and been of lower quality. They have also historically had a variety of odd issues, such as having multiple instances of the same gene on multiple strands/chromosomes all sharing the same ID...which makes no biological sense (they can have the same name, but they need different IDs since they're different loci). Ensembl/Gencode GTFs won't break tools, UCSC annotations have been known to do exactly that.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

Devon Ryan · Accepted Answer · 2018-02-13

2

Entering edit mode

6.8 years ago

Devon Ryan 104k

You're trying to do too many things at once, which is just going to cause confusion for someone new to the field. Start out with a single sample that manually type the command for. Once that works properly you can move on to multiple samples in a loop.

Since you apparently downloaded the hg19 prebuilt reference, you need to specify where it is with the -x option. In your examples, that was -x mm10idx, but presumably something like -x /some/path/to/the/downloaded/files/hg19 would make more sense.

ADD COMMENT • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

@Devon Ryan I am trying to run this, I will accept your answer once I get it to run

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

@Devon Ryan that helped me thanks although I had to do a lot of work to get it to run :-D

ADD REPLY • link 6.8 years ago by Learner ▴ 280

0

Entering edit mode

@Devon Ryan Hi Devo I tried to run this bash script:

SAMPLES="x1..x9"

for SAMPLE in $SAMPLES; do
    hisat2 -p 4 --dta -x ${hisat2Ref} -U fastq/${SAMPLE}.fastq.gz | samtools view -bh - | samtools sort - bam/${SAMPLE}.bam | samtools index bam/${hisat2Ref}.bam
done

But I didn't get it working do you could give one idea why it didn't worked? I not experienced in bash scripting. Thanks

ADD REPLY • link updated 5.4 years ago by Devon Ryan 104k • written 5.4 years ago by psschlogl ▴ 50

1

Entering edit mode

I strongly recommend that you not use bash scripting and instead use something more scalable and controllable like snakemake.