Dear all, I use the data in ENCODE Reference genome and want to calculate the expression level for every exons to do the research about alternative splicing, but during the data processing for polyA plus RNA-seq, I met many problems and I guess the major problems is that I use the wrong gtf or fasta files so I checked the ENCODE bulk-RNA seq pipeline and try their ref but still don't work, the data I chose and methods are as follows: Data:ENCODE A569 cell lines polyA plus rnaseq, the link is https://www.encodeproject.org/experiments/ENCSR000CON/, to save time I chose the ENCFF123MIB processed bam data, for it can not be directly used for samtools index, I first sort it and the commands are as follows:
samtools sort -@ 8 ENCFF123MIB.bam > ENCFF123MIB.sort.bam
samtools index -@ 8 ENCFF123MIB.sort.bam
htseq-count -i gene_id -i exon_number --additional-attr gene_name -t exon --nonunique all ENCFF123MIB.bam gencode.v29.primary_assembly.annotation_UCSC_names.gtf >count_exon.tsv
And unfortunately these commands will end at at the time 2619444 GFF lines processed and then do not work.
So I think maybe I was wrong from the start so I download the raw data ENCFF000EJV.fastq and ENCFF000EJJ.fastq(two reads for one replicate) and start from the alignment, and the command I used are:
hisat2_extract_exons.py ../../gencoderef/gencode.v41.annotation.gtf > gencode.v41.annotation.exon
hisat2_extract_splice_sites.py ../../gencoderef/gencode.v41.annotation.gtf > gencode.v41.annotation.ss
hisat2-build -p 6 --exon gencode.v41.annotation.exon --ss gencode.v41.annotation.ss gencode.v41.transcripts.fa index.gtf > hisat2_build.log 2>&1 &
hisat2 -p 8 -x index.gtf -1 ENCFF000EJJ.fastq -2 ENCFF000EJV.fastq -S ./A549_rep_1.sam > hisat2_rep1_sam.log 2>&1 &
And then I got the sam files but the accuracy is not so good and I use the same way for exon level analysis using htseq and samtools, but the results were so bad. The accuracy are as follows 113588758 reads; of these: 113588758 (100.00%) were paired; of these: 37999564 (33.45%) aligned concordantly 0 times 21929846 (19.31%) aligned concordantly exactly 1 time
53659348 (47.24%) aligned concordantly >1 times
----
37999564 pairs aligned concordantly 0 times; of these:
102625 (0.27%) aligned discordantly 1 time
----
37896939 pairs aligned 0 times concordantly or discordantly; of these:
75793878 mates make up the pairs; of these:
63955134 (84.38%) aligned 0 times
3339152 (4.41%) aligned exactly 1 time
8499592 (11.21%) aligned >1 times
71.85% overall alignment rate
and the rusults for exon expression are Warning: xxxxx reads with missing mate encountered. almost most of the reads are missing. __no_feature 41053418 __ambiguous 0 __too_low_aQual 109403140 __not_aligned 19140617 __alignment_not_unique 106185767 I choose many data to do but every time I failed and want some help.