Hi, This is my very first experience analysing RNAseq data. My goal is to do differential analysis between two strains of a bacteria. So far, i managed to align and produce SAM and BAM files. I'm having problems to annotate and count my reads. Here are the commands that I used. My reads are from SOLID and hence in colourspace
$ nohup solid2fastq.pl 291_01_01 291_01_01-bwa #Convert .csfasta and .qual to .fastq
$ nohup bwa index -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta
$ nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai
$ perl -ne 'if($_ !~ m/^\S+?\t4\t/){print $_}' 291_01_01-bwa.sam > 291_01_01-bwa.sam.filtered #Convert to SAM file
$ samtools sort 291_01_01-bwa.bam 291_01_01-bwa.bam.sorted
$ samtools index 291_01_01-bwa.bam.sorted.bam
to produce .rpkm file
$ java -jar ~/bin/bam2rpkm-0.06/bam2rpkm-0.06.jar -i 291_01_01-bwa.bam.sorted.bam -f Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.RPKM2.out # i get an error here
$ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1
so i tried different method to count
$ htseq-count -i ID 291_01_01-bwa.sam Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.sam_htseq-count #still error
$Error occured when processing GFF file (line 37060 of file Tbrucei427_TriTrypDB-4.0.gff):
need more than 1 value to unpack
and different method
$ python make_bed_from_fasta.py ~/Downloads/reference/TbruceiTreu927AnnotatedCDS_TriTrypDB-4.0.fasta > 927_reference.bed #this python script converts .fasta into .bed file since the .gff file cannot be processed
$multiBamCov -q 30 -p -bams 291_01_01-bwa.bam.sorted.bam -bed 927_reference.bed > test_counts.txt
now I only get 0 counts for all genes. Does this mean that there is something wrong with my alignment files or something wrong with the counting method . And it seems like my .gff (version 3) file was unable to be read by htseq-count and also the java script . I downloaded the gff file from GeneDB and it seems like in many tutorials .gtf files are used instead. So I'm stuck at counting the read part and I really need some help . Help please .
If that is the complete line, then something is missing here, a GFF file has 9 columns. Maybe the rest got cut off while pasting into this page? In any case, features of type supercontig or contig are not normally necessary, it should work with having only transcript, exon and gene features.
Also, what is in line 37060?
Miss
>
here?And can you share the url to where you download the gff file? So people can help you.
It would be helpful if you posted a few lines of the GFF file surrounding the line that's causing errors. It's possible that the line is simply incorrect.
This is snaphot of the .gff file.
First error:
I have read somewhere that I need to specify the id name and I redo my htseq-count using:
The second error :
This is where the fasta sequence starts. I compared with other .gtf files and they don't have the fasta sequence so I removed the whole fasta sequence.
The gff file now is able to be proceessed but I'm getting all 0s for my reads.
This is where i get my .gff file http://tritrypdb.org/common/downloads/Current_Release/TbruceiTREU927/gff/data/
Try removing the Fasta portion of the file. That will likely fix the problem (at least it did for me). While including fasta in a GFF file is technically allowed, a lot of things will break if that's done.
Edit: I should add that I'm apparently using version 7 of that annotation file, rather than version 4 (I just tried the one you linked to). However, trying the older version that you used produced the error you received which was cured by, again, removing the fasta section.
Hi, I managet to make the .gff to work by removing the fasta section . However, when i run htseq-count, I only get 0s for my counts
See my reply to your reply to Michael, below.
Hi,
Thank you for the suggestions.
I am not sure how or where should I look at to check if my sam file is compatible with the .gff file.
Below is my SAM file that i get using BOWTIE
I tried to use cufflinks as you suggested using this command:
it returned me two files :
gene_fpkm_tracking
and
isoforms.fpkm_tracking
Again I'm not sure what am I supposed to look for and in these files to see whether the inputs files I provided are correct . I would really appreciate if you could help
The excerpt from the sam file shows only unmapped reads (Flag 4), the chromosome name would be in column 3 of the SAM file, but the header
@SQ SN:2 LN:14562717
reveals that reference sequence name2
was used. As I suspect, there is a name mismatch between the reference FASTA file and the genome annotation file. Please see my answer below.That's not compatible. Any mapped reads would be mapped to a contig called
2
, but that's not the name of anything in the GFF file.I'm sorry but I could not understand what does it mean. I have zero background in analysing this sort of data and I'm very new in programming.
Could you suggest what can I do to fix this problem?
If you look at your fasta file, you will most likely see header lines starting with ">". There are possibly headers that contain ids like ">2", however the genome annotation in gff contains different ids. The result is that the alignment works, but the counting couldn't determine, where a gene is in relation to the reads, as it cannot know that "2" means the same as GeneDB|Tb427.01.90
If that is the case, however it is odd that cufflinks gives you non-zero FPKM for those genes!