Question

Problem With Counting Mapped Reads

2

Entering edit mode

10.7 years ago

njs531 ▴ 20

Hi, This is my very first experience analysing RNAseq data. My goal is to do differential analysis between two strains of a bacteria. So far, i managed to align and produce SAM and BAM files. I'm having problems to annotate and count my reads. Here are the commands that I used. My reads are from SOLID and hence in colourspace

$ nohup solid2fastq.pl 291_01_01 291_01_01-bwa  #Convert .csfasta and .qual to .fastq

$ nohup bwa index -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 

$ nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai 

$ perl -ne 'if($_ !~ m/^\S+?\t4\t/){print $_}' 291_01_01-bwa.sam > 291_01_01-bwa.sam.filtered #Convert to SAM file

$ samtools sort 291_01_01-bwa.bam 291_01_01-bwa.bam.sorted

$ samtools index 291_01_01-bwa.bam.sorted.bam

to produce .rpkm file

$ java -jar ~/bin/bam2rpkm-0.06/bam2rpkm-0.06.jar  -i 291_01_01-bwa.bam.sorted.bam -f Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.RPKM2.out  # i get an error here
$ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1

so i tried different method to count

$ htseq-count -i ID 291_01_01-bwa.sam Tbrucei427_TriTrypDB-4.0.gff > 291_01_01-bwa.sam_htseq-count #still error 
$Error occured when processing GFF file (line 37060 of file Tbrucei427_TriTrypDB-4.0.gff):

need more than 1 value to unpack

and different method

$ python make_bed_from_fasta.py ~/Downloads/reference/TbruceiTreu927AnnotatedCDS_TriTrypDB-4.0.fasta > 927_reference.bed #this python script converts .fasta into .bed file since the .gff file cannot be processed
$multiBamCov -q 30 -p -bams 291_01_01-bwa.bam.sorted.bam -bed 927_reference.bed > test_counts.txt

now I only get 0 counts for all genes. Does this mean that there is something wrong with my alignment files or something wrong with the counting method . And it seems like my .gff (version 3) file was unable to be read by htseq-count and also the java script . I downloaded the gff file from GeneDB and it seems like in many tutorials .gtf files are used instead. So I'm stuck at counting the read part and I really need some help . Help please .

rnaseq solid bedtools java • 7.0k views

ADD COMMENT • link updated 10.7 years ago by Michael 55k • written 10.7 years ago by njs531 ▴ 20

1

Entering edit mode

   GeneDB|Tb427_01_v4 EuPathDB supercontig 1

If that is the complete line, then something is missing here, a GFF file has 9 columns. Maybe the rest got cut off while pasting into this page? In any case, features of type supercontig or contig are not normally necessary, it should work with having only transcript, exon and gene features.

Also, what is in line 37060?

ADD REPLY • link 10.7 years ago by Michael 55k

0

Entering edit mode

nohup bwa aln -c TbruceiTreu927Genomic_TriTrypDB-4.0.fasta 291_01_01-bwa.singleF3.fastq 291_01_01-bwa.sai

Miss > here?

And can you share the url to where you download the gff file? So people can help you.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Xingyu Yang ▴ 280

0

Entering edit mode

It would be helpful if you posted a few lines of the GFF file surrounding the line that's causing errors. It's possible that the line is simply incorrect.

ADD REPLY • link 10.7 years ago by Devon Ryan 104k

0

Entering edit mode

This is snaphot of the .gff file.

##gff-version    3
##feature-ontology    so.obo
##attribute-ontology    gff3_attributes.obo
##sequence-region    11L3_v3    1    63482
##sequence-region    5K5_v3    1    158740
##sequence-region    BAC26D11    1    142770
##sequence-region    GTRYP_4071_4094-1a01.w2k4071    1    18466
##sequence-region    TP26M21-2a10.p1k    1    39798
##sequence-region    TP26P4-1a02.q1ka    1    16439
##sequence-region    TP26P4-3b07.q1k    1    56260
##sequence-region    TP26P4-5g04.p1k    1    13060
##sequence-region    TP26P4-6c09.p1k    1    25080
##sequence-region    TP29O12a-1b01.p1k    1    29117
##sequence-region    TP3F6-6f06.p1k    1    26266
##sequence-region    TP3F6-6h01.p1k    1    13990
##sequence-region    TPB1C9-1a08.q1k    1    11334
##sequence-region    GeneDB|Tb927_01_v4    1    1064672
##sequence-region    GeneDB|Tb927_02_v4    1    1193948
##sequence-region    GeneDB|Tb927_03_v4    1    1653225
##sequence-region    GeneDB|Tb927_04_v4    1    1590432
##sequence-region    GeneDB|Tb927_05_v4    1    1608198
##sequence-region    GeneDB|Tb927_06_v4    1    1618915
##sequence-region    GeneDB|Tb927_07_v4    1    2205233
##sequence-region    GeneDB|Tb927_08_v4    1    2481190
##sequence-region    GeneDB|Tb927_09_v4    1    3057547
##sequence-region    GeneDB|Tb927_10_v5    1    4144375
##sequence-region    GeneDB|Tb927_11_01_v4    1    4977082
.
.
.
.
.
GeneDB|Tb927_11_01_v4    EuPathDB    CDS    3572275    3572688    .    -    0    ID=cds_Tb11.01.5195-1;Name=cds;description=.;size=561;Parent=rna_Tb11.01.5195-1
GeneDB|Tb927_11_01_v4    EuPathDB    exon    3572275    3572835    .    -    .    ID=exon_Tb11.01.5195-1;Name=exon;description=exon;size=561;Parent=rna_Tb11.01.5195-1
GeneDB|Tb927_11_01_v4    EuPathDB    gene    3573190    3573720    .    -    .    ID=GeneDB|Tb11.01.5200;Name=Tb11.01.5200;description=hypothetical+protein%2C+conserved;size=531;web_id=Tb11.01.5200;locus_tag=Tb11.01.5200;size=531;Alias=Tb11.01.5200,20660476,70834804,Q382H5,3665126,EAN80306.1,74025704,XP_829418.1
GeneDB|Tb927_11_01_v4    EuPathDB    mRNA    3573190    3573720    .    -    .    ID=rna_Tb11.01.5200-1;Name=Tb11.01.5200-1;description=Tb11.01.5200-1;size=531;Parent=GeneDB|Tb11.01.5200;Ontology_term=GO:0008150,GO:0005739,GO:0044429,GO:0003674;Dbxref=EuPathDB:Tb11.01.5200,NCBI_gi:70834804,NCBI_gi:74025704,taxon:185431
GeneDB|Tb927_11_01_v4    EuPathDB    CDS    3573190    3573621    .    -    0    ID=cds_Tb11.01.5200-1;Name=cds;description=.;size=531;Parent=rna_Tb11.01.5200-1
GeneDB|Tb927_11_01_v4    EuPathDB    exon    3573190    3573720    .    -    .    ID=exon_Tb11.01.5200-1;Name=exon;description=exon;size=531;Parent=rna_Tb11.01.5200-1
GeneDB|Tb927_11_01_v4    EuPathDB    gene    3575284    3575826    .    -    .    ID=GeneDB|Tb11.01.5210;Name=Tb11.01.5210;description=hypothetical+protein%2C+conserved;size=543;web_id=Tb11.01.5210;locus_tag=Tb11.01.5210;size=543;Alias=EAN80307.1,XP_829419.1,Tb11.01.5210,74025706,Q382H4,70834805,3665127
.
.
.
.
tryp_XI-1036e06.p1k    EuPathDB    exon    22123    22200    .    +    .    ID=exon_tryp_XI-1036e06.p1k.snoRNA.0004-1;Name=exon;description=exon;size=78;Parent=rna_tryp_XI-1036e06.p1k.snoRNA.0004-1
##FASTA
>GeneDB|TB927.5.300b
ATGGCTCACGGCTCGATTCCAGTTATTGATGTCGGCCCTCTGTTCTGTGATGGAGAAAAG
GGGATGATGGATGTTGCGAAACAGATTGATCATGCCTGTAGGACGTGGGGTGTTTTTCTT
.
.
>cds_TB927.5.300b-1
MAHGSIPVIDVGPLFCDGEKGMMDVAKQIDHACRTWGVFLVVGHPIPRERTEKLMEMAKA
FFSLPLEEKLKVDIRKSKHHRGYGCLDAENVDPTKPFDCKETFNMGCHLPEDHPDVAAGK
.
.

First error:

ERROR: Problem encountered whilst reading gtf file. Could not interpret line 'GeneDB|Tb427_01_v4 EuPathDB supercontig 1 1064569 . + . ID=GeneDB|Tb427_01_v4;Name=Tb427_01_v4;description=Tb427_01_v4;size=1064569;web_id=Tb427_01_v4;molecule_type=dsDNA;organism_name=Trypanosoma brucei;translation_table=11;topology=linear;localization=nuclear;Dbxref=EuPathDB:Tb427_01_v4,taxon:1854310'

I have read somewhere that I need to specify the id name and I redo my htseq-count using:

$htseq-count -i ID 291_01_01_927_CDS.sam 927_reference.gff > testing_count.txt

The second error :

Error occured when processing GFF file (line 37060 of file Tbrucei427_TriTrypDB-4.0.gff): need more than 1 value to unpack [Exception type: ValueError, raised in __init__.py:220].

This is where the fasta sequence starts. I compared with other .gtf files and they don't have the fasta sequence so I removed the whole fasta sequence.

The gff file now is able to be proceessed but I'm getting all 0s for my reads.

This is where i get my .gff file http://tritrypdb.org/common/downloads/Current_Release/TbruceiTREU927/gff/data/

ADD REPLY • link updated 10.7 years ago by Devon Ryan 104k • written 10.7 years ago by njs531 ▴ 20

0

Entering edit mode

Try removing the Fasta portion of the file. That will likely fix the problem (at least it did for me). While including fasta in a GFF file is technically allowed, a lot of things will break if that's done.

Edit: I should add that I'm apparently using version 7 of that annotation file, rather than version 4 (I just tried the one you linked to). However, trying the older version that you used produced the error you received which was cured by, again, removing the fasta section.

ADD REPLY • link 10.7 years ago by Devon Ryan 104k

0

Entering edit mode

Hi, I managet to make the .gff to work by removing the fasta section . However, when i run htseq-count, I only get 0s for my counts

ADD REPLY • link 10.7 years ago by njs531 ▴ 20

0

Entering edit mode

See my reply to your reply to Michael, below.

ADD REPLY • link 10.7 years ago by Devon Ryan 104k

0

Entering edit mode

Do you get FPKMs with cufflinks? Not a solution for counts, but to check if all 0 counts are sane, I believe not.
Check that the landmark names in column (1) are identical to those in the SAM file, possibly that's why nothing is counted.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Michael 55k

0

Entering edit mode

Hi,

Thank you for the suggestions.

I am not sure how or where should I look at to check if my sam file is compatible with the .gff file.

Below is my SAM file that i get using BOWTIE

I tried to use cufflinks as you suggested using this command:

cufflinks --GTF 927_reference.gff 427.291_01_01.sam.sorted.bam

it returned me two files :

gene_fpkm_tracking

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
GeneDB|Tb427.01.90    -    -    GeneDB|Tb427.01.90    Tb427.01.90    -    GeneDB|Tb427_01_v4:50419-52198    -    -    22.5199    17.3532    25.3529    OK
GeneDB|Tb427.01.100    -    -    GeneDB|Tb427.01.100    Tb427.01.100    -    GeneDB|Tb427_01_v4:57395-59181    -    -    161.094    135.952    159.489    OK
GeneDB|Tb427.01.110    -    -    GeneDB|Tb427.01.110    Tb427.01.110    -    GeneDB|Tb427_01_v4:59855-60182    -    -    128.126    39.5039    70.3035    OK
GeneDB|Tb427.01.120    -    -    GeneDB|Tb427.01.120    Tb427.01.120    -    GeneDB|Tb427_01_v4:60253-62833    -    -    125.967    111    128.991    OK
GeneDB|Tb427.01.180    -    -    GeneDB|Tb427.01.180    Tb427.01.180    -    GeneDB|Tb427_01_v4:69904-72406    -    -    58.3088    50.0546    61.2557    OK
GeneDB|Tb427.01.190    -    -    GeneDB|Tb427.01.190    Tb427.01.190    -    GeneDB|Tb427_01_v4:75426-77919    -    -    30.1299    25.1177    32.4949    OK
GeneDB|Tb427.01.220    -    -    GeneDB|Tb427.01.220    Tb427.01.220    -    GeneDB|Tb427_01_v4:85857-88359    -    -    52.4017    44.7166    55.6551    OK
GeneDB|Tb427.01.240    -    -    GeneDB|Tb427.01.240    Tb427.01.240    -    GeneDB|Tb427_01_v4:93077-99008    -    -    5.69631    4.76209    6.64477    OK
GeneDB|Tb427.01.270    -    -    GeneDB|Tb427.01.270    Tb427.01.270    -    GeneDB|Tb427_01_v4:106666-107041    -    -    54.9609    18.6833    45.5406    OK

and

isoforms.fpkm_tracking

tracking_id    class_code    nearest_ref_id    gene_id    gene_short_name    tss_id    locus    length    coverage    FPKM    FPKM_conf_lo    FPKM_conf_hi    FPKM_status
rna_Tb427.01.90-1    -    -    GeneDB|Tb427.01.90    Tb427.01.90    -    GeneDB|Tb427_01_v4:50419-52198    1779    4.9371    22.5199    17.3532    25.3529    OK
rna_Tb427.01.100-1    -    -    GeneDB|Tb427.01.100    Tb427.01.100    -    GeneDB|Tb427_01_v4:57395-59181    1786    35.317    161.094    135.952    159.489    OK
rna_Tb427.01.110-1    -    -    GeneDB|Tb427.01.110    Tb427.01.110    -    GeneDB|Tb427_01_v4:59855-60182    327    28.0893    128.126    39.5039    70.3035    OK
rna_Tb427.01.120-1    -    -    GeneDB|Tb427.01.120    Tb427.01.120    -    GeneDB|Tb427_01_v4:60253-62833    2580    27.616    125.967    111    128.991    OK
rna_Tb427.01.180-1    -    -    GeneDB|Tb427.01.180    Tb427.01.180    -    GeneDB|Tb427_01_v4:69904-72406    2502    12.7832    58.3088    50.0546    61.2557    OK
rna_Tb427.01.190-1    -    -    GeneDB|Tb427.01.190    Tb427.01.190    -    GeneDB|Tb427_01_v4:75426-77919    2493    6.60547    30.1299    25.1177    32.4949    OK
rna_Tb427.01.220-1    -    -    GeneDB|Tb427.01.220    Tb427.01.220    -    GeneDB|Tb427_01_v4:85857-88359    2502    11.4882    52.4017    44.7166    55.6551    OK
rna_Tb427.01.240-1    -    -    GeneDB|Tb427.01.240    Tb427.01.240    -    GeneDB|Tb427_01_v4:93077-99008    5931    1.24882    5.69631    4.76209    6.64477    OK
rna_Tb427.01.270-1    -    -    GeneDB|Tb427.01.270    Tb427.01.270    -    GeneDB|Tb427_01_v4:106666-107041    375    12.0492    54.9609    18.6833    45.5406    OK

Again I'm not sure what am I supposed to look for and in these files to see whether the inputs files I provided are correct . I would really appreciate if you could help

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by njs531 ▴ 20

1

Entering edit mode

The excerpt from the sam file shows only unmapped reads (Flag 4), the chromosome name would be in column 3 of the SAM file, but the header @SQ SN:2 LN:14562717 reveals that reference sequence name 2 was used. As I suspect, there is a name mismatch between the reference FASTA file and the genome annotation file. Please see my answer below.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Michael 55k

0

Entering edit mode

That's not compatible. Any mapped reads would be mapped to a contig called 2, but that's not the name of anything in the GFF file.

ADD REPLY • link 10.7 years ago by Devon Ryan 104k

0

Entering edit mode

I'm sorry but I could not understand what does it mean. I have zero background in analysing this sort of data and I'm very new in programming.

Could you suggest what can I do to fix this problem?

ADD REPLY • link 10.7 years ago by njs531 ▴ 20

1

Entering edit mode

If you look at your fasta file, you will most likely see header lines starting with ">". There are possibly headers that contain ids like ">2", however the genome annotation in gff contains different ids. The result is that the alignment works, but the counting couldn't determine, where a gene is in relation to the reads, as it cannot know that "2" means the same as GeneDB|Tb427.01.90

If that is the case, however it is odd that cufflinks gives you non-zero FPKM for those genes!

ADD REPLY • link 10.7 years ago by Michael 55k

Ram · Answer 1 · 2014-03-24

0

Entering edit mode

10.7 years ago

Michael 55k

There is possibly a mismatch between the names in the reference FASTA file: TbruceiTreu927Genomic_TriTrypDB-4.0.fasta reference assembly and the genome annotation file: in the GFF file. You need to verify this by looking at the fasta file and its ID lines, they need to match, the fasta file should contain headers such as: GeneDB|Tb927_01_v4 as indicated in the headers ##sequence-region in the gff file. The best way to achieve this is possibly to download a new fasta file from GenBank, where did you get yours from? Note that you have to repeat the whole process starting with bwa index after downloading the correct annotation file. You need to post your reference fasta file as well, to be sure.

ADD COMMENT • link 10.7 years ago by Michael 55k

0

Entering edit mode

Hi, I got the .fasta genome from my labmate . I downloaded new fasta and gff files from the same website http://tritrypdb.org/common/downloads/Current_Release/TbruceiTREU927/fasta/data/

Here's how the fasta looks like

>11L3_v3 | organism=Trypanosoma_brucei_TREU927 | version=2013-08-28 | length=63482 | SO=contig
AAAACTGGCACGATACAACATCCCACTTTCCGCCCATCATGGCACTGTTTCAGGAACTCC
TCAAGCTTGGTGACCAAAGAACCCTTCTCATCGAACACATCCAACACACAGAATTGGGCC
CGTATCCCAGGAGGGTTCTTGTCGTGAGGAACAACCTTTATACTTAATCCCTTGGTAATG
GAGTAGCGTGCAGCAGCCTCCGCCACCGCCTCCAAGCAACAGTCCGTCATGAACAGCAGC
TTCGGTTGGTGAGCCCTCCACTTGTGAAGGTCCCCACCGCGACCCGAACACAGGTCAGCC
ACCACCATACCATCGTTGTCAAACTCACCGCCCTTATTATTCGACCTCAGGTAAGAAACA
GAGGTGCTAATCAGTACGCCTTTTATCCAGTTGTTGTAATGCCGTAGAATGCTACGATTC
TTTCCCGTTGATAGTTCCTTCGTTTTGCAAGCGTAATGTGCATTTACTTGCTCCACTTGT
............
>tryp_IXb-277d03.p1c | organism=Trypanosoma_brucei_TREU927 | version=2013-08-28 | length=13947 | SO=contig
TGAAAATTGTTGAACAACTTAATGAGTAGTATTTGGTTTCATTTCCATTAATTTTGAATA
TATATTACCTTGTATTGCTCTATACACTTCCATTTCTTATTCTCTCATTAATAATACTTA
CCTGCTGAACTTTCATATATAATATACTTGAGTTATGTGTTAAGGCTTTCTCTCTGCTAC
CTTTCTCTAATTATATGGAGTTTGTATCCTTATATGTTTTTCCCCCTTAGCTGTATGCTT
ATATGCGTTGTGTATGTGTTTATGTTGTGCATTATTGTGAAGTGTGTAGATCCTTCTTGT
GCATATATTCTATATTGTCAGCTTTGTAATCACTTTATATATCCTTTCTTTTTATTTTTT
GATTAATAACCTTTTTCTTTTCTGTAACATGGTACGGACGGTAAGTGATGATTCGCATAT
CAGTAGTTATGTGAAGTGTCAATTTTTTCTTTGTTCGAGTATATGTATATGTATATATCG
TGCAGACGTAGCGTATTATGTGTTGATATGTTGTATTTGGATAAATCAGGCAGGTGTTAC
GGCCTCCATGTGTTTGTTTTATAATATGTTTTTGTTAATATTGATCGTAAGGGAATATGG
CGATACGATGTTACTTTCGTGTTGCTGTGTTATTATACTGAGGGTTACTGTGTCAACCCC
TTCATCACCTGTATATGAATAAATAAATAAATAAATGTATATATATATGTTCGCTTGTCA
AATTCACTATTTCTCATTTTCTTCTTATTAGTGTGTTAGACAAAAGGAAGGAATATATTT
ATGTTGTTACTTGTGTGTTGTGCATGTGTGTGCTGACTAACAATAGTGTTGCACTATATA
ATTCGTTTTGTCGTTGAGGATGTTTTGTACAAAAAAATAGTTTCATTTCAGTAATATGAT

and the gff file

##gff-version    3
##feature-ontology    so.obo
##attribute-ontology    gff3_attributes.obo
##sequence-region    11L3_v3    1    63482
##sequence-region    5K5_v5.1    1    158740
##sequence-region    GTRYP_4071_4094-1a01.w2k4071    1    18466
##sequence-region    TP26M21-2a10.p1k_v5.1    1    39798
##sequence-region    TP26P4-1a02.q1ka    1    16439
##sequence-region    TP26P4-3b07.q1k    1    56260
##sequence-region    TP26P4-5g04.p1k    1    13060
##sequence-region    TP26P4-6c09.p1k    1    25080
##sequence-region    TP29O12a-1b01.p1k    1    29117
##sequence-region    TP3F6-6f06.p1k    1    26266
##sequence-region    TP3F6-6h01.p1k    1    13990
##sequence-region    TPB1C9-1a08.q1k    1    11334
##sequence-region    Tb927_01_v5.1    1    1064672
##sequence-region    Tb927_02_v5.1    1    1193948
##sequence-region    Tb927_03_v5.1    1    1653225
##sequence-region    Tb927_04_v5.1    1    1590432
##sequence-region    Tb927_05_v5.1    1    1802303

I Iook at the fasta and the name starts something like >11L3_v3. and the ##sequence-region in gff is also 11L3_v3. Does this mean that they are compatible and I can use both file for mapping and annotating? Just to make sure that I have the correct fles before I start again. And which software do you suggest for aligning. I have BWA and BOWTIE installed. My reads are in colourspace so I have .csfasta and .qual read files.

Thank you so much for your help.

Off topic question. I'm new at this website. Do I need to rate or choose the best answer or something similar in this website?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by njs531 ▴ 20

1

Entering edit mode

Yeah, those should work fine together and you shouldn't run into these problems anymore (anyway, it's always nice to use the updated versions). For colorspace, either bowtie1 or older versions of bwa would be fine. Everyone has their own personal preference there.

It would be good if you uprated Michael's answer and then set it as being "accepted", which I think involves clicking on a check-mark, since it ended up being much of the root cause of things (in addition to htseq-count not handling GFF files with fasta sections).

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.7 years ago by Devon Ryan 104k