Hey guys,
I am new to RNA-Seq. I am trying to map some human tissue (contains bacteria) reads to a bacterial genome. I have used bowtie2 to filter all the remaining human reads, and mapped to a reference genome using:
bowtie2 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170861.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170862.fastq -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/FusoSRR317086.sam -p 3
And if I do have some reads mapped (using samtools flagstat) :
3615588 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
32849 + 0 mapped (0.91% : N/A)
3615588 + 0 paired in sequencing
1807794 + 0 read1
1807794 + 0 read2
654 + 0 properly paired (0.02% : N/A)
15826 + 0 with itself and mate mapped
17023 + 0 singletons (0.47% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
But when I used HTSeq to count reads, mots of the reads are un-aligned:
(python -m HTSeq.scripts.count -s no -i ID /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/sorted.sam /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/Reference_genome_and_gff/ATCC_25586.gff)
4174 GFF lines processed.
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Warning: Malformed SAM line: MRNM == '=' although read is not aligned.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
100000 SAM alignment record pairs processed.
200000 SAM alignment record pairs processed.
300000 SAM alignment record pairs processed.
400000 SAM alignment record pairs processed.
...
...
id10 0
id11 0
id12 0
id13 0
id14 0
id15 0
id16 0
id17 0
id18 0
id19 0
...
...
__no_feature 13
__ambiguous 0
__too_low_aQual 24922
__not_aligned 1782858
__alignment_not_unique 0
The first few lines of the gff file: ##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM732v1
#!genome-build-accession NCBI_Assembly:GCF_000007325.1
##sequence-region NC_003454.1 1 2174500
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=190304
NC_003454.1 RefSeq region 1 2174500 . + . ID=id0;Dbxref=ATCC:25586,taxon:190304;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=ATCC 25586;sub-species=nucleatum
NC_003454.1 RefSeq gene 77 709 . - . ID=gene0;Dbxref=GeneID:992632;Name=FN1496;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1496
NC_003454.1 RefSeq CDS 77 709 . - 0 ID=cds0;Parent=gene0;Dbxref=Genbank:NP_602323.1,GeneID:992632;Name=NP_602323.1;gbkey=CDS;product=spore coat protein regulator protein YlbO;protein_id=NP_602323.1;transl_table=11
NC_003454.1 RefSeq gene 1104 2228 . - . ID=gene1;Dbxref=GeneID:992723;Name=FN1497;gbkey=Gene;gene_biotype=protein_coding;locus_tag=FN1497
NC_003454.1 RefSeq CDS 1104 2228 . - 0 ID=cds1;Parent=gene1;Dbxref=Genbank:NP_602324.1,GeneID:992723;Name=NP_602324.1;gbkey=CDS;product=multidrug resistance protein 2;protein_id=NP_602324.1;transl_table=11
NC_003454.1 RefSeq gene 2703 3602 . + .
First few lines of sam file :
@HD VN:1.0 SO:queryname
@SQ SN:NC_003454.1 LN:2174500
@PG ID:bowtie2 PN:bowtie2 VN:2.2.9 CL:"/Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -x /Users/winterlab/Desktop/Sequencing/RNA_seq_pipeline/bowtie2-2.2.9/Indexes/Fusobacterium_ATCC25586 -S /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Reads_mapped_to_Fuso/FusoSRR317089.sam -p 3 -1 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170891.fastq -2 /Users/winterlab/Desktop/Sequencing/Fuso_RNA-seq_raw-data/Filtered_reads/SRR3170892.fastq"
SRR317089.45 77 * 0 0 * * 0 0 CTTGGCTGTGGTTTCGCTGG BCC?B@CC=@C@CCC@@@@@ YT:Z:UP
SRR317089.45 141 * 0 0 * * 0 0 AGGGGAATCCGACTGTTTAATTAAAACA @@@@@=?@0@@@@@@@@@=@@@@@@@@@ YT:Z:UP
SRR317089.48 77 * 0 0 * * 0 0 AACGATCAGAGTAGTGGTATTTCACCGGCGGCCCGCAGGGCCG CCCAACA>ACCCBCCC@8@CB?<>BBB?@BBB@BBABBB<-?@ YT:Z:UP
SRR317089.48 141 * 0 0 * * 0 0 CCCTACCTACTATCCAGCGAAACCACAGCCAAGGNAACGGGATTGGCGGAATCAGCGGG BBBB>B*BB><@B>@BBBBBBBB@BB=@BB0/00!68688B+B7<@@:@77;@3B6@>@ YT:Z:UP
SRR317089.49 77 * 0 0 * * 0 0 AAGAAACTAACCAGGATTCCCTCAGTAACGGCGAGTGCACAGGGAAGAGCCCCGCGCC @BB:BBB?*AAA?AA?:AAAAA?AA>B:BBAA?5A,<+;7-772+95499?;88?9?? YT:Z:UP
SRR317089.49 141 * 0 0 * * 0 0 CGTGCCGGCATTTAGCCTTAGATGGAGTTTACCACCCGCTTTGGGCTGCATTCCCAAGCAACCC ??B@?AAA3ABBBA<BB@BAA2A<:BBBBB8???;B=BB<*8B87A=?A68/?7?-/.><7?<? YT:Z:UP
SRR317089.56 77 * 0 0 * * 0 0 CATCGTTTATGGTCGGAACTACGACGGTATCTGATCGTCTTCGAACCTC @@CC@=CC?C7789<::::2>>9+9B?1??BAA9=AA5A<2?B?5AA3A YT:Z:UP
SRR317089.56 141 * 0 0 * * 0 0 CGCAGCTAGGAATAATGGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATGAAGAG BAAB<A??<B??9<B6699>BA6@BA<A6A?=BA<<?B85??2988<8(9>>89;40??19???86:+3/?3? YT:Z:UP
SRR317089.58 77 * 0 0 * * 0 0 GAACAAGATGCAGAACCATGGCTATGAGAACCCCACCTACAAATACCTGGAGCAG AA>A?8>>.>47>09A8B?;8**<<9+.95:84?86(379;/:A?6?3?3'>7=@ YT:Z:UP
So I made sure the chromosome names are matching and everything so I cant figure out why.
Any input is highly appreciated:)
< 1% reads mapped should be the first cause for concern.
Your filter step appears to have either failed or removed the bacterial reads as well. Look into a read-binning strategy using BBsplit and the original data files.
Thanks for all the inputs guys! I appreciate it! I did what genomax2 and WouterDeCoster suggested, and skipped the "filtering human reads" step but it seems that the result is still pretty much the same:
As for the low mapped reads, it is kind of expected, because I am mapping human tissue reads to one single bacterial genome (there are a lot other bacteria present there). With that being said, I should have 2088270*0.96% overall alignment rate=20047 reads. That is the case when I searched for "NC_003454.1" in the sam file. I am so confused...
Is the chromosome ID (should be only one or two if there is a plasmid) matching in your sam file and the annotation file you are using?
Thanks for the comments, genomax2! Can you specify what is the chromosome ID here? I am not sure that I understand correctly. Here are the first few lines of the gff file:
Could you show us the chromosomes/contigs/... as those are present in the sam file, using
samtools idxstats
?It seems that samtools idxstats only works on bam files. Because I set the output of the bowtie2 alignment as sam file, I did not convert them to bam. Do you think that could be a problem? Also, because I can open sam file in a text editor, I post the first few lines of the sam file in here:
I blasted
GTAGCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGG
, turns out to be ribosomal RNA. How was the library prep performed? Ribo depletion or poly A selection?The library was prep using Ribo depletion.
Chromosome id is
NC_003454.1
. You can either use @Wouter's command or show us the output ofsamtools view -H your_file.sam
.I did what you suggested, and here is the out put:
At least the chromosome name is matching in your sam and annotation file. I think counting may not be working because your sam file is not sorted (look at the header). I suggest that you sort your file using
samtools sort
(convert it to BAM) and then count.Alternatively you could use featureCounts to do the counting. It can sort the files as needed.
Thank you so much, genomax2! You are life saver! featureCounts worked beautifully!