Hi guys, I really do not know what is happening to my RNA seq data analysis pipeline and hence I'm looking for some help. Although I found some related posts and I tried to follow instructions to solve the problem I'm still here and I'm getting crazy.
Briefly: the steps are the following:
STAR:
for i in *fastq.gz; do STAR --runMode alignReads --genomeLoad LoadAndKeep --readFilesCommand zcat --outSAMtype BAM Unsorted --genomeDir /.../star_indices_overhang100 --readFilesIn $i ${i%.fastq.gz}.fastq.gz --runThreadN 10 --outFileNamePrefix ${i%.fastq.gz} done
This is ok. Then:
samtools merge S4_merged.bam S4_1_sorted.bam S4_2_sorted.bam S4_3_sorted.bam S4_4_sorted.bam samtools sort S4_merged.bam S4_4_sorted samtools index S4_4_sorted
then:
samtools view -F 4 S4_4_sorted | htseq-count -f bam -m intersection-nonempty -t exon -i gene_id -a 15 -r pos -s no - /.../hg38.refseq.gtf > Counts_S4.txt
Error:
100000 GFF lines processed.
200000 GFF lines processed.
300000 GFF lines processed.
400000 GFF lines processed.
500000 GFF lines processed.
600000 GFF lines processed.
700000 GFF lines processed.
800000 GFF lines processed.
900000 GFF lines processed.
926832 GFF lines processed.
Error occured when reading beginning of SAM/BAM file.
expected string or Unicode object, file found
[Exception type: TypeError, raised in csamtools.pyx:477]
samtools view -H S4_4_sorted.bam
@HD VN:1.4 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chrX LN:155270560
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr20 LN:63025520
@SQ SN:chrY LN:59373566
@SQ SN:chr19 LN:59128983
@SQ SN:chr22 LN:51304566
@SQ SN:chr21 LN:48129895
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chr7_gl000195_random LN:182896
........................................................................
more hg38.refseq.gtf. chr1 hg38_ncbiRefSeq stop_codon 67093005 67093007 0.000000 - . gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq CDS 67093008 67093604 0.000000 - 0 gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq exon 67092176 67093604 0.000000 - . gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq CDS 67095235 67095421 0.000000 - 1 gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq exon 67095235 67095421 0.000000 - . gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq CDS 67096252 67096321 0.000000 - 2 gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq exon 67096252 67096321 0.000000 - . gene_id "XM_011541469.1"; transcript_id "XM_011541469.1"; chr1 hg38_ncbiRefSeq CDS 67103238 67103382 0.000000 - 0 gene_id "XM_011541469.1"; transcript_id "XM_011541469.1";
Can anyone help me please?
Samtools view converts it to sam format, but them you are telling htseq-count that it is a bam file. So skip the view command. (Or drop -f bam)