Hello, I aligned artificial RNA-seq reads against the genome and transcriptome using STAR. STAR also generated a transcriptome.bam file. I want to know which ENST transcripts intersect a specific part of the GTF, so I have subset the GTF file.
I have tried bedtools intersect, and also findOverlaps in R.
findOverlaps did not show any overlaps when I ran it as follows:
library(GenomicRanges)
# reads from the transcriptome BAM (transcript ID used as seqname)
gr1 <- GRanges(seqnames = transcriptome_file_overlap$transcript_id,
               ranges = IRanges(start = transcriptome_file_overlap$start, end = transcriptome_file_overlap$end),
               strand = transcriptome_file_overlap$strand)
# features from the GTF subset (transcript ID used as seqname)
gr2 <- GRanges(seqnames = gtf_overlap_subset$transcript_id,
               ranges = IRanges(start = gtf_overlap_subset$start, end = gtf_overlap_subset$end),
               strand = gtf_overlap_subset$strand)
overlaps1 <- findOverlaps(gr1, gr2)
In R, if I keep the strand information, no reads overlap. However, the BAM file has more than 18 million entries and the GTF file has 25k, so it is unlikely that none of them intersect.
bedtools intersect, on the other hand, gives me the following error:
***** WARNING: File transcriptome.bed has inconsistent naming convention for record:
ENST00000359218.10 71889 71914 0 +
The main reason is that the first column in the BAM file does not contain chromosome information, which is understandable as it is the transcriptome.
What options are there to find the specific reads that intersect completely with the GTF file?
It's not an error, it's a warning. It usually happens when there is a list of chromosomes like 'chr1,chr2,chr3' mixed with chromosome names not starting with 'chr': 1, 2, 3, GL0012345, etc.
Show us the command line please, and show us a few lines of each file.
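A quick way to see whether two files can match at all is to compare their first columns. This is only a rough sketch in plain Python, not what bedtools does internally: the helper names are made up, the BED record is the one from the warning, and the GTF line is an invented example.

```python
from collections import Counter


def first_column_names(lines):
    """Collect the distinct sequence names found in column 1."""
    names = set()
    for line in lines:
        # skip blank lines, comments and UCSC-style header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def naming_styles(names):
    """Count how many names follow each naming style."""
    styles = Counter()
    for name in names:
        if name.startswith("chr"):
            styles["chr-prefixed"] += 1
        elif name.startswith("ENST"):
            styles["transcript ID"] += 1
        else:
            styles["other"] += 1
    return styles


# record taken from the warning message
bed_lines = ["ENST00000359218.10\t71889\t71914\t0\t+"]
# invented GTF line, for illustration only
gtf_lines = ['chr1\tHAVANA\ttranscript\t1000\t5000\t.\t+\t.\ttranscript_id "ENST_EXAMPLE";']

bed_names = first_column_names(bed_lines)
gtf_names = first_column_names(gtf_lines)
shared = bed_names & gtf_names

print(naming_styles(bed_names), naming_styles(gtf_names), shared)
```

If the shared set is empty, no interval tool can report an overlap, whatever the coordinates are.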
but in the end, it could be something like:
Pierre Lindenbaum, please see the first ten lines for the bed file (from the bam file), and also the gtf file.
I will try the code below.
Please post as text. We can't see data in a screenshot nor can it be used to copy/paste to provide help.
As far as I can see, you're mixing genomic coordinates with transcript coordinates. There are no ENST... chromosomes in your GTF (1st column).
Yes, I thought about that too. I subsequently transformed the GTF file into a BED file, as bedtools intersect can also use a BED file, and I excluded the chromosomal information from my derived BED file, because I only want the reads that align to a specific part of the transcriptome within the GTF file.
Still the same issue.
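Replacing the chromosome column with the ENST ID does not change the start/end values, which are still genomic positions, so they still cannot line up with transcript-relative read positions. One option is to project each GTF feature into transcript coordinates first. Here is a minimal sketch in plain Python, assuming you have already extracted the exon coordinates of each transcript from the GTF; the function names and the toy exons are hypothetical. Note that GTF is 1-based inclusive while BED is 0-based half-open; the sketch uses 0-based half-open throughout.

```python
def genome_to_transcript(exons, strand, g_start, g_end):
    """Project a genomic interval onto transcript coordinates.

    exons  : list of (start, end) genomic exon intervals, 0-based half-open
    strand : '+' or '-' (strand of the transcript on the genome)
    Returns (t_start, t_end), or None if the interval touches no exon.
    """
    # walk the exons in transcript order (5' to 3')
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # transcript position at which the current exon starts
    pieces = []
    for e_start, e_end in ordered:
        lo, hi = max(g_start, e_start), min(g_end, e_end)
        if lo < hi:  # the interval overlaps this exon
            if strand == "+":
                pieces.append((offset + lo - e_start, offset + hi - e_start))
            else:
                pieces.append((offset + e_end - hi, offset + e_end - lo))
        offset += e_end - e_start
    if not pieces:
        return None
    return min(p[0] for p in pieces), max(p[1] for p in pieces)


def contained(read, feature):
    """True if the read interval lies completely inside the feature interval."""
    return feature[0] <= read[0] and read[1] <= feature[1]


# toy transcript with two exons (made-up coordinates)
exons = [(100, 200), (300, 400)]
feature_t = genome_to_transcript(exons, "+", 190, 310)
print(feature_t, contained((92, 105), feature_t))
```

Once both files are in transcript coordinates with the ENST ID in column 1, bedtools intersect or findOverlaps compare like with like.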
transcriptome.bed (from the original BAM file):
GTF file subset for transcripts only:
The GTF file has subsequently been converted into a BED file.
If I do:
and
Uh? According to the BAM, your coordinates MUST be chr, not ENST.
Yes, that is my aim, but the STAR transcriptome file does not have a first column with chromosomal information. The only option would be for me to transform the transcriptome.bam file and add the chromosomal information manually.
Well, I guess I could try that.
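For what it's worth, lifting a read from transcript to genomic coordinates could look roughly like this. It is a minimal sketch in plain Python under several assumptions: the exon coordinates of each transcript have already been pulled from the GTF, intervals are 0-based half-open as in BED, and indels or soft-clips inside the read are ignored. The function name and the toy exons are made up.

```python
def transcript_to_genome(exons, strand, t_start, t_end):
    """Map a transcript interval to one or more genomic blocks.

    exons  : list of (start, end) genomic exon intervals, 0-based half-open
    strand : '+' or '-' (strand of the transcript on the genome)
    Returns a list of (g_start, g_end) blocks in ascending genomic order;
    a read that spans a splice junction yields more than one block.
    """
    # walk the exons in transcript order (5' to 3')
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # transcript position at which the current exon starts
    blocks = []
    for e_start, e_end in ordered:
        length = e_end - e_start
        lo, hi = max(t_start, offset), min(t_end, offset + length)
        if lo < hi:  # part of the read falls in this exon
            if strand == "+":
                blocks.append((e_start + lo - offset, e_start + hi - offset))
            else:
                blocks.append((e_end - (hi - offset), e_end - (lo - offset)))
        offset += length
    return sorted(blocks)


# toy transcript with two exons (made-up coordinates)
exons = [(100, 200), (300, 400)]
spliced = transcript_to_genome(exons, "+", 90, 110)
print(spliced)
```

The chromosome name itself would come from the GTF record of the transcript, and the read's genomic strand is the transcript strand, flipped if the read maps to the minus strand of the transcript.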