Intersecting transcriptome bam file with GTF file
1
0
Entering edit mode
18 months ago
Jalil Sharif ▴ 80

Hello, I aligned artificial rna-seq reads against the genome and transcriptome using STAR. Star also generated a transcriptome.bam file. I want to to have have information of which ENST intersect against a specific part of the GTF, hence I have subset the GTF file.

I have tried bedtools intersect, and also findOverlaps in R.

The findOverlaps did not show any overlaps when running the command as follows:

    gr1 <- GRanges(seqnames = transcriptome_file_overlap$transcript_id, 
                  ranges = IRanges(start = transcriptome_file_overlap$start, end = transcriptome_file_overlap$end),
                  strand = transcriptome_file_overlap$strand)
    gr2 <- GRanges(seqnames = gtf_overlap_subset$transcript_id, 
                  ranges = IRanges(start = gtf_overlap_subset$start, end = gtf_overlap_subset$end),
                  strand = gtf_overlap_subset$strand)
    overlaps1 <- findOverlaps(gr1,gr2)

in R, if I keep the strand information, there are no reads that overlapped, however, the Bam file has more than 18million entries, the GTF file has 25k, and it is unlikely that any did not intersect

Results of findOverlaps

Whereas the bedtools intersect is giving me the following error:

***** WARNING: File transcriptome.bed has inconsistent naming convention for record:
ENST00000359218.10  71889   71914   0   +

The main reason being that the first column in the bam file does not contain chromosome information, which is understandable as it is the transcriptome.

What options are there to find specific reads that have intersected completely with the gtf file?

findoverlaps bedtools transcriptome • 1.9k views
ADD COMMENT
0
Entering edit mode

Whereas the bedtools intersect is giving me the following error

it's not an error, it's a warning. It usually happens when there is a list of chromosomes like 'chr1,chr2,chr3' mixed with chromosome names not starting with 'chr' : 1, 2, 3, GL0012345, etc...

Whereas the bedtools intersect

show us the command line please and show us a few lines of each files.

ADD REPLY
0
Entering edit mode

but in the end, it could be something like:

awk '/^[^#]/ {printf("%s\t%d\t%s\n",$1,int($4)-1,$5);}' in.gtf | sort -t '$\t' -k1,1 -k2,2n | bedtools merge > select.bed

samtools view -L select.bed in.bam
ADD REPLY
0
Entering edit mode

Pierre Lindenbaum, please see the first ten lines for the bed file (from the bam file), and also the gtf file.

I will try the code below.

ADD REPLY
0
Entering edit mode

Please post as text. We can't see data in a screenshot nor can it be used to copy/paste to provide help.

ADD REPLY
0
Entering edit mode

as far as I can see you're mixing genomic coordinates with transcript coordinates. There is no ENST... chromosomes in your GTF (1st column)

ADD REPLY
0
Entering edit mode

Yes, I thought about that too, I subsequently transformed the gtf file into a bed file, as bedtools intersect can also use a bed file. So I excluded the chromosomal information from my derived bed file. As I want to only get the reads that align to a specific aspect of the transcriptome within the gtf file.

and still the same issue.

transcriptome.bed (from the original bam file

ENST00000359218.10 71889 71914 +
ENST00000342175.11 72090 72115 +
ENST00000591111.5 94011 94036 +
ENST00000589042.5 98934 98959 +
ENST00000460472.6 71739 71764 +
ENST00000342992.11 91230 91255 +
ENST00000589434.5 1013 1038 -
ENST00000664161.1 1112 1137 -
ENST00000657210.1 1423 1448 -
ENST00000419746.5 1150 1175 -

gtf file subset for transcripts only.

##gff-version 2
##source-version rtracklayer 1.60.0
##date 2023-05-15
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level "2"; tag "Ensembl_canonical"; transcript_id "ENST00000456328.2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; transcript_support_level "1"; havana_transcript "OTTHUMT00000362751.1"; ID "2"
chr1    HAVANA  transcript  12010   13670   .   +   .   gene_id "ENSG00000223972.6"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level "2"; tag "Ensembl_canonical"; transcript_id "ENST00000450305.2"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_support_level "NA"; havana_transcript "OTTHUMT00000002844.2"; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2"; ont "PGO:0000019"; ID "7"
chr1    HAVANA  transcript  14404   29570   .   -   .   gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level "2"; tag "Ensembl_canonical"; transcript_id "ENST00000488147.1"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-201"; transcript_support_level "NA"; havana_transcript "OTTHUMT00000002839.1"; hgnc_id "HGNC:38034"; havana_gene "OTTHUMG00000000958.1"; ont "PGO:0000005"; ID "15"
chr1    ENSEMBL transcript  17369   17436   .   -   .   gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level "3"; tag "Ensembl_canonical"; transcript_id "ENST00000619216.1"; transcript_type "miRNA"; transcript_name "MIR6859-1-201"; transcript_support_level "NA"; hgnc_id "HGNC:50039"; ID "28"
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level "2"; tag "Ensembl_canonical"; transcript_id "ENST00000473358.1"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-202"; transcript_support_level "5"; havana_transcript "OTTHUMT00000002840.1"; hgnc_id "HGNC:52482"; havana_gene "OTTHUMG00000000959.2"; ID "31"
chr1    HAVANA  transcript  30267   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level "2"; tag "basic"; transcript_id "ENST00000469289.1"; transcript_type "lncRNA"; transcript_name "MIR1302-2HG-201"; transcript_support_level "5"; havana_transcript "OTTHUMT00000002841.2"; hgnc_id "HGNC:52482"; havana_gene "OTTHUMG00000000959.2"; ID "35"
chr1    ENSEMBL transcript  30366   30503   .   +   .   gene_id "ENSG00000284332.1"; gene_type "miRNA"; gene_name "MIR1302-2"; level "3"; tag "Ensembl_canonical"; transcript_id "ENST00000607096.1"; transcript_type "miRNA"; transcript_name "MIR1302-2-201"; transcript_support_level "NA"; hgnc_id "HGNC:35294"; ID "39"

The gtf file has subsequently been converted into a bed file.

ENST00000000233.10 127588410 127591700 +
ENST00000000442.11 64305523 64316743 +
ENST00000001008.6 2794969 2805423 +
ENST00000002125.9 37231657 37249160 +
ENST00000002829.8 50155323 50189075 +
ENST00000003084.11 117480024 117668665 +
ENST00000003912.7 24415802 24472976 +
ENST00000004103.8 150800768 150805118 +
ENST00000004531.14 17538776 17570561 +
ENST00000005257.7 39623571 39708120 +

If I do:

bedtools intersect -a transcriptome.bed -b ./transcript/CD_STAR_transcript.gtf -f 1.0 >| ./transcript/transcript.bed
***** WARNING: File transcriptome.bed has inconsistent naming convention for record:
ENST00000359218.10  71889   71914   0   +

***** WARNING: File transcriptome.bed has inconsistent naming convention for record:
ENST00000359218.10  71889   71914   0   +

and

bedtools intersect -a transcriptome.bed -b ./transcript/transcript_bed_intersect.bed  -f 1.0 >| ./transcript/transcript.bed
Error: unable to open file or unable to determine types for file ./transcript/transcript_bed_intersect.bed

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
  expected columns (e.g., cols 2 and 3 for BED).
ADD REPLY
0
Entering edit mode

The gtf file has subsequently been converted into a bed file.

uh ? according to the BAM, your coordinates MUST be chr ,not ENST .

ADD REPLY
0
Entering edit mode

Yes, that is my aim, but the Star Transcriptome file does not have a first column with chromosomal information. The only option would be for me to transform the transcriptome.bam file and add chromosomal information manually.

Well, I guess I could try that.

ADD REPLY
0
Entering edit mode
18 months ago
rfran010 ★ 1.3k

I don't know if you resolved this or not, but it sounds like you want to know which alignments align to a specific set of genes in the GTF. Using your approaches, it would be better to use the genome-aligned BAM output, not the transcriptome aligned BAM.

You can use the transcriptome bam, but it seems to me you might just want to search for the transcript IDs that you want. Maybe extracting the transcript IDs from the GTF section you want, and then finding which transcriptome bam alignments match.

I don't normally work with transcriptome bams, but I'm sure there's tools that work with this format where the chr is replaced by transcript ID. Maybe as a starting point, look for tools that work with kallisto?

ADD COMMENT

Login before adding your answer.

Traffic: 2728 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6