This can be done offline and wont require too much of computational resource.
What you will need:
- A fast short read aligner such as STAR or even bowtie (STAR is faster)
- Genome sequence (you will have to build index for the genome for your aligner)
- A GTF annotations file (get it from GENCODE or any other standard genome repository for your organism of interest)
Before you align remove redundant reads. Keep their counts if necessary.
Align the reads using any of these aligners and obtain the alignment co-ordinates. The default output is SAM format for STAR and a tabular format for bowtie (bowtie also gives SAM).
- Column-3 of SAM shows the name of the reference sequence where alignment happened (chromosome)
- Column-4 is sequence start
- Column-10 is read sequence. Add the length of this to column 4 value to get stop site.
The columns are tab separated
Now define a window that you define as proximal/nearby (lets say 500nt).
Now all you have to do is find genes that lie ±500nt from your start/stop sites. In your reference GTF parse for the lines that have the feature "gene".
I am giving an example using awk. You can use any programming language you are comfortable with. Check the GTF format also.
Assuming that you made a file (reads.txt) from your SAM output in this format:
Chromosome <tab> Orientation (+/-) <tab> Start <tab> Stop
I am giving an example awk script:
example.awk
#!/bin/gawk
BEGIN{FS=OFS="\t"}
NR==FNR{
a[$1 FS $2][$3 FS $4] # store the co-ordinate information from your reads file
next
}
$3=="gene" && ($4 FS $7) in a{ # quick parse for reference chromosome and orientation
i=$4 FS $7
for(j in a[i]){
split(j,jj,FS)
if(jj[1]>=$4 && jj[2]<=$5)
print $0";contained"
else if($4<=jj[2]+500 || $5>=jj[1]-500)
print $0";partial overlap/proximal"
}
}
call it like this:
awk -f example.awk reads.txt annotations.gtf
NOTE: In the above script I have not considered antisense proximity. If you want to allow that then don't parse for orientation. Also gawk version < 4.0 doesn't allow multidimensional arrays. So install gawk>=4.0
The output is by default a GTF because you are printing selected lines of the reference GTF.
what is the target database that you want to search
Starting with human but ideally including common contaminants (viruses, bacteria) from a human blood sample.