Question

programmatic fast search of 100-1000 short reads to a public server and obtain list of nearby genes

0

Entering edit mode

10.9 years ago

14134125465346445 ★ 3.6k

What are the options for programmatic fast searching of 100-1000 short reads to a public server and obtain list of nearby genes where the reads map?

Input: ~100-1000 short reads

Output: GFF list of genes it maps to or nearby genes

Genome UCSC is restrictive to the number of searches and AFAICS it won't allow programmatic use.

blat • 3.6k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by 14134125465346445 ★ 3.6k

0

Entering edit mode

what is the target database that you want to search

ADD REPLY • link 10.9 years ago by Istvan Albert 103k

0

Entering edit mode

Starting with human but ideally including common contaminants (viruses, bacteria) from a human blood sample.

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by 14134125465346445 ★ 3.6k

Ram · Answer 1 · 2014-10-03

I don't know of any online tool that supports rapid online querying.

For any situation where you want to match reads fast and efficiently the best solution is to download the genome and annotations and run the alignment and matching yourself. It is probably a five line script that people here could easily advise you on.

If that is not an option then your best bet is to use Galaxy and create a workflow that chains together the actions that you need.

Ram · Answer 2 · 2014-10-09

This can be done offline and wont require too much of computational resource.

What you will need:

A fast short read aligner such as STAR or even bowtie (STAR is faster)
Genome sequence (you will have to build index for the genome for your aligner)
A GTF annotations file (get it from GENCODE or any other standard genome repository for your organism of interest)

Before you align remove redundant reads. Keep their counts if necessary.

Align the reads using any of these aligners and obtain the alignment co-ordinates. The default output is SAM format for STAR and a tabular format for bowtie (bowtie also gives SAM).

Column-3 of SAM shows the name of the reference sequence where alignment happened (chromosome)
Column-4 is sequence start
Column-10 is read sequence. Add the length of this to column 4 value to get stop site.

The columns are tab separated

Now define a window that you define as proximal/nearby (lets say 500nt).

Now all you have to do is find genes that lie ±500nt from your start/stop sites. In your reference GTF parse for the lines that have the feature "gene".

I am giving an example using awk. You can use any programming language you are comfortable with. Check the GTF format also.

Assuming that you made a file (reads.txt) from your SAM output in this format:

Chromosome <tab> Orientation (+/-) <tab> Start <tab> Stop

I am giving an example awk script:

example.awk

    #!/bin/gawk

    BEGIN{FS=OFS="\t"}

    NR==FNR{
        a[$1 FS $2][$3 FS $4] # store the co-ordinate information from your reads file
        next
    }

    $3=="gene" && ($4 FS $7) in a{ # quick parse for reference chromosome and orientation

        i=$4 FS $7
        for(j in a[i]){
            split(j,jj,FS)
            if(jj[1]>=$4 && jj[2]<=$5)
                print $0";contained"

            else if($4<=jj[2]+500 || $5>=jj[1]-500)
                print $0";partial overlap/proximal"
            }
    }

call it like this:

awk -f example.awk reads.txt annotations.gtf

NOTE: In the above script I have not considered antisense proximity. If you want to allow that then don't parse for orientation. Also gawk version < 4.0 doesn't allow multidimensional arrays. So install gawk>=4.0

The output is by default a GTF because you are printing selected lines of the reference GTF.

Ram · Answer 3 · 2014-10-17

Answering my own question. Triggered by another user's request, there is now a "bwa shm" command in a dev branch of bwa:

git clone https://github.com/lh3/bwa.git
cd bwa
git checkout -b dev remotes/origin/dev
make

index=/opt/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa
./bwa shm $index # load the index into shared memory
./bwa mem ref.fa reads.fq
./bwa shm -l # list indices in shared memory
./bwa shm -d # unload all indices

The second part of the question can be adapted from another of the answers like this:

/opt/development/tests/bwa/bwa mem \
  /opt/scratch/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa \
  reads.fasta \
  2>/dev/null | \
  samtools view \
  -b \
  -t $index \
  - | \
  bedtools bamtobed -i - | \
  bedtools intersect \
  -b - \
  -a /opt/development/tests/annotations/gencode.v19.annotation.gene.gtf | \
  perl -lne 'print $1 if ($_ =~ /transcript_name\ \"(\S+)\"/)' | \
  sort | \
  uniq -c

      1 ERBB4
      2 PTPN11
      9 RAD50

score 0 · Answer 4 · 2014-10-17

0

Entering edit mode

10.8 years ago

Matt Shirley 10k

Take a look at NCBI SRA-BLAST.

ADD COMMENT • link 10.8 years ago by Matt Shirley 10k