Question

Summarizing the Results in R

8.2 years ago

alizohaib7 ▴ 10

I am doing some annotation work. I have two sets of sequences: set one is my reference database and set two is my query. I ran a local BLAST on an online server, which gives output similar to that shown in the picture. My query set contains about 10 million reads and has been annotated against a large number of reference viruses.

I want to know two things:

1. How many of my query reads match each reference? The output should look something like "X query reads matched virus Y, Z query reads matched virus alpha", and so on. There are a great many reads, and a great many viruses that they matched, so how can I do this in R?
2. Which reads matched each virus?

Could you please provide ready-to-run commands for R?

[Image: screenshot of the BLAST output, not reproduced here]

R blast rna-seq RNA-Seq genome • 1.3k views

score 1 · Answer 1 · 2016-12-21

8.2 years ago

michael.ante ★ 4.0k

The query IDs look like Illumina read IDs. Instead of the BLAST approach, I'd align the reads with Bowtie2, BWA, or BBMap against the combined genomes of the viruses you detected. From that alignment you can compute a lot of statistics.
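If you go the alignment route, one way to get per-virus read counts into R is to read the output of `samtools idxstats`, which prints one tab-separated line per reference: name, sequence length, mapped reads, unmapped reads. This is only a minimal sketch. The reference names and numbers below are invented, and it assumes you have already produced a sorted, indexed BAM file and saved the idxstats output as text.

```r
# Hypothetical idxstats output (invented names and counts). For real data,
# replace the text = argument with the path to your saved idxstats file.
idx <- read.table(
  text = "virusA\t10000\t1500\t0\nvirusB\t8000\t300\t0\n*\t0\t0\t42",
  sep = "\t", header = FALSE, quote = "", comment.char = "",
  col.names = c("reference", "length", "mapped", "unmapped"),
  stringsAsFactors = FALSE
)

# Drop the final "*" line, which only collects reads that did not map anywhere.
idx <- idx[idx$reference != "*", ]

# Mapped reads per kilobase, so genomes of different lengths can be compared.
idx$mapped_per_kb <- idx$mapped / (idx$length / 1000)

# Show the references with the most mapped reads first.
print(idx[order(idx$mapped, decreasing = TRUE),
          c("reference", "mapped", "mapped_per_kb")])
```

The per-kilobase column is an optional extra; the raw `mapped` column already answers "how many reads per virus".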

Yeah, or do a de novo assembly first (before BLAST). Since it's viral, that is not too demanding for most computers.

(reply by Benn 8.4k, 8.2 years ago)

score 0 · Answer 2 · 2016-12-21

8.2 years ago

Benn 8.4k

If you import your text file into R:

df <- read.table("file.txt", sep = "\t", header = TRUE)

you can summarize it with the table function, which counts how many rows (hits) there are for each value in column 2:

table(df[, 2])
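A slightly fuller sketch covering both questions, assuming the file is standard tab-separated BLAST output without a header line, so that column 1 is the query read ID and column 2 is the reference ID. The column names and the toy data are made up for illustration. Because one read can hit the same virus several times, the sketch counts distinct reads rather than rows.

```r
# Toy stand-in for the BLAST table (hypothetical IDs). For real data, replace
# the text = argument with your file name, e.g. "file.txt".
hits <- read.table(
  text = "read1\tvirusA\nread2\tvirusA\nread2\tvirusB\nread3\tvirusB\nread3\tvirusB",
  sep = "\t", header = FALSE, quote = "", comment.char = "",
  col.names = c("read", "virus"),
  stringsAsFactors = FALSE
)

# Keep one row per read/virus pair, so repeated hits of the same read
# against the same virus are not counted twice.
read_virus <- unique(hits[, c("read", "virus")])

# Question 1: number of distinct reads matching each virus, largest first.
reads_per_virus <- sort(table(read_virus$virus), decreasing = TRUE)
print(reads_per_virus)

# Question 2: which reads matched each virus (a named list of read IDs).
reads_by_virus <- split(read_virus$read, read_virus$virus)
print(reads_by_virus[["virusA"]])
```

To save the counts, `write.table(as.data.frame(reads_per_virus), "counts.txt", sep = "\t", quote = FALSE, row.names = FALSE)` writes them as a two-column table (the output file name is just an example). For a table with millions of rows, `data.table::fread` is typically much faster than `read.table`.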
