Hi,
I downloaded the Repetitive DNA file (rmsk) for human from UCSC website and I want to split this file according to class and family to get some basic statistics using R.
Using the UCSC public MySQL server (I used mouse, mm10, here; use hg19.simpleRepeat or hg19.nestedRepeat for human):
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D mm10 -e 'select repClass,repFamily,count(*) from rmsk group by 1,2'
+----------------+----------------+----------+
| repClass | repFamily | count(*) |
+----------------+----------------+----------+
| DNA | DNA | 1001 |
| DNA | hAT | 1949 |
| DNA | hAT-Blackjack | 4564 |
| DNA | hAT-Charlie | 105698 |
| DNA | hAT-Tip100 | 9331 |
| DNA | hAT-Tip100? | 105 |
| DNA | hAT? | 585 |
| DNA | MuDR | 153 |
| DNA | MULE-MuDR | 583 |
| DNA | PiggyBac | 209 |
| DNA | PiggyBac? | 141 |
| DNA | TcMar | 52 |
| DNA | TcMar-Mariner | 1079 |
| DNA | TcMar-Pogo | 21 |
| DNA | TcMar-Tc2 | 1786 |
| DNA | TcMar-Tigger | 35118 |
| DNA | TcMar? | 702 |
| DNA? | DNA? | 1027 |
| LINE | CR1 | 14155 |
| LINE | Dong-R4 | 138 |
| LINE | L1 | 905176 |
| LINE | L1? | 52 |
| LINE | L2 | 67909 |
| LINE | RTE-BovB | 260 |
| LINE | RTE-X | 1703 |
| LINE? | Penelope? | 42 |
| Low_complexity | Low_complexity | 386539 |
| LTR | ERV1 | 71980 |
| LTR | ERV1? | 115 |
| LTR | ERVK | 319317 |
| LTR | ERVK? | 4185 |
| LTR | ERVL | 118061 |
| LTR | ERVL-MaLR | 454918 |
| LTR | ERVL? | 520 |
| LTR | Gypsy | 1859 |
| LTR | Gypsy? | 819 |
| LTR | LTR | 819 |
| LTR? | LTR? | 941 |
| Other | Other | 19450 |
| RC | Helitron | 345 |
| RC? | Helitron? | 74 |
| RNA | RNA | 691 |
| rRNA | rRNA | 1564 |
| Satellite | centr | 4 |
| Satellite | Satellite | 36865 |
| scRNA | scRNA | 8332 |
| Simple_repeat | Simple_repeat | 1015643 |
| SINE | Alu | 574557 |
| SINE | B2 | 372923 |
| SINE | B4 | 397726 |
| SINE | Deu | 1702 |
| SINE | ID | 64047 |
| SINE | MIR | 120436 |
| SINE | tRNA | 1618 |
| SINE? | SINE? | 274 |
| snRNA | snRNA | 3007 |
| srpRNA | srpRNA | 437 |
| tRNA | tRNA | 4769 |
| Unknown | Unknown | 6791 |
| Unknown | Y-chromosome | 2869 |
+----------------+----------------+----------+
If you really want to use R (the MySQL query is actually faster), then you just want the split() function.
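If you would rather work from the downloaded file than from the MySQL server, the same class/family grouping can be sketched in a few lines of Python. This is only a rough sketch, not the official way to do it: it assumes the file is the tab-separated rmsk.txt(.gz) dump from UCSC, with repClass in the 12th column and repFamily in the 13th, and the function names are made up for illustration. If your file came from the Table Browser with different columns, adjust the two indexes.

```python
import gzip
from collections import Counter

# Assumed 0-based column positions in the UCSC rmsk dump (17 columns):
# ..., repName (10), repClass (11), repFamily (12), ...
REP_CLASS_COL = 11
REP_FAMILY_COL = 12


def count_class_family(lines):
    """Count rows per (repClass, repFamily) pair from tab-separated rmsk lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and a Table Browser style header line
        fields = line.split("\t")
        counts[(fields[REP_CLASS_COL], fields[REP_FAMILY_COL])] += 1
    return counts


def count_class_family_in_file(path):
    """Open a plain or gzipped rmsk file and count its class/family pairs."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_class_family(handle)
```

Calling count_class_family_in_file("rmsk.txt.gz") (the path is a placeholder for wherever you saved the file) returns a Counter keyed by (repClass, repFamily), which should hold the same kind of counts as the GROUP BY query above.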
What have you tried and are you using the .out files or something from the table browser?
Yes. I downloaded this file from the following link
What kind of statistics? Do you need the DNA sequences?
I want to get some basic statistics, like frequencies for each family and class, so that I can compare this file with mouse and see whether there is any relation between them in terms of families and classes.
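For that kind of comparison, raw counts are hard to read side by side because the two genomes have different total numbers of annotated repeats, so it may help to turn them into fractions first. A minimal sketch, assuming you already have one dictionary of counts per genome keyed by (repClass, repFamily), for example taken from the output of the MySQL query; the function names are hypothetical:

```python
def to_fractions(counts):
    """Convert raw counts to fractions of the genome's total repeat count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def compare_frequencies(counts_a, counts_b):
    """Return (key, fraction_in_a, fraction_in_b) rows for every key seen in either genome."""
    frac_a = to_fractions(counts_a)
    frac_b = to_fractions(counts_b)
    keys = sorted(set(frac_a) | set(frac_b))
    # A family missing from one genome is reported with fraction 0.0.
    return [(key, frac_a.get(key, 0.0), frac_b.get(key, 0.0)) for key in keys]
```

Families that show up in only one genome get 0.0 for the other, which is worth checking by eye since some of them are lineage-specific rather than missing annotation.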