Home
The goal of Trommel is "painless data profiling in Hadoop". Trommel is specifically aimed at data miners who would like to profile data stored in Hadoop, but aren't interested in learning the ins and outs of doing so using tools/languages like Hive, Pig, or Java MapReduce. To achieve this goal Trommel implements a "mini Domain-Specific Language (DSL)" called TrommelScript for data profiling tasks. Trommel has been built for and tested with Cloudera's cdh3u3 Distribution (i.e., Hadoop 0.20.2) due to its popularity and extensive coverage in published books.
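To make the contrast concrete, the sketch below shows roughly what a single, very simple profiling task (finding the maximum of one numeric column in a tab-delimited file) looks like when hand-written as Java MapReduce against the Hadoop 0.20.2 API that Trommel targets. This is illustrative code only, not part of Trommel; the class name, field index, and paths are placeholders, and the TrommelScript Functional Specification describes how Trommel expresses tasks like this without any of this boilerplate.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hand-rolled MapReduce job that profiles a single numeric column
// (its maximum value) in a tab-delimited file. The field index is
// hard-coded for brevity; input/output paths come from the command line.
public class MaxValueProfile {

    public static class MaxMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        private static final int FIELD_INDEX = 2; // column to profile
        private final Text outKey = new Text("field2.max");
        private final DoubleWritable outValue = new DoubleWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length <= FIELD_INDEX) {
                return; // skip malformed rows
            }
            try {
                outValue.set(Double.parseDouble(fields[FIELD_INDEX]));
                context.write(outKey, outValue);
            } catch (NumberFormatException e) {
                // skip non-numeric values
            }
        }
    }

    public static class MaxReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double max = Double.NEGATIVE_INFINITY;
            for (DoubleWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new DoubleWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "max value profile");
        job.setJarByClass(MaxValueProfile.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class); // max is associative, so the reducer doubles as a combiner
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

And that is a single statistic for a single field; profiling every field of a data set this way means writing, packaging, and submitting many jobs like it, which is exactly the work Trommel is meant to take off the data miner's plate.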
Trommel was originally developed by Dave Langer as a Capstone Project for his Master's in Computer Science degree. Trommel was inspired by Dave's obsession with Data Mining/Science, Big Data, and the book "Data Preparation for Data Mining" by Dorian Pyle. Trommel is open source under the Apache License v2.0.
A core tenet of Trommel is doing everything possible to help users and developers become productive quickly. To that end, please take a look at the following resources for getting up to speed on Trommel (please note that some of the links below are hosted by Dave Langer).
- Check out the Trommel video introduction on YouTube.
- The Trommel Functional Specification is the best place to start for users and developers to get an overview of Trommel.
- If for some reason you're not a Spec fanboy like Dave is, and you just want to see some code, check out Getting Started with Trommel in Eclipse.
- If you need to start profiling data in Hadoop ASAP, then take a look at Installing Trommel v1.0.0.
- If you're wondering why you might want to learn Trommel, take a look at Comparing Hive and Trommel.