EasyD edited this page Nov 21, 2012 · 16 revisions

Welcome to the Trommel wiki!

The goal of Trommel is "painless data profiling in Hadoop". Trommel is aimed specifically at data miners who would like to profile data stored in Hadoop but aren't interested in learning the ins and outs of doing so with tools and languages such as Hive, Pig, or Java MapReduce. To achieve this goal, Trommel implements a "mini Domain-Specific Language (DSL)" called TrommelScript for data profiling tasks. Trommel has been built for and tested with Cloudera's cdh3u3 distribution (i.e., Hadoop 0.20.2) due to its popularity and extensive coverage in published books.

Trommel was originally developed by Dave Langer as a capstone project for his Master's degree in Computer Science. Trommel was inspired by Dave's obsession with data mining, data science, and Big Data, and by the book "Data Preparation for Data Mining" by Dorian Pyle. Trommel is open source under the Apache License v2.0.


Getting Started

A core tenet of Trommel is to make its users and developers productive as quickly as possible. To this end, please take a look at the following resources for getting up to speed on Trommel (note that some of the links below are hosted by Dave Langer).


Current Release - v1.0.0
