
Comparing Hive and Trommel


Introduction

The Hadoop ecosystem offers a number of options for working with data stored in HDFS, and Hive is one of the most popular. With its SQL-like query language, it isn't hard to see why Hive has become one of the most prevalent workloads run on Hadoop clusters. As with traditional RDBMS SQL dialects, Hive Query Language (HQL) provides a number of facilities for profiling data. Comparing data profiling in HQL against data profiling in TrommelScript is therefore a good test of whether Trommel's Domain-Specific Language (DSL) approach offers a demonstrable benefit over existing Hadoop technologies.


Comparing HQL and TrommelScript

As with other pages in this Wiki, we'll work with the famous Iris machine learning data set, profiling it to ascertain its suitability for mining. To keep things simple (and to avoid the appearance of bias ;-), we'll limit the comparison to the numeric fields and to the Trommel functions that map to out-of-the-box HQL facilities.
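
For reference, the data file used below ('/TrommelData/IrisData.txt') is plain comma-delimited text with one record per line. The first few records look roughly like this (standard Iris data set ordering: sepal length, sepal width, petal length, petal width, class):

 5.1,3.5,1.4,0.2,Iris-setosa
 4.9,3.0,1.4,0.2,Iris-setosa
 4.7,3.2,1.3,0.2,Iris-setosa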

The following HQL profiles all four numeric fields in the Iris data set and stores the results in a Hive table: max/min values, counts of distinct/empty values, and variability as measured by the sample standard deviation:

 CREATE EXTERNAL TABLE IF NOT EXISTS Iris (SepalLength DOUBLE, SepalWidth DOUBLE, PetalLength DOUBLE, 
                                           PetalWidth DOUBLE, Class STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/TrommelData';

 DROP TABLE IF EXISTS IrisProfile;
 CREATE TABLE IrisProfile AS
 SELECT * 
 FROM (SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
  FROM (SELECT "SepalLength" AS Field, Max(SepalLength) AS Maximum, Min(SepalLength) AS Minimum, 
               COUNT(DISTINCT SepalLength) AS DistinctValues, stddev_samp(SepalLength) AS Variance FROM Iris) Iris1
       JOIN (SELECT "SepalLength" AS Field, COUNT(1) AS Empty FROM Iris WHERE SepalLength IS NULL) Iris2 
            ON (Iris1.Field = Iris2.Field)
  UNION ALL
  SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
  FROM (SELECT "SepalWidth" AS Field, Max(SepalWidth) AS Maximum, Min(SepalWidth) AS Minimum, 
               COUNT(DISTINCT SepalWidth) AS DistinctValues, stddev_samp(SepalWidth) AS Variance FROM Iris) Iris1
       JOIN (SELECT "SepalWidth" AS Field, COUNT(1) AS Empty FROM Iris WHERE SepalWidth IS NULL) Iris2 
            ON (Iris1.Field = Iris2.Field)
  UNION ALL
  SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
  FROM (SELECT "PetalLength" AS Field, Max(PetalLength) AS Maximum, Min(PetalLength) AS Minimum, 
               COUNT(DISTINCT PetalLength) AS DistinctValues, stddev_samp(PetalLength) AS Variance FROM Iris) Iris1
       JOIN (SELECT "PetalLength" AS Field, COUNT(1) AS Empty FROM Iris WHERE PetalLength IS NULL) Iris2 
            ON (Iris1.Field = Iris2.Field)
  UNION ALL
  SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
  FROM (SELECT "PetalWidth" AS Field, Max(PetalWidth) AS Maximum, Min(PetalWidth) AS Minimum, 
               COUNT(DISTINCT PetalWidth) AS DistinctValues, stddev_samp(PetalWidth) AS Variance FROM Iris) Iris1
       JOIN (SELECT "PetalWidth" AS Field, COUNT(1) AS Empty FROM Iris WHERE PetalWidth IS NULL) Iris2 
            ON (Iris1.Field = Iris2.Field)
 ) IrisAll;
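
Once the job completes, the resulting profile can be pulled back with an ordinary query, for example:

 SELECT * FROM IrisProfile;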

The following screenshot shows the HQL above being run, and the results selected, on a MacBook Pro running Hadoop in pseudo-distributed mode:

[Screenshot: Hive job]

By way of comparison, the following is the equivalent TrommelScript:

 LOAD DATA '/TrommelData/IrisData.txt' AS (SepalLength : NUMERIC, SepalWidth : NUMERIC, 
                                          PetalLength : NUMERIC, PetalWidth : NUMERIC,
                                          Class : CATEGORICAL)
 FIELDS TERMINATED BY ',';

 PROFILE PetalLength, PetalWidth, SepalLength, SepalWidth 
 WITH MAX, MIN, DISTINCT, VAR, EMPTY
 STORE INTO '/TrommelOutput/IrisProfile';

And the following screenshot shows the TrommelScript above being run on the same MacBook Pro:

[Screenshot: Trommel job]


Conclusions

  1. While the HQL isn't particularly difficult for anyone familiar with SQL, it is still far more verbose and error-prone than the equivalent TrommelScript.
  2. Trommel offers better performance than Hive on this test (roughly 20 seconds for Trommel vs. 170 seconds for Hive), because Trommel runs a single MapReduce job where Hive runs 14. As one might imagine, the disparity would only grow with larger data sets.
  3. It's entirely possible that the HQL could be optimized further, but the time spent researching and tuning it would likely exceed the time needed to simply learn TrommelScript.
  4. A true apples-to-apples comparison of HQL with TrommelScript would require Python or Java code in addition to HQL for Trommel functions with no out-of-the-box HQL equivalent (e.g., interstitial linearity), further inflating the amount of code and complexity in a Hive-based solution; a sketch of what that typically looks like follows this list.
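
To give a feel for point 4, here is a minimal sketch of how a metric with no built-in HQL function is usually bolted onto Hive: streaming rows through an external script with TRANSFORM. The script name (linearity.py) and its output columns are hypothetical illustrations, not part of Trommel or of Hive itself:

 -- Hypothetical example: push a column through an external Python script using
 -- Hive's TRANSFORM streaming mechanism. linearity.py is assumed to read
 -- tab-separated rows from stdin and write "field<TAB>value" rows to stdout.
 ADD FILE linearity.py;

 SELECT TRANSFORM (PetalLength)
        USING 'python linearity.py'
        AS (Field STRING, Linearity DOUBLE)
 FROM Iris;

Even this small sketch pulls a second language and a separate script deployment into the workflow, which is exactly the complexity the comparison above leaves out.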

Bottom Line

As one would expect, TrommelScript is far more efficient at tackling this problem domain than a general-purpose query language like HQL. Combining DSLs like TrommelScript with Hadoop offers a compelling value proposition for further democratization of Big Data technology.
