Comparing Hive and Trommel
The Hadoop ecosystem offers a number of options for working with data stored in HDFS, with Hive being one of the most popular. Given Hive's SQL-like language, it isn't hard to understand why Hive has become one of the most prevalent workloads run on Hadoop clusters. As with traditional RDBMS SQL dialects, Hive Query Language (HQL) provides a number of facilities for profiling data. That makes comparing data profiling in HQL against data profiling in TrommelScript an excellent test of whether Trommel's Domain-Specific Language (DSL) approach offers demonstrable benefit over existing Hadoop technologies.
As with other pages in this Wiki, we'll work with the famous Iris machine learning data set, profiling it to ascertain its suitability for mining. To keep things simple (and to avoid the appearance of bias ;-), we'll limit the comparison to the numeric fields and to the Trommel functions that map to out-of-the-box HQL facilities.
The following HQL profiles all four numeric fields in the Iris data set (max/min values, counts of distinct and empty values, and variability as measured by the sample standard deviation) and stores the results in a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS Iris (SepalLength DOUBLE, SepalWidth DOUBLE, PetalLength DOUBLE,
PetalWidth DOUBLE, Class STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/TrommelData';
DROP TABLE IF EXISTS IrisProfile;
CREATE TABLE IrisProfile AS
SELECT *
FROM (SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
FROM (SELECT "SepalLength" AS Field, Max(SepalLength) AS Maximum, Min(SepalLength) AS Minimum,
COUNT(DISTINCT SepalLength) AS DistinctValues, stddev_samp(SepalLength) AS Variance FROM Iris) Iris1
JOIN (SELECT "SepalLength" AS Field, COUNT(1) AS Empty FROM Iris WHERE SepalLength IS NULL) Iris2
ON (Iris1.Field = Iris2.Field)
UNION ALL
SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
FROM (SELECT "SepalWidth" AS Field, Max(SepalWidth) AS Maximum, Min(SepalWidth) AS Minimum,
COUNT(DISTINCT SepalWidth) AS DistinctValues, stddev_samp(SepalWidth) AS Variance FROM Iris) Iris1
JOIN (SELECT "SepalWidth" AS Field, COUNT(1) AS Empty FROM Iris WHERE SepalWidth IS NULL) Iris2
ON (Iris1.Field = Iris2.Field)
UNION ALL
SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
FROM (SELECT "PetalLength" AS Field, Max(PetalLength) AS Maximum, Min(PetalLength) AS Minimum,
COUNT(DISTINCT PetalLength) AS DistinctValues, stddev_samp(PetalLength) AS Variance FROM Iris) Iris1
JOIN (SELECT "PetalLength" AS Field, COUNT(1) AS Empty FROM Iris WHERE PetalLength IS NULL) Iris2
ON (Iris1.Field = Iris2.Field)
UNION ALL
SELECT Iris1.Field, Iris1.Maximum, Iris1.Minimum, Iris1.DistinctValues, Iris1.Variance, Iris2.Empty
FROM (SELECT "PetalWidth" AS Field, Max(PetalWidth) AS Maximum, Min(PetalWidth) AS Minimum,
COUNT(DISTINCT PetalWidth) AS DistinctValues, stddev_samp(PetalWidth) AS Variance FROM Iris) Iris1
JOIN (SELECT "PetalWidth" AS Field, COUNT(1) AS Empty FROM Iris WHERE PetalWidth IS NULL) Iris2
ON (Iris1.Field = Iris2.Field)
) IrisAll;
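Once the profile table is built, its contents can be inspected with a simple query along the lines of the one below (a minimal example; the column names simply follow the aliases used in the CREATE TABLE ... AS statement above):
SELECT Field, Maximum, Minimum, DistinctValues, Variance, Empty
FROM IrisProfile
ORDER BY Field;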
The following screenshot shows the HQL above being run and the resulting profile data being selected on a MacBook Pro running Hadoop in pseudo-distributed mode:
By way of comparison, the following is the equivalent TrommelScript:
LOAD DATA '/TrommelData/IrisData.txt' AS (SepalLength : NUMERIC, SepalWidth : NUMERIC,
PetalLength : NUMERIC, PetalWidth : NUMERIC,
Class : CATEGORICAL)
FIELDS TERMINATED BY ',';
PROFILE PetalLength, PetalWidth, SepalLength, SepalWidth
WITH MAX, MIN, DISTINCT, VAR, EMPTY
STORE INTO '/TrommelOutput/IrisProfile';
And the following screenshot shows the TrommelScript above being run on the same MacBook Pro:
- While the HQL isn't particularly difficult for anyone familiar with SQL in general, it is still far more verbose and error-prone than the equivalent TrommelScript.
- Trommel offers better performance than Hive (roughly 20 seconds vs. 170 seconds in this test), since Trommel runs a single MapReduce job vs. 14 MapReduce jobs for Hive. As one might imagine, the performance disparity would only widen with larger data sets.
- It's entirely possible that the HQL could be optimized further, but the time spent researching and tuning the query would likely exceed the time needed to simply learn TrommelScript.
- A true apples-to-apples comparison of HQL with TrommelScript would require the use of Python or Java in addition to HQL for any Trommel function that has no out-of-the-box HQL equivalent (e.g., interstitial linearity), thereby exploding the amount of code and complexity in a Hive-based solution; a rough sketch of what that would involve follows this list.
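For illustration only, here is a minimal sketch of what such a hybrid approach might look like in Hive: a metric with no built-in HQL function would be computed by streaming rows through an external script via Hive's TRANSFORM clause. The script name custom_metric.py, its location, and its output columns are hypothetical placeholders (not part of either project), and the Python itself is omitted entirely.
-- Hypothetical sketch only: custom_metric.py and its output schema are placeholders.
ADD FILE /tmp/custom_metric.py;

SELECT TRANSFORM (SepalLength, SepalWidth, PetalLength, PetalWidth)
       USING 'python custom_metric.py'
       AS (Field STRING, MetricValue DOUBLE)
FROM Iris;
Even this stub hides all of the real work inside the external script, which is exactly the kind of code and complexity explosion the point above alludes to.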
As one would expect, a purpose-built DSL like TrommelScript tackles this problem domain far more efficiently than a general-purpose language like HQL. Combining DSLs like TrommelScript with Hadoop offers a compelling value proposition for further democratizing Big Data technology.