Hadoop Hive

What is and Why Hive?

What is Hive?

  • Provides a data warehousing solution built on top of Hadoop
    • Facilitates querying and managing large data sets residing in Hadoop storage (HDFS or HBase)
  • Comes with an SQL-like language called HiveQL
    • SQL has a huge developer base
    • Leverages the capabilities of Hadoop (querying, analyzing, and summarizing large amounts of data)
  • Allows you to project structure onto data in many formats
    • Can handle unstructured data
  • Open source Apache project
    • Started at Facebook

Hive is NOT designed for

  • Hive is not for real-time queries
    • Hive is still a batch system: Hadoop jobs incur substantial overhead in job submission and scheduling, which makes them unsuitable for real-time queries.
    • As a result, Hive query latency is generally very high, even when the data set involved is only a few hundred megabytes.
    • Hive is designed for scalability and ease of use.
  • Hive is not for Online Transaction Processing (which requires real-time and frequent write operations)
    • Hive is better suited to Online Analytical Processing (in batch mode), which mostly requires read operations
    • Hive does not support UPDATE/DELETE/INSERT of a single row

Example Hive Applications

  • Log processing
  • Text mining
  • Document indexing
  • Customer-facing business intelligence
    • Example: Google Analytics
  • Predictive modeling
  • Hypothesis testing

 

Hive VS RDBMS

  • Similarities
    • Query language: HiveQL (SQL-like) in Hive, SQL in an RDBMS
    • Similar data model – database, table, view
  • Differences
    • Hive: designed for OLAP; batch-oriented, high-latency operations; schema on read
    • RDBMS: designed for OLTP; real-time, low-latency operations; schema on write

 

Hive VS PIG

  • Similarities
    • Both are batch oriented
    • HiveQL statements and Pig Latin statements are both translated into MapReduce jobs
    • Both target non-Java developers (Hive specifically targets SQL developers)
  • Differences
    • HiveQL (like SQL) is a declarative language that produces a single result set
    • Pig Latin is a procedural data flow language with an input and an output for each step

 

Hive Architecture

Hive Internals

  • Hive translates HiveQL statements into a series of MapReduce jobs, which are executed on the Hadoop cluster
  • Abstracted notions of a traditional RDBMS are provided over Hadoop data sets through a “metadata” store
    • Database – namespace containing a set of tables
    • Tables
    • Schemas
    • Indexing
    • View
  • Hive specific data models are also supported
    • Partitions
    • Buckets

  • Hive Interface options
    • Command Line Interface (CLI)
    • Web interface
    • Thrift Server – any JDBC or ODBC client application can access Hive
  • The “metadata” store is maintained by default in an embedded Derby database (MySQL and other RDBMSs are also supported)
    • Table definitions
    • Mapping to HDFS

 

HiveQL

  • Types of HiveQL operations
    • DDL operations
    • DML operations
    • SQL operations
  • How to execute HiveQL
    • Use Hive shell – just type “hive”
    • Run a Hive script file – type “hive -f <script-file>”

HiveQL – DDL operations

  • Create a table

  • Show tables

  • Describe a table

  • Drop a table
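
For illustration, a minimal sketch of each DDL operation, assuming a hypothetical students table with id and name columns (the column layout and delimiter are assumptions, not taken from the original slides):

    -- Create a table
    CREATE TABLE students (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Show tables
    SHOW TABLES;

    -- Describe a table
    DESCRIBE students;

    -- Drop a table
    DROP TABLE students;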

 

HiveQL – DML operations

  • Loading flat files into Hive

  • No verification of incoming data
    • If the loaded data does not comply with the schema specified for the table, the non-conforming values are set to NULL
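
A minimal sketch of loading a flat file into the students table sketched earlier (the file path is a placeholder):

    -- Load a local tab-delimited file; Hive does not validate the rows against the schema
    LOAD DATA LOCAL INPATH '/tmp/students.txt' INTO TABLE students;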

HiveQL – SQL operations
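
As a sketch of typical SQL-style queries, again using the hypothetical students table:

    -- Filter rows
    SELECT name FROM students WHERE id > 100;

    -- Aggregate with GROUP BY
    SELECT name, COUNT(*) FROM students GROUP BY name;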

HiveQL – Built-in Functions

  • HiveQL comes with a bunch of built-in functions
  • hive> SHOW FUNCTIONS;
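
For example, a sketch using the standard upper built-in against the hypothetical students table:

    hive> DESCRIBE FUNCTION upper;
    hive> SELECT upper(name) FROM students;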

Physical Layout

  • Hive warehouse directory in HDFS
    • /user/hive/warehouse (default) – the default can be changed via hive.metastore.warehouse.dir
  • Hive Table is represented as a directory under Hive warehouse directory
    • /user/hive/warehouse/students directory – represents “students” Hive table
    • /user/hive/warehouse/book directory – represents “book” Hive table
    • Actual data is stored in flat files under the directory

Example: Physical Layout

  • For example, a warehouse containing 5 tables would show 5 directories, one per table (see the sketch below)
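
A minimal sketch of listing the warehouse directory from the Hive shell (the Hive CLI passes dfs commands straight through to HDFS):

    hive> dfs -ls /user/hive/warehouse;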

 

Loading Data into Hive Tables

Multiple Schemes of Loading Data

#1: Use Hive LOAD DATA command (Most common: our focus here)

  • To load data from flat files
  • Data files can be loaded from Local or HDFS (default)
  • Data files can be loaded Internal (default) vs EXTERNAL

#2: Manually copying data files into the “/user/hive/warehouse/<table-directory>” Hive directory

  • Hive automatically recognizes them

#3: Use Hive INSERT command

  • To load data from another Hive table using SELECT (it is not for inserting individual new records)

#4: Use Apache Sqoop to move RDBMS data to Hadoop

#5: Use Apache Flume to move large data sets to Hadoop

 

Loading data from Local vs HDFS

  • Load data from local file system

 

  • Load data from HDFS (default)
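
A minimal sketch of both variants, using the hypothetical students table (the paths are placeholders):

    -- From the local file system: the file is copied into the table's warehouse directory
    LOAD DATA LOCAL INPATH '/tmp/students.txt' INTO TABLE students;

    -- From HDFS (default): the file is moved into the table's warehouse directory
    LOAD DATA INPATH '/user/hadoop/students.txt' INTO TABLE students;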

Loading data Internal vs EXTERNAL

  • Load data Internally
    • Data is moved into Hive storage (/user/hive/warehouse)
    • Dropping the Hive table deletes the data as well
  • Load data with EXTERNAL … LOCATION
    • Uses data that already resides in HDFS
    • Hive only keeps pointers to the existing data in HDFS when the table is created
    • Dropping the Hive table does not delete the data in HDFS
  • Load data Internal

  • Load data EXTERNAL (done at CREATE time, no LOAD required)
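
A minimal sketch of the two approaches (the books_ext table, its columns, and the /data/books location are assumptions for illustration):

    -- Internal (managed): data is moved under /user/hive/warehouse;
    -- dropping the table also deletes the data
    LOAD DATA INPATH '/user/hadoop/students.txt' INTO TABLE students;

    -- External: Hive only points at existing data in HDFS;
    -- dropping the table leaves the data in place
    CREATE EXTERNAL TABLE books_ext (title STRING, author STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/books';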

 

Load Data from Existing Table using INSERT … SELECT

  • Use it when you want to load data from existing table or tables
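
A minimal sketch, assuming a hypothetical students_2012 table already exists with the same columns as students:

    -- Overwrite students_2012 with rows selected from an existing table
    INSERT OVERWRITE TABLE students_2012
    SELECT id, name FROM students WHERE id > 100;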

 

Hive Partitions

What is and Why Partitions?

  • Data within a table is split into multiple partitions
    • To improve query performance
  • Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS
  • When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O required by the query
    • For example, you might want to partition weather data by year – when a query touches only one year, only that partition needs to be accessed

Partitioning Data Mechanism

  • One or more partition columns may be specified
  • Creates a sub-directory for each value of the partition column
  • Queries with the partition columns in WHERE clause will scan through only a subset of the data

Creating Partitions

  • Specify the partition column
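
A minimal sketch, assuming a hypothetical weather table partitioned by year (the columns are illustrative):

    CREATE TABLE weather (station STRING, temperature DOUBLE)
    PARTITIONED BY (year STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';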

Load data into Partitions

  • Each partition is loaded with data specific for the partition column
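
Continuing the weather sketch above (the file path is a placeholder):

    LOAD DATA LOCAL INPATH '/tmp/weather_2012.txt'
    INTO TABLE weather
    PARTITION (year = '2012');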

Querying Partitions

  • By specifying the where condition with the partition, only that specific partition will be queried
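
Continuing the same sketch, a query that filters on the partition column reads only that partition's sub-directory:

    SELECT station, temperature
    FROM weather
    WHERE year = '2012';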

 

Hive Joins

JOIN

  • Hive supports
    • Inner Join (default)
    • Outer Join – Left Outer, Right Outer, Full Outer
  • Joins can be written in two ways
    • Using SELECT … INNER JOIN …, SELECT … LEFT OUTER JOIN …, etc.
    • Creating a joined table from the result of a join query (e.g., CREATE TABLE … AS SELECT … JOIN …)
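
A minimal sketch, assuming a hypothetical enrollments table whose student_id column joins to students.id:

    -- Inner join (the default)
    SELECT s.name, e.course
    FROM students s JOIN enrollments e ON (s.id = e.student_id);

    -- Left outer join: also keep students with no enrollments
    SELECT s.name, e.course
    FROM students s LEFT OUTER JOIN enrollments e ON (s.id = e.student_id);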

What is and Why Buckets?

  • A mechanism to query and examine a random sample of the data
  • Breaks data into a set of buckets based on a hash function of a “bucket column” and allows execution of queries on a subset of random data
  • Bucketing must be configured
    • hive> SET hive.enforce.bucketing = true;
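
A minimal sketch, bucketing the hypothetical students table by id into 4 buckets (the bucket count is arbitrary):

    -- Declare a bucketed table
    CREATE TABLE students_bucketed (id INT, name STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS;

    -- Populate it (with hive.enforce.bucketing enabled, as noted above)
    INSERT OVERWRITE TABLE students_bucketed
    SELECT id, name FROM students;

    -- Query only the first of the 4 buckets as a sample
    SELECT * FROM students_bucketed
    TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);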