What is Apache Hive? : Understanding Hive

7 January, 2021

By Bit2Me

In this video, you will get a quick overview of Apache Hive, one of the most popular data warehouse components in the big data landscape. It is mainly used to give the Hadoop file system a SQL-like query interface.
Hive was originally developed by Facebook and is now maintained as Apache Hive by the Apache Software Foundation. It is also used and developed by large companies such as Netflix and Amazon.

Why was Hive developed?
=====================
The Hadoop ecosystem is not just scalable but also cost-effective when it comes to processing large volumes of data. It is also a fairly new framework that packs a lot of punch. However, organizations with traditional data warehouses are built around SQL, and their users and developers rely on SQL queries to extract data.

For them, getting used to the Hadoop ecosystem is an uphill task, and that is exactly why Hive was developed.

Hive provides a SQL-like dialect, so that users can write SQL-like queries in HQL (Hive Query Language) to extract data from Hadoop. Hive converts these SQL-like queries into MapReduce jobs, and that is how it talks to the Hadoop ecosystem and the HDFS file system.
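
To make the idea of turning a query into a job more concrete, here is a minimal Python sketch of how a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could be broken into map, shuffle and reduce steps. The table, the column names and the sample rows are made up for illustration, and this is only a conceptual model, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an "employees" table on HDFS.
ROWS = [
    {"name": "asha", "dept": "sales"},
    {"name": "ben", "dept": "engineering"},
    {"name": "chen", "dept": "sales"},
    {"name": "dana", "dept": "engineering"},
    {"name": "eli", "dept": "sales"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1


def shuffle_phase(pairs):
    """Collect the emitted values under their key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in order for the GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for dept, total in sorted(count_by_dept(ROWS).items()):
        print(dept, total)
```

On a real cluster the map and reduce steps run in parallel across many machines, and the work of planning and launching those jobs is part of the latency Hive adds to each query.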

How and when can Hive be used?
===========================
 Hive can be used for OLAP (online analytical processing)
 It is scalable, fast and flexible
 It is a great platform for SQL users to write SQL-like queries that interact with the large datasets residing on the HDFS file system

Here is what Hive cannot be used for:
==============================
 It is not a relational database
 It cannot be used for OLTP (online transaction processing)
 It cannot be used for real-time updates or queries
 It cannot be used for scenarios where low-latency data retrieval is expected, because Hive adds latency when it converts Hive scripts into MapReduce jobs

Some of the finest features of Hive
============================
 It supports different file formats such as sequence file, text file, Avro, ORC and RC file
 Metadata is stored in an RDBMS such as the Derby database
 Hive supports several compression codecs, such as Snappy and gzip, and can run queries on the compressed data
 Users can write SQL-like queries that Hive converts into MapReduce, Tez or Spark jobs to run against Hadoop datasets
 Users can plug MapReduce scripts into Hive queries using UDFs (user-defined functions)
 Specialized joins are available that help improve query performance
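
As a rough illustration of plugging a custom script into a query, the sketch below is a small Python script of the kind that Hive's TRANSFORM clause can stream rows through. TRANSFORM is a mechanism related to, but separate from, a UDF. Hive passes each row to the script as a tab-separated line on standard input and reads tab-separated lines back from standard output. The script name, table and columns (`clean_names.py`, `users`, `id`, `name`) are hypothetical, and the clean-up logic is only an example.

```python
import sys

# Assumed usage from Hive (file, table and column names are hypothetical):
#   ADD FILE clean_names.py;
#   SELECT TRANSFORM (id, name) USING 'python clean_names.py' AS (id, name)
#   FROM users;


def transform_line(line):
    """Turn one tab-separated input row (id, name) into one output row."""
    user_id, name = line.rstrip("\n").split("\t")
    # Example logic only: trim surrounding whitespace and upper-case the name.
    return "\t".join([user_id, name.strip().upper()])


def transform_rows(lines):
    """Apply transform_line to every non-empty input line."""
    for line in lines:
        if line.strip():
            yield transform_line(line)


if __name__ == "__main__":
    # Inside Hive the rows would arrive on sys.stdin. Sample rows are used
    # here so the sketch runs on its own; pass sys.stdin for real use.
    sample = ["1\t alice \n", "2\tbob\n"]
    for row in transform_rows(sample):
        sys.stdout.write(row + "\n")
```

Keeping the per-row logic in its own function makes it easy to test the script locally before a query depends on it.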

If you don’t understand any of the above terms, that is fine. We will look into these features in detail in our upcoming videos.