In this video, you will get a quick overview of Apache Hive, one of the most popular data warehouse components in the big data landscape. It is mainly used to provide a SQL-like interface on top of the Hadoop file system.
Hive was originally developed by Facebook and is now maintained as Apache Hive by the Apache Software Foundation. It is also used and developed by major companies such as Netflix and Amazon.
Why was Hive developed?
The Hadoop ecosystem is not just scalable but also cost effective when it comes to processing large volumes of data. It is also a fairly new framework that packs a lot of punch. However, organizations with traditional data warehouses are built around SQL, and their users and developers rely on SQL queries to extract data.
That makes getting used to the Hadoop ecosystem an uphill task. And that is exactly why Hive was developed.
Hive provides a SQL-like interface, so that users can write queries in HQL, or Hive Query Language, to extract data from Hadoop. These SQL-like queries are converted into MapReduce jobs by the Hive component, and that is how Hive talks to the Hadoop ecosystem and the HDFS file system.
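For example, a user familiar with SQL could write an HQL query like the one below. This is only a sketch: the `page_views` table and its columns are hypothetical names chosen to illustrate the syntax.

```sql
-- Hypothetical HQL query; Hive compiles this into MapReduce jobs
-- that scan the table's underlying files on HDFS.
SELECT country, COUNT(*) AS visits
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY country;
```

To the user this looks like ordinary SQL, while behind the scenes Hive plans and runs it as distributed jobs over the cluster.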
How and when can Hive be used?
Hive can be used for OLAP (online analytical processing)
It is scalable, fast and flexible
It is a great platform for the SQL users to write SQL like queries to interact with the large datasets that reside on HDFS filesystem
Here is what Hive cannot be used for:
It is not a relational database
It cannot be used for OLTP (online transaction processing)
It cannot be used for real time updates or queries
It cannot be used for scenarios where low-latency data retrieval is expected, because of the latency introduced when Hive converts HQL queries into MapReduce jobs
Some of the finest features of Hive
It supports different file formats like text file, sequence file, Avro, RCFile, and ORC
Metadata gets stored in an RDBMS, such as the Derby database
Hive supports several compression codecs, such as Snappy and gzip, and can query the compressed data directly
Users can write SQL-like queries that Hive converts into MapReduce, Tez, or Spark jobs to query datasets in Hadoop
Users can plug custom MapReduce scripts and user-defined functions (UDFs) into Hive queries
Specialized joins are available that help to improve the query performance
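To make the file-format and compression features above concrete, here is a sketch of a table stored in the ORC format with Snappy compression. The table name and columns are hypothetical, chosen only to illustrate the DDL syntax.

```sql
-- Hypothetical table definition; ORC is one of the file formats
-- Hive supports, and Snappy is one of its compression codecs.
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING,
  country STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Queries against this table read the compressed ORC files directly; no manual decompression step is needed.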
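The execution-engine and UDF features above can be sketched as follows. This assumes a user-supplied JAR, class name, and table, all of which are hypothetical placeholders.

```sql
-- Choose the execution engine for this session (mr, tez, or spark).
SET hive.execution.engine = tez;

-- Register a hypothetical user-defined function from a custom JAR.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.CleanUrlUDF';

-- Use the UDF in an ordinary HQL query (hypothetical table).
SELECT clean_url(url) FROM page_views LIMIT 10;
```

The same query text can run on any of the three engines; only the session setting changes.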
If you don’t understand any of the above terms, that is fine. We will look into the above features in detail in our upcoming videos.