Using Apache Hive or Apache Spark v2
You can use the Hadoop Foreign Data Wrapper with either Apache Hive or Apache Spark. Both Hive and Spark store metadata in the configured metastore, where databases and tables are created using HiveQL.
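As a rough illustration of what such HiveQL definitions look like, the following sketch creates a database and a delimited text table in the metastore; the database, table, and column names are placeholders, not objects used elsewhere in this guide:

CREATE DATABASE IF NOT EXISTS demo_db;
-- A simple tab-delimited text table registered in the metastore
CREATE TABLE demo_db.page_views (
    view_time TIMESTAMP,
    user_id   STRING,
    page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;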
Using HDFS FDW with Apache Hive on top of Hadoop
Apache Hive data warehouse software helps with querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it's inconvenient or inefficient to express this logic in HiveQL.
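For instance, a HiveQL query over a table like the one sketched above reads much like ordinary SQL; the table and column names here are illustrative only:

-- Count page views per URL in the hypothetical demo_db.page_views table
SELECT page_url, COUNT(*) AS views
FROM demo_db.page_views
GROUP BY page_url;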
You can download either of the two Hive server versions, HiveServer1 and HiveServer2, from the Apache Hive website.
Note
The Hadoop Foreign Data Wrapper supports only HiveServer2.
To use HDFS FDW with Apache Hive on top of Hadoop:
Download weblogs_parse and follow the instructions on the Pentaho wiki.
Upload the weblogs_parse.txt file using these commands:

hadoop fs -mkdir /weblogs
hadoop fs -mkdir /weblogs/parse
hadoop fs -put weblogs_parse.txt /weblogs/parse/part-00000
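If you want to confirm that the file landed where expected, you can list the target directory. This check is optional and uses the same path as the commands above:

hadoop fs -ls /weblogs/parse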
Start HiveServer2, if it isn't already running, using one of the following commands:

$HIVE_HOME/bin/hiveserver2

or

$HIVE_HOME/bin/hive --service hiveserver2
Connect to HiveServer2 using the hive beeline client. For example:
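A minimal sketch of such a connection, assuming HiveServer2 is listening on its default port (10000) on the local host; the database name, user, and password are placeholders for your own values:

$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default -n hive_user -p hive_password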