Hive User Guide

Hive

Hive is a data warehouse tool built on Hadoop, used for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop: it maps structured data files to database tables, offers SQL query capabilities, and translates SQL statements into MapReduce jobs for execution.

Prerequisites

First, install and configure BOS HDFS as described in the BOS HDFS article. The Hadoop version installed on this machine is hadoop-3.3.2. Follow the "Getting Started" section of that article to complete a basic trial of BOS HDFS, then set the environment variables:

Bash
export HADOOP_HOME=/opt/hadoop-3.3.2
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
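
To confirm that BOS HDFS is reachable before moving on, a quick listing of the bucket helps (a minimal sketch; ${bucket_name} is a placeholder for your own bucket):

Bash
hadoop fs -ls bos://${bucket_name}/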

Install MySQL

MySQL is used to store Hive's metadata. It can be installed locally, or you can connect remotely to an existing MySQL instance or RDS. The version installed on this machine is mysql-5.1.61-4.el6.x86_64. After installation, check the service status with service mysqld status, and set a new password with:

Bash
/usr/bin/mysqladmin -u root -h ${IP} password ${new-password} # set a new password

You can also create a dedicated Hive user in MySQL and set its password, as sketched below.
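
A minimal sketch of creating such a user (the hive user name, database name, and password placeholder are illustrative, not mandated by this guide):

Bash
mysql -u root -p
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY '${new-password}';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;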

Install Hive

The Hive version installed on this machine is 2.3.9. Rename the two configuration templates in the conf folder:

Bash
mv hive-env.sh.template hive-env.sh
mv hive-site.xml.template hive-site.xml

Add the following to hive-env.sh:

Bash
export HIVE_CONF_DIR=/ssd2/apache-hive-2.3.9-bin/conf

Add the following to hive-site.xml:

XML
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>MySQL</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>JDBC</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username</description>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>new-password</value>
    <description>password</description>
</property>

In this configuration, javax.jdo.option.ConnectionURL is the connection address of the MySQL server, javax.jdo.option.ConnectionUserName is the MySQL user name used by Hive, and javax.jdo.option.ConnectionPassword is that user's password. Once these are set, copy the MySQL JDBC driver into Hive's lib folder; the driver used on this machine is mysql-connector-java-5.1.32-bin.jar.
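
As a sketch, assuming the paths from the earlier steps (the jar's source location is illustrative):

Bash
cp mysql-connector-java-5.1.32-bin.jar /ssd2/apache-hive-2.3.9-bin/lib/

Then initialize the Hive metastore schema in MySQL: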

Bash
./bin/schematool -dbType mysql -initSchema

Start Hive

Bash
./bin/hive

Test Hive

Create a table

Bash
-- create the database and switch to it
create database hive;
use hive;
-- create the table
create table hive_test (a int, b string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
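
To verify that the table landed in the hive database, a quick check from the hive shell (a minimal sketch):

Bash
hive> show tables;
hive> describe hive_test;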
1. Create a shell script named gen_data.sh:
Bash
#!/bin/bash
MAXROW=1000000 # number of rows of data to generate
for((i = 0; i < $MAXROW; i++))
do
   echo $RANDOM, \"$RANDOM\"
done

2. Run the script to generate the test data:

Bash
chmod +x gen_data.sh
./gen_data.sh > hive_test.data
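
The load statement in the next step reads from BOS, so the generated file must be uploaded first. With BOS HDFS configured as in the prerequisites, a put through the hadoop CLI works (${bucket_name} is again a placeholder for your bucket):

Bash
hadoop fs -put hive_test.data bos://${bucket_name}/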

3. Load the data into the table (note that load data inpath moves the source file into the table's storage directory rather than copying it):

Bash
load data inpath "bos://${bucket_name}/hive_test.data" into table hive.hive_test;

4. Query:

Bash
hive> select count(*) from hive_test;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20230528173013_6f5296db-562e-4342-917f-bcf14fc1480d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2023-05-28 17:30:16,548 Stage-1 map = 0%,  reduce = 0%
2023-05-28 17:30:18,558 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local238749048_0001
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1000000
hive> select * from hive_test limit 10;
OK
11027    "11345"
10227    "24281"
32535    "16409"
24286    "24435"
2498     "10969"
16662    "16163"
5345     "26005"
21407    "5365"
30608    "4588"
19686    "11831"