数据仓库 Hive(内含大数据镜像下载)

安装没成功:直接用现成的镜像
大数据Linux实验环境虚拟机镜像文件 http://dblab.xmu.edu.cn/blog/1645-2/
cloudera-quickstart-vm-集成了大数据平台的虚拟机镜像

1. 特点

  • 查询语言与 SQL 接近
  • 并行执行
  • 使用 HDFS 存储
  • 支持多种数据格式
  • 不支持数据更新
  • 不支持索引
  • 执行延迟高(不适合在线数据查询)
  • 可扩展性高
  • 数据规模大

2. Hive 体系架构

3. 安装 Hive

3.1 安装 MySQL

  • 在 master 上安装
yum localinstall https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
或者
wget https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm
rpm -ivh mysql57-community-release-el7-11.noarch.rpm
  • 查看可选版本
yum repolist all | grep mysql
yum install -y mysql-community-server
  • 启动 MySQL
[root@master opt]# systemctl start mysqld
[root@master opt]# systemctl status mysqld
● mysqld.service - MySQL Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: active (running) since 一 2021-03-29 06:25:16 CST; 22h ago
     Docs: man:mysqld(8)
           http://dev.mysql.com/doc/refman/en/using-systemd.html
  Process: 49287 ExecStart=/usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS (code=exited, status=0/SUCCESS)
  Process: 49266 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited, status=0/SUCCESS)
 Main PID: 49291 (mysqld)
    Tasks: 28
   CGroup: /system.slice/mysqld.service
           └─49291 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/my...

3月 29 06:25:13 master systemd[1]: Starting MySQL Server...
3月 29 06:25:16 master systemd[1]: Started MySQL Server.
  • 测试 MySQL
mysql -u root -p

enter 空密码,报错

报错参考https://blog.csdn.net/vv19910825/article/details/82979563
(操作:vim /etc/my.cnf, 添加 skip-grant-tables跳过密码 systemctl restart mysqld,下面改过密码了,在删除)

修改密码,注意满足密码强度

mysql> flush privileges; 
	-> alter user 'root'@'localhost' identified by '!@#123qwe';
    -> flush privileges;

vim /etc/my.cnf, 删除 skip-grant-tables
systemctl restart mysqld

mysql -u root -p 使用新密码,再次登录

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
4 rows in set (0.01 sec)
  • 在 MySQL 中创建 Hive 所需用户和数据库,并授权
mysql> create user 'dnn' identified by '!@#123Qwe';
mysql> create database hive;
Query OK, 1 row affected (0.02 sec)
mysql> grant all privileges on hive.* to 'dnn'@'localhost' identified by '!@#123Qwe';
mysql> flush privileges;

hive 表创建成功

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

3.2 安装 Hive

下载地址 : https://hive.apache.org/downloads.html

我下载的是 apache-hive-3.1.2-bin.tar.gz

tar -zxvf apache-hive-3.1.2-bin.tar.gz 

- 赋权限给普通用户 dnn

[root@master opt]# chown -R dnn /opt/hive3.1.2/
  • 复制 MySQL 的 JDBC 驱动包到 Hive 的 /lib 下
下载地址 https://dev.mysql.com/downloads/connector/j/

解压

tar -zxvf mysql-connector-java-5.1.49.tar.gz 

移动

[dnn@master Downloads]$ mv mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar /opt/hive3.1.2/lib
[dnn@master Downloads]$ rm -rf mysql-connector-java-5.1.49

3.3 配置 Hive

  • 配置 hive-env.sh
[dnn@master opt]$ cd /opt/hive3.1.2/conf/
[dnn@master conf]$ cp hive-env.sh.template  hive-env.sh
[dnn@master conf]$ vim hive-env.sh

第48行改为自己的路径 HADOOP_HOME=/opt/hadoop-3.3.0
第51行 export HIVE_CONF_DIR=/opt/hive3.1.2/conf
第54行 export HIVE_AUX_JARS_PATH=/opt/hive3.1.2/lib

  • 配置 hive-default.xml,直接复制模板
cp hive-default.xml.template hive-default.xml
  • 配置 vim hive-site.xml
<configuration>

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
ql.jdbc.Driver     To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore,com.mysql.jdbc.Driver is depricated</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>dnn</value>
    <description>Username to use against metastore database</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>!@#123Qwe</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

  • 初始化 Hive Metastore
    在 hive 的 bin 目录下 ./schematool -initSchema -dbType mysql

配置环境变量

vim ~/.bashrc

添加

export HIVE_HOME=/opt/hive3.1.2
export PATH=${HIVE_HOME}/bin:$PATH

使之生效

source ~/.bashrc 

4. 实践

启动 hadoop 命令:

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
# 第三条可以用下面的命令,上面的显示过期了,以后弃用
mapred --daemon start historyserver

4.1 函数

显示函数 show functions;

hive> show functions;
OK
!
!=
$sum0
%
(省略。。。)

显示函数帮助信息:describe function upper(函数名);

hive> 
    > describe function upper;
OK
upper(str) - Returns str with all characters changed to uppercase
Time taken: 0.007 seconds, Fetched: 1 row(s)
  • 自定义函数,重写 UDF 中的 evaluate()
package michael_package;

import org.apache.hadoop.hive.ql.exec.UDF;

public class Sub extends UDF{
	public Integer evaluate(Integer a, Integer b) {
		if(a==null || b==null)
			return null;
		return a-b;
	}
	public Double evaluate(Double a, Double b) {
		if(a==null || b==null)
			return null;
		return a-b;
	}
	public Integer evaluate(Integer a, Integer[] b) {
		if(a==null || b==null)
			return null;
		int ans = a;
		for(int i = 0; i < b.length; ++i) {
			if(b[i] != null)
				ans -= b[i];
		}
		return ans;
	}
}
  • 表生成自定义函数 UDTF
  • 聚集自定义函数 UDAF

4.2 Hive Shell

  • create tableshow tablesdescribe 表name
hive> create table student(
    > id string,
    > name string,
    > sex string,
    > age tinyint,
    > dept string)
    > row format delimited fields terminated by '\t';
OK
Time taken: 0.464 seconds
hive> show tables;
OK
student
Time taken: 0.085 seconds, Fetched: 1 row(s)
hive> describe student;
OK
id                  	string              	                    
name                	string              	                    
sex                 	string              	                    
age                 	tinyint             	                    
dept                	string              	                    
Time taken: 0.069 seconds, Fetched: 5 row(s)
  • 编写个数据文件
hadoop@dblab-VirtualBox:/usr/local/eclipse$ vim /home/hadoop/workspace/student.txt
hadoop@dblab-VirtualBox:/usr/local/eclipse$ cat /home/hadoop/workspace/student.txt
1	michael	male	18	bigdata
2	ming	male	19	AI
3	lili	female	18	math
4	huahua	female	20	AI
  • 加载数据到表格
hive> load data local inpath '/home/hadoop/workspace/student.txt' into table student;
Loading data to table default.student
OK
Time taken: 1.158 seconds
  • select 操作表
hive> select * from student;
OK
1	michael	male	18	bigdata
2	ming	male	19	AI
3	lili	female	18	math
4	huahua	female	20	AI
Time taken: 0.789 seconds, Fetched: 4 row(s)
hive> select * from student where sex='male';
OK
1	michael	male	18	bigdata
2	ming	male	19	AI
Time taken: 1.565 seconds, Fetched: 2 row(s)

HiveQL 将命令转换为 MapReduce 操作

hive> select sex, count(*) from student group by sex;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20210331231151_0e76f173-ac54-46e0-8886-f018f8a7f480
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2021-03-31 23:11:52,943 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local236806530_0001
MapReduce Jobs Launched: 
Stage-Stage-1:  HDFS Read: 352 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
female	2
male	2
Time taken: 1.55 seconds, Fetched: 2 row(s)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| spark              |
| sys                |
+--------------------+
6 rows in set (0.00 sec)

mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| AUX_TABLE                 |
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |
| COMPLETED_COMPACTIONS     |
| COMPLETED_TXN_COMPONENTS  |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| HIVE_LOCKS                |
| IDXS                      |
| INDEX_PARAMS              |
| KEY_CONSTRAINTS           |
| MASTER_KEYS               |
| NEXT_COMPACTION_QUEUE_ID  |
| NEXT_LOCK_ID              |
| NEXT_TXN_ID               |
| NOTIFICATION_LOG          |
| NOTIFICATION_SEQUENCE     |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TXNS                      |
| TXN_COMPONENTS            |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
| WRITE_SET                 |
+---------------------------+
57 rows in set (0.00 sec)
  • 非交互式,加载查询脚本
hadoop@dblab-VirtualBox:~/workspace$ vim script.q
hadoop@dblab-VirtualBox:~/workspace$ cat script.q 
select * from student;

hadoop@dblab-VirtualBox:~/workspace$ hive -f script.q 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Wed Mar 31 23:36:09 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:09 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:09 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:09 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:10 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:10 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:10 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Wed Mar 31 23:36:10 CST 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
OK
1	michael	male	18	bigdata
2	ming	male	19	AI
3	lili	female	18	math
4	huahua	female	20	AI
Time taken: 1.376 seconds, Fetched: 4 row(s)
  • -e 选项内嵌语句
    hadoop@dblab-VirtualBox:~/workspace$ hive -e 'select * from student'
已标记关键词 清除标记
相关推荐
<p> <span style="color:#3D3D3D;">本教程为授权出品</span> </p> <p> <span style="color:#3D3D3D;"><br /> </span> </p> <p> <span style="color:#3D3D3D;"><span style="color:#404040;">一、课程简介</span><br /> <span style="color:#404040;">数据仓库(Data Warehouse,可简写为DW或DWH),是面向分析的集成化数据环境,为企业决策制定过程,提供系统数据支持的战略集合,是国内外各大公司正在重点投入的战略级技术领域。</span></span> </p> <p> <span style="background-color:#FFFFFF;"><br /> </span><span style="color:#404040;">二、课程内容</span><br /> <span style="color:#404040;">《大数据电商数仓项目实战》视频教程,从项目架构的搭建,到数据采集模块的设计、数仓架构的设计、实战需求实现、即席查询的实现,我们针对国内目前广泛使用的Apache原生框架和CDH版本框架进行了分别介绍,Apache原生框架介绍中涉及到的技术框架包括Flume、Kafka、Sqoop、MySql、HDFS、Hive、Tez、Spark、Presto、Druid等,CDH版本框架讲解包括CM的安装部署、Hadoop、Zookeeper、Hive、Flume、Kafka、Oozie、Impala、HUE、Kudu、Spark的安装配置,透彻了解不同版本框架的区别联系,将大数据全生态系统前沿技术一网打尽。在过程中对大数据生态体系进行了系统的讲解,对实际企业数仓项目中可能涉及到的技术点都进行了深入的讲解和探讨。同时穿插了大量数仓基础理论知识,让你在掌握实战经验的同时能够打下坚实的理论基础。</span> </p> <p> <span style="background-color:#FFFFFF;"><br /> </span><span style="color:#404040;">三、课程目标</span><br /> <span style="color:#404040;">本课程以国内电商巨头实际业务应用场景为依托,对电商数仓的常见实战指标以及难点实战指标进行了详尽讲解,具体指标包括:每日、周、月活跃设备明细,留存用户比例,沉默用户、回流用户、流失用户统计,最近连续3周活跃用户统计,最近7天内连续3天活跃用户统计,GMV成交总额分析,转化率及漏斗分析,品牌复购率分析、订单表拉链表的设计等,让学生拥有更直观全面的实战经验。通过对本课程的学习,对数仓项目可以建立起清晰明确的概念,系统全面的掌握各项数仓项目技术,轻松应对各种数仓难题。</span> </p> <p> <span style="background-color:#FFFFFF;"><br /> </span><span style="color:#404040;">四、课程亮点</span><br /> <span style="color:#404040;">本课程结合国内多家企业实际项目经验,特别加入了项目架构模块,从集群规模的确定到框架版本选型以及服务器选型,手把手教你从零开始搭建大数据集群。并且总结大量项目实战中会遇到的问题,针对各个技术框架,均有调优实战经验,具体包括:常用Linux运维命令、Hadoop集群调优、Flume组件选型及性能优化、Kafka集群规模确认及关键参数调优。通过这部分学习,助学生迅速成长,获取前沿技术经验,从容解决实战问题。</span> </p> <p> <br /> </p>
©️2020 CSDN 皮肤主题: Age of Ai 设计师:meimeiellie 返回首页