Website Log Analysis System


BIG_DATA_LOG2 code download

BIG_DATA_SXT_1 code download

BIG_DATA_SXT_2 code download

BIG_DATA_SXT_3 code download

Main content:

User behavior data is collected from the website, stored in the appropriate databases, and then analyzed to extract useful information. The experiment covers seven analysis areas: basic user information, browser information, region information, browse depth, external link (referrer) data, orders, and events.

Basic requirements:

1. Deploy a suitable and complete experiment environment.
2. Write correct code.
3. Choose appropriate big data components to complete the experiment effectively.
4. Correctly resolve errors encountered during the experiment.
5. Further strengthen the understanding and use of the individual big data components.

Main references:

[1] Tom White (translated by 周敏奇 et al.). 《Hadoop权威指南》. 清华大学出版社.
[2] 陈嘉恒. 《Hadoop实践》. 机械工业出版社, 2011.
[3] 吴刚山. 《基于Hadoop与Spark的大数据开发实战》. 人民邮电出版社, 2018.
[4] 林子雨, 赖永炫, 陶继平. 《Spark编程基础》. 人民邮电出版社, 2018.

Abstract

In internet applications, logs are extremely valuable data. Because internet services are usually required to run 24/7, being able to collect and analyze the logs that monitor system operation is especially important. Website traffic statistics are one of the key means of improving a website's service: by collecting users' behavior data on the site and analyzing it, valuable information can be obtained and used to improve the site. This system replaces manual monitoring with automated collection: the monitored data is stored in a database, and the relevant big data components turn a huge and seemingly useless mass of data into analysis results that site administrators can use to assess how the site is operating and decide on corresponding actions. The system integrates basic user information, browser information, region information, browse depth, external link data, order data, and event data, stores them in the appropriate database components, and finally extracts and analyzes the data according to the different analysis requirements.

Chapter 1  System Requirements Analysis

1.1  System Requirements

The system provides the following functions:

  • Record user, browser, region, browse depth, and related information.
  • Store the collected data in a suitable database.
  • Basic user information analysis
  • Browser information analysis
  • Region information analysis
  • Browse depth analysis
  • External link data analysis
  • Order analysis
  • Event analysis

1.2  Main Functions

1. Basic user information analysis module
Analyzes browsing information from two main perspectives, users/visitors and members, including but not limited to new users, active users, total users, new members, active members, total members, and session analysis.
2. Browser information analysis module
Adds the browser dimension on top of the basic user information analysis.
3. Region information analysis module
Analyzes users and members by province, as well as the number of visitors acquired from different regions.
4. Browse depth analysis module
Analyzes the depth of users' visit records.
5. External link data analysis module
Analyzes the visit traffic brought in by different external referrers.
6. Order data analysis module
Analyzes order-related statistics.
7. Event analysis module
In this experiment, only a single order-related event is analyzed.

1.3  Development Environment

Three virtual machines running CentOS 7.0.
JDK 1.8 + Tomcat 8.5 + Tengine 2.1 + Flume 1.6 + Hadoop 2.5 + HBase 0.98.12 + ZooKeeper 3.4.7 + Hive 2.1.1 + Sqoop 1.4.6

Chapter 2  System Outline Design

2.1  JS SDK Execution Workflow

Figure 2.1  JS SDK execution workflow

2.2  Data Flow Diagram

Figure 2.2  Data flow diagram

2.3  Modules and Events

The launch event is used for the new-user metric; all other analyses are driven by the pageview event. The events used by each analysis module are listed in the table below.

Analysis module                  JS SDK event
Basic user information analysis  pageview event
Browser information analysis     pageview event
Region information analysis      pageview event
External link data analysis      pageview event
Browse depth analysis            pageview event
Order information analysis       chargeRequest event
Event analysis                   event event
New users                        launch event

Table 2.1  Modules and their corresponding events

2.4  Event Data Parameters

Each event collects a different set of data and sends it to the nginx server, but the collected data shares a number of common fields. The parameters used are described in the table below; a short parsing sketch follows the table.

Parameter  Type    Description
en         string  Event name, e.g. e_pv
ver        string  Version number, e.g. 0.0.1
pl         string  Platform, e.g. website
sdk        string  SDK type, e.g. js
b_rst      string  Browser resolution, e.g. 1800*678
b_iev      string  Browser information (user agent)
u_ud       string  Unique user/visitor identifier
l          string  Client language
u_mid      string  Member id, consistent with the business system
u_sd       string  Session id
c_time     string  Client time
p_url      string  URL of the current page
p_ref      string  URL of the previous page
tt         string  Title of the current page
ca         string  Category name of an event
ac         string  Action name of an event
kv_*       string  Custom attributes of an event
du         string  Duration of an event
oid        string  Order id
on         string  Order name
cua        string  Payment amount
cut        string  Payment currency type
pt         string  Payment method

Table 2.2  Event data parameters
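The following is a minimal Java sketch of how these parameters could be decoded on the server side, assuming the event data arrives as a standard URL-encoded query string; the class and method names are illustrative, not taken from the project code.

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

public class EventParamParser {

    // Split a query string such as "en=e_pv&ver=1&pl=website" into a parameter map.
    public static Map<String, String> parse(String queryString) {
        Map<String, String> params = new HashMap<>();
        if (queryString == null || queryString.isEmpty()) {
            return params;
        }
        for (String pair : queryString.split("&")) {
            int idx = pair.indexOf('=');
            if (idx <= 0) {
                continue; // skip malformed pairs
            }
            try {
                String key = URLDecoder.decode(pair.substring(0, idx), "UTF-8");
                String value = URLDecoder.decode(pair.substring(idx + 1), "UTF-8");
                params.put(key, value);
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is always available; skip the pair if decoding fails anyway
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("en=e_pv&ver=1&pl=website&sdk=js&u_ud=12BF4079");
        System.out.println(p.get("en"));   // e_pv
        System.out.println(p.get("u_ud")); // 12BF4079
    }
}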

a) Launch event

Triggered the first time a user visits the website. The SDK does not expose a public method for this event; it only implements its data collection.

Method name: (none; triggered internally)
Data sent: u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449137597974&ver=1&en=e_l&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Table 2.3  Launch event

b) Pageview event

Triggered when a user visits or refreshes a page. The event fires automatically and can also be invoked manually by the developer.

Method name: onPageView
Data sent: ver=1&en=e_pv&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN&u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449137597979&ht=www.bjsxt.com%3A8080&p_url=http%3A%2F%2Fwww.bjsxt.com%3A8080%2Fvst_track%2Findex.html

Table 2.4  Pageview event

c) chargeRequest event

Triggered when a user places an order; the application must call it explicitly.

Method name: onChargeRequest
Data sent: u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&c_time=1449139048231&oid=orderid123&on=%E4%BA%A7%E5%93%81%E5%90%8D%E7%A7%B0&cua=1000&cut=%E4%BA%BA%E6%B0%91%E5%B8%81&pt=%E6%B7%98%E5%AE%9D&ver=1&en=e_crt&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Parameter       Type    Required  Description
orderId         string            Order id
orderName       string            Product purchase description name
currencyAmount  double            Order price
currencyType    string            Currency type
paymentType     string            Payment method

Table 2.5  chargeRequest event

d) Event event

Called by the front-end code when a visitor/user triggers a business-defined event.

Method name: onEventDuration
Data sent: ca=%E7%B1%BB%E5%9E%8B&ac=%E5%8A%A8%E4%BD%9C&c_time=1449139512665&u_sd=8E9559B3-DA35-44E1-AC98-85EB37D1F263&kv_p_url=http%3A%2F%2Fwwwbjsxt..com%3A8080%2Fvst_track%2Findex.html&kv_%E5%B1%9E%E6%80%A7key=%E5%B1%9E%E6%80%A7value&du=1000&ver=1&en=e_e&pl=website&sdk=js&b_rst=1920*1080&u_ud=12BF4079-223E-4A57-AC60-C1A04D8F7A2F&b_iev=Mozilla%2F5.0%20(Windows%20NT%206.1%3B%20WOW64)%20AppleWebKit%2F537.1%20(KHTML%2C%20like%20Gecko)%20Chrome%2F21.0.1180.77%20Safari%2F537.1&l=zh-CN

Parameter  Type    Required  Description
category   string            Custom event category name
action     string            Custom event action
map        map               Other parameters
duration   long              Event duration

Table 2.6  Event event

  • Other API methods

Only the method for setting the member id is described here; the other helper methods are introduced later when the JS code is written.

a) Set member id

Method name: setMemberId
Description: sets the member id; call this method after a member logs in.

Parameter  Type    Required  Description
memberId   string            Member id

Table 2.7  Other API methods

Chapter 3  SDK Implementation

3.1  Project Structure

Figure 3.1  Project structure

Starting the service is shown in the figure below.

Figure 3.2  Starting the service

3.2  Event Tests

1. Open http://localhost:8080/BIG_DATA_LOG2/demo.jsp in a browser to test.

Figure 3.3  Test page 1

2. Clicking demo2 triggers the pageview event.

Figure 3.4  Test page 2

3. Test the demo3 page.

Figure 3.5  Test page 3

4. Test the demo4 page.

Figure 3.6  Test page 4

Chapter 4  nginx Configuration and Testing

4.1  nginx.conf
#user  nobody;
worker_processes  1;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format my_format '$remote_addr^A$msec^A$http_host^A$request_uri';

    sendfile        on;
    keepalive_timeout  65;

    server {
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            index  index.html index.htm;
        }

        location = /log.gif {
            default_type image/gif;
            access_log /opt/data/access.log my_format;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}
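The following is a minimal Java sketch of splitting one access-log line written by my_format into its four fields. It assumes the separator appears in the log as the literal two characters "^A"; if an actual Ctrl-A control character is embedded in the config instead, split on "\u0001". The class name and sample line are illustrative.

public class AccessLogLine {

    // my_format: $remote_addr^A$msec^A$http_host^A$request_uri
    private static final String SEPARATOR = "\\^A"; // regex for the literal ^A separator

    public static void main(String[] args) {
        String line = "192.168.1.10^A1449137597.974^Anode1^A/log.gif?en=e_pv&ver=1";
        String[] fields = line.split(SEPARATOR);
        if (fields.length == 4) {
            String remoteAddr = fields[0]; // client ip
            String msec       = fields[1]; // server timestamp, seconds.milliseconds
            String httpHost   = fields[2]; // Host header
            String requestUri = fields[3]; // /log.gif?... carrying the event parameters
            System.out.println(remoteAddr + " " + msec + " " + httpHost + " " + requestUri);
        }
    }
}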

4.2  Local Log File

Figure 4.1  Log data sent back from the test pages

Chapter 5  Flume Configuration and Testing

5.1  Flume Configuration

At this point the logs are being collected locally. Since the data ultimately needs to land on HDFS, Flume is used to ship it there. The configuration is as follows.

a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/access.log
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=hdfs://log/%Y%m%d
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.rollInterval=0
a1.sinks.k1.hdfs.rollSize=10240
a1.sinks.k1.hdfs.idleTimeout=5
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.callTimeout=40000
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
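Assuming the agent configuration above is saved to a file such as /opt/flume/conf/log.conf (the path is illustrative), it can be started with the standard flume-ng command:

flume-ng agent --conf /opt/flume/conf --conf-file /opt/flume/conf/log.conf --name a1 -Dflume.root.logger=INFO,console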

5.2  Testing Flume

Verify that Flume writes the data to HDFS.

Figure 5.1  Verifying that Flume writes the data to HDFS

Chapter 6  ETL

6.1  Project Structure

Figure 6.1  Project structure

6.2  Testing the Data Filter

Records that do not match the expected format are filtered out, and the valid records are stored in HBase; a minimal filter sketch follows the figure below.

Figure 6.2  Testing the data filter
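Below is a minimal sketch of the kind of validity check the ETL step might apply before writing a record to HBase, building on the parameter map from section 2.4. The class, method, and chosen required fields are illustrative assumptions, not the project's actual filter code.

import java.util.Map;

public class EventLogFilter {

    // A record is considered valid only if the fields every event must carry are present.
    public static boolean isValid(Map<String, String> params) {
        if (params == null || params.isEmpty()) {
            return false;
        }
        String en = params.get("en");    // event name
        String pl = params.get("pl");    // platform
        String uud = params.get("u_ud"); // unique visitor id
        return en != null && !en.isEmpty()
                && pl != null && !pl.isEmpty()
                && uud != null && !uud.isEmpty();
    }
}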

6.3  Checking the Data in HBase

After filtering, the data stored in HBase no longer contains dirty records.

Figure 6.3  Clean data in HBase

Chapter 7  New-User Metric Design

7.1  Design of the New-User Metric

Basic user information module: new users (by time)
Browser analysis module: new users (by time and browser)

2018-08-10   www.bjsxt.com   zhangsan  firefox-48
2018-08-10   www.bjsxt.com   lisi      firefox-53

MapReduce design (first approach):
(1) map output:
2018-08-10                      zhangsan
2018-08-10,firefox-48           zhangsan
2018-08-10,firefox-all          zhangsan

2018-08-10                      lisi
2018-08-10,firefox-53           lisi
2018-08-10,firefox-all          lisi

(2) reduce output (count of distinct users per key):
2018-08-10                      zhangsan, lisi    2

2018-08-10,firefox-48           zhangsan          1

2018-08-10,firefox-53           lisi              1

2018-08-10,firefox-all          zhangsan, lisi    2

Treating the module itself as part of the key:
time, for the basic user information module
time + browser, for the browser analysis module

2018-08-10   www.bjsxt.com   zhangsan  firefox-48
2018-08-10   www.bjsxt.com   lisi      firefox-53

MapReduce design (with the module in the key):
(1) map output:
2018-08-10,user                         zhangsan
2018-08-10,firefox-48,browser           zhangsan
2018-08-10,firefox-all,browser          zhangsan

2018-08-10,user                         lisi
2018-08-10,firefox-53,browser           lisi
2018-08-10,firefox-all,browser          lisi

(2) reduce output:
2018-08-10,user                         zhangsan, lisi    2

2018-08-10,firefox-48,browser           zhangsan          1

2018-08-10,firefox-all,browser          zhangsan, lisi    2

2018-08-10,firefox-53,browser           lisi              1
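The design above can be condensed into the following Java MapReduce sketch. It is a minimal illustration that assumes input lines of the form "date site user browser" separated by whitespace; the class names and field layout are assumptions, not the project's actual Runner/Mapper classes.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits one key per dimension combination for every record; the reducer
// counts distinct users per key, which is the new-user count for that dimension.
public class NewUserMR {

    public static class NewUserMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // expected line: 2018-08-10  www.bjsxt.com  zhangsan  firefox-48
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            String date = fields[0], user = fields[2], browser = fields[3];
            String browserName = browser.split("-")[0]; // firefox-48 -> firefox
            context.write(new Text(date + ",user"), new Text(user));
            context.write(new Text(date + "," + browser + ",browser"), new Text(user));
            context.write(new Text(date + "," + browserName + "-all,browser"), new Text(user));
        }
    }

    public static class NewUserReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> distinctUsers = new HashSet<>();
            for (Text user : values) {
                distinctUsers.add(user.toString());
            }
            context.write(key, new IntWritable(distinctUsers.size()));
        }
    }
}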

7.2  MySQL Table Design

The SQL for the tables used in this project is as follows.

DROP TABLE IF EXISTS `dimension_browser`;
CREATE TABLE `dimension_browser` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `browser_name` varchar(45) NOT NULL DEFAULT '' COMMENT '浏览器名称',
  `browser_version` varchar(255) NOT NULL DEFAULT '' COMMENT '浏览器版本号',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='浏览器维度信息表';
DROP TABLE IF EXISTS `dimension_currency_type`;
CREATE TABLE `dimension_currency_type` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `currency_name` varchar(10) DEFAULT NULL COMMENT '货币名称',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='支付货币类型维度信息表';
DROP TABLE IF EXISTS `dimension_date`;
CREATE TABLE `dimension_date` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `year` int(11) DEFAULT NULL,
  `season` int(11) DEFAULT NULL,
  `month` int(11) DEFAULT NULL,
  `week` int(11) DEFAULT NULL,
  `day` int(11) DEFAULT NULL,
  `calendar` date DEFAULT NULL,
  `type` enum('year','season','month','week','day') DEFAULT NULL COMMENT '日期格式',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT COMMENT='时间维度信息表';
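The dimension tables above are typically used with a get-or-create lookup: find the id of a dimension value, inserting it first if it does not exist yet. Below is a minimal JDBC sketch of that pattern against dimension_browser; the table and column names follow the DDL above, but the helper class and connection details are illustrative, not the project's DimensionConverter code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BrowserDimensionDao {

    // Return the id of (browser_name, browser_version) in dimension_browser,
    // inserting the row first if it does not exist yet.
    public static int getOrCreate(Connection conn, String name, String version) throws SQLException {
        String select = "SELECT id FROM dimension_browser WHERE browser_name = ? AND browser_version = ?";
        try (PreparedStatement ps = conn.prepareStatement(select)) {
            ps.setString(1, name);
            ps.setString(2, version);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    return rs.getInt(1);
                }
            }
        }
        String insert = "INSERT INTO dimension_browser (browser_name, browser_version) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(insert, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, name);
            ps.setString(2, version);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getInt(1);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        // connection URL and credentials are placeholders
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://node1:3306/report", "hive", "hive")) {
            System.out.println(getOrCreate(conn, "firefox", "48"));
        }
    }
}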

Due to space constraints, the rest of the SQL is provided in the downloadable code files.

The figure below shows the tables after they have been created.

Figure 7.1  The created tables

7.3  Project Structure

Figure 7.2  Project structure

7.4  Writing the Results to MySQL

After the MapReduce jobs run, the results are written to MySQL for the web tier to read.

Figure 7.3  Data in the stats_device_browser table in MySQL

Chapter 8  Integrating Hive with HBase

8.1  Sqoop Configuration

In this project Sqoop is used to move data between MySQL and Hive. Edit configure-sqoop and comment out the checks for components that are not installed, for example HBase, HCatalog, and Accumulo:

#if [ ! -d "${HBASE_HOME}" ]; then
#  echo "Error: $HBASE_HOME does not exist!"
#  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
#  exit 1
#fi

8.2  Sqoop Import and Export

Import (Sqoop options file, saved as sqoop1):
import
--connect
jdbc:mysql://hadoop1:3306/test
--username
root
--password
123
--as-textfile
--columns
id,name,msg
--table
psn
--delete-target-dir
--target-dir
/sqoop/data
-m
1
Command:
sqoop --options-file sqoop1

Import into Hive:
import
--connect
jdbc:mysql://hadoop1/test
--username
root
--password
123
--as-textfile
--query
'select id, name, msg from psn where id like "1%" and $CONDITIONS'
--delete-target-dir
--target-dir
/sqoop/tmp
-m
1
--hive-home
/home/hive-1.2.1
--hive-import
--create-hive-table
--hive-table
t_test
Export:
export
--connect
jdbc:mysql://node1/test
--username
root
--password
123
-m
1
--columns
id,name,msg
--export-dir
/sqoop/data
--table
h_psn
Figure 8.1  The sqoop directory in HDFS

8.3  Synchronizing Hive and HBase

1. Copy hive-hbase-handler-1.2.1.jar into hbase/lib, and copy all of the jars under hbase/lib into hive/lib.

2. Add the following property to the Hive configuration file:

<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
  </property>

3. Create the mapping tables in Hive:

CREATE EXTERNAL TABLE tmp_order 
(key string, id string, user_id string)  
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,order:order_id,order:user_id")  
TBLPROPERTIES ("hbase.table.name" = "t_order");

CREATE TABLE hbasetbl(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

As long as the mapping between Hive and HBase is defined on both sides, the two stores stay in sync. Note that only HBase columns included in the mapping are synchronized; unmapped columns will not appear in Hive.

Figure 8.2  The t_order table in HBase

Figure 8.3  The t_order table in Hive

Chapter 9  Browse Depth

9.1  Creating the Tables

1. In Hive, create the external table mapped to the HBase event_log table:

CREATE EXTERNAL TABLE event_logs(
key string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties('hbase.columns.mapping'=':key,log:pl,log:en,log:s_time,log:p_url,log:u_ud,log:u_sd')
tblproperties('hbase.table.name'='eventlog');

2. Create the Hive table corresponding to the MySQL table; the results of the HQL analysis are stored in this table and then exported to MySQL with Sqoop:

CREATE TABLE `stats_view_depth` (
  `platform_dimension_id` bigint ,
  `data_dimension_id` bigint , 
  `kpi_dimension_id` bigint , 
  `pv1` bigint , 
  `pv2` bigint , 
  `pv3` bigint , 
  `pv4` bigint , 
  `pv5_10` bigint , 
  `pv10_30` bigint , 
  `pv30_60` bigint , 
  `pv60_plus` bigint , 
  `created` string
) row format delimited fields terminated by '\t';

3. Create a temporary Hive table to hold the intermediate results of the HQL analysis:

CREATE TABLE `stats_view_depth_tmp`(`pl` string, `date` string, `col` string, `ct` bigint);

4. Write the UDFs (platform dimension & date dimension). Note: remove all of the FileSystem close operations in the DimensionConvertClient class.

5. Upload transformer-0.0.1.jar to the /sxt/transformer directory in HDFS.

6. Create the Hive functions:

#create function platform_convert as 'com.sxt.transformer.hive.PlatformDimensionUDF' using jar 'hdfs://sxt/sxt/transformer/transformer-0.0.1.jar';  
create function date_convert as 'com.sxt.transformer.hive.DateDimensionUDF' using jar 'hdfs://sxt/sxt/transformer/transformer-0.0.1.jar'; 

7. Write the HQL for browse depth from the user perspective (note: the date range is supplied externally):

from (
  select 
    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud, 
    (case when count(p_url) = 1 then "pv1" 
      when count(p_url) = 2 then "pv2" 
      when count(p_url) = 3 then "pv3" 
      when count(p_url) = 4 then "pv4" 
      when count(p_url) >= 5 and count(p_url) <10 then "pv5_10" 
      when count(p_url) >= 10 and count(p_url) <30 then "pv10_30" 
      when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  
      else 'pv60_plus' end) as pv 
  from event_logs 
  where 
    en='e_pv' 
    and p_url is not null 
    and pl is not null 
    and s_time >= unix_timestamp('2016-06-08','yyyy-MM-dd')*1000 
    and s_time < unix_timestamp('2016-06-09','yyyy-MM-dd')*1000
  group by 
    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud
) as tmp
insert overwrite table stats_view_depth_tmp 
  select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv;

-- Convert the multiple rows in the temporary table into one row per platform and date:

with tmp as 
(
select pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select 'all' as pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'
)
from tmp
insert overwrite table stats_view_depth 
select 2,3,6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2017-01-10' group by pl,date1;

8. Write the Sqoop export script (user perspective):

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

9. Write the HQL for browse depth from the session perspective (note: the date range is supplied externally):

from (
select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd, (case when count(p_url) = 1 then "pv1" when count(p_url) = 2 then "pv2" when count(p_url) = 3 then "pv3" when count(p_url) = 4 then "pv4" when count(p_url) >= 5 and count(p_url) <10 then "pv5_10" when count(p_url) >= 10 and count(p_url) <30 then "pv10_30" when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  else 'pv60_plus' end) as pv 
from event_logs 
where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd
) as tmp
insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv;

with tmp as 
(
select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all
select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all
select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all
select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'
)
from tmp
insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date;

10. Write the Sqoop export script (session perspective):

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

-- 1. In Hive, create the table mapped to the HBase event_logs table (for the order analysis):

CREATE EXTERNAL TABLE event_logs(row string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string, ca string, ac string, oid string, `on` string, cua bigint, cut string, pt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties('hbase.columns.mapping'=':key,info:pl,info:en,info:s_time,info:p_url,info:u_ud,info:u_sd,info:ca,info:ac, info:oid,info:on,info:cua,info:cut,info:pt')
tblproperties('hbase.table.name'='eventlog');

-- 2. Write the custom UDFs (CurrencyType & PaymentType)

-- 3. Upload transformer-0.0.1.jar to the Linux machines and to HDFS, then start the DimensionConverterServer service

-- 4. Create the functions:

create function currency_type_convert as 'com.sxt.transformer.hive.CurrencyTypeDimensionUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function payment_type_convert as 'com.sxt.transformer.hive.PaymentTypeDimensionUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function order_info as 'com.sxt.transformer.hive.OrderInfoUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';
create function order_total_amount as 'com.sxt.transformer.hive.OrderTotalAmountUDF' using jar 'hdfs://node1:8020/sxt/transformer/transformer-0.0.1.jar';

-- 5. Create the temporary tables:

create table stats_order_tmp1(pl string, dt string, cut string, pt string, values bigint);
CREATE TABLE `stats_order_tmp2` (`platform_dimension_id` bigint, `date_dimension_id` bigint, `currency_type_dimension_id` bigint, `payment_type_dimension_id` bigint, `values` bigint, `created` string);

-- 6. Save the order data to MySQL

Create the intermediate Hive table:

CREATE TABLE `order_info` (`order_id` string, `platform` string, `s_time` bigint, `currency_type` string, `payment_type` string, `amount` bigint);

Insert into the Hive table with HQL:

from event_logs
insert overwrite table order_info
select oid, pl, s_time, cut, pt, cua
where en='e_crt' and pl is not null
  and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000
  and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000

-- or, equivalently:

from (
  select oid, pl, s_time, cut, pt, cua from event_logs
  where en='e_crt' and pl is not null
    and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000
    and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
) as tmp
insert overwrite table order_info select oid, pl, s_time, cut, pt, cua

Sqoop script:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table order_info --export-dir /hive/bigdater.db/order_info/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key order_id

-- 7. Order count HQL (overall):

from (
  select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, cut, pt, count(distinct oid) as orders
  from event_logs
  where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
  group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), cut, pt
) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 8. Order count Sqoop (overall):

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,orders,created

-- 9. Order amount HQL (overall):

from (
  from (
    select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, cut, pt, oid, max(cua) as amount
    from event_logs
    where en='e_crt' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
    group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), cut, pt, oid
  ) as tmp
  select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt
) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 10. Order amount Sqoop (overall):

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,order_amount,created

-- 11. Successfully paid order count HQL:

from (
  select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, count(distinct oid) as orders
  from event_logs
  where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
  group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt')
) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 12. Successfully paid order count Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,success_orders,created

-- 13. Successfully paid order amount HQL:

from (
  from (
    select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, oid, max(order_info(oid)) as amount
    from event_logs
    where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
    group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt'), oid
  ) as tmp
  select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt
) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 14. Successfully paid order amount Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,revenue_amount,created

-- 15. Cumulative amount of successfully paid orders to date, HQL:

from (
  from (
    select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, oid, max(order_info(oid)) as amount
    from event_logs
    where en='e_cs' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
    group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt'), oid
  ) as tmp
  select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt
) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),cast(amount as int),'revenue') as amount,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),cast(sum(values) as int),'revenue') as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'revenue') as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'revenue') as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'revenue') as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'revenue') as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'revenue') as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'revenue') as amount,dt group by pl,dt,cut

-- 16. Cumulative amount of successfully paid orders to date, Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,total_revenue_amount,created

-- 17. Refunded order count HQL:

from (
  select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, count(distinct oid) as orders
  from event_logs
  where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
  group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt')
) as tmp
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,orders
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),orders,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as orders,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as orders,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as orders,dt group by pl,dt,cut

-- 18. Refunded order count Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,refund_orders,created

-- 19. Refunded order amount HQL:

from (
  from (
    select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, oid, max(order_info(oid)) as amount
    from event_logs
    where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
    group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt'), oid
  ) as tmp
  select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt
) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),amount,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),sum(values) as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),sum(values) as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),sum(values) as amount,dt group by pl,dt,cut

-- 20. Refunded order amount Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,refund_amount,created

-- 21. Cumulative refunded amount to date, HQL:

from (
  from (
    select order_info(oid,'pl') as pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as date, order_info(oid,'cut') as cut, order_info(oid,'pt') as pt, oid, max(order_info(oid)) as amount
    from event_logs
    where en='e_cr' and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000
    group by order_info(oid,'pl'), from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), order_info(oid,'cut'), order_info(oid,'pt'), oid
  ) as tmp
  select pl,date,cut,pt,sum(amount) as amount group by pl,date,cut,pt
) as tmp2
insert overwrite table stats_order_tmp1 select pl,date,cut,pt,amount
insert overwrite table stats_order_tmp2 select platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(date),currency_type_convert(cut),payment_type_convert(pt),cast(amount as int),'refund') as amount,date

from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by dt,cut,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by dt,cut
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert('all'),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert(pt),cast(sum(values) as int),'refund') as amount,dt group by pl,dt,pt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert('all'),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by pl,dt
from stats_order_tmp1 insert into table stats_order_tmp2 select platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),order_total_amount(platform_convert(pl),date_convert(dt),currency_type_convert(cut),payment_type_convert('all'),cast(sum(values) as int),'refund') as amount,dt group by pl,dt,cut

-- 22. Cumulative refunded amount to date, Sqoop:

sqoop export --connect jdbc:mysql://node1:3306/report --username hive --password hive --table stats_order --export-dir /hive/bigdater.db/stats_order_tmp2/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id --columns platform_dimension_id,date_dimension_id,currency_type_dimension_id,payment_type_dimension_id,total_refund_amount,created

The figures below show the browse depth results (pv1, pv2, pv3, ...).

Figure 9.1  Browse depth

Figure 9.2  The stats_view_depth table

Figure 9.3  The stats_view_depth_tmp table

Chapter 10  Project Tuning

1. Purpose of tuning
Make full use of the machines' capacity so that MapReduce jobs finish faster, and so that a limited set of machines can support enough concurrent MapReduce jobs.

2. Overview
From the internal execution model of a MapReduce job we know that a job consists of a mapper phase and a reducer phase. The mapper phase includes reading the data, the map processing, and writing the output (sort and merge); the reducer phase includes fetching the mapper output, merging the data (sort and merge), the reduce processing, and writing the output. Of these sub-phases, the ones that offer the most room for tuning are the map output, the reducer-side merge, and the number of reducers. In other words, although performance tuning covers CPU, memory, disk I/O, and network, the MapReduce execution flow tells us that the main levers are memory, disk I/O, and network. Tuning a MapReduce job is mostly about reducing network transfer and disk I/O, so the tuning discussed here covers five areas: server tuning, code tuning, mapper tuning, reducer tuning, and runner tuning.

3. Server tuning
Server tuning includes server parameter tuning and JVM tuning. Since HBase is used as the storage for the raw data being analyzed, some HBase tuning is also needed. Beyond parameter tuning, JVM tuning is required just as for any other Java program.

HDFS tuning
1. dfs.datanode.failed.volumes.tolerated: the number of volumes allowed to fail; the default of 0 means the datanode tolerates no disk failure. When multiple disks are mounted, this value can be raised.
2. dfs.replication: replication factor, default 3.
3. dfs.namenode.handler.count: number of concurrent handler threads on the namenode, default 10.
4. dfs.datanode.handler.count: number of concurrent handler threads between datanodes, default 10.
5. dfs.datanode.max.transfer.threads: number of concurrent threads a datanode uses for data streaming, default 4096.
Generally set this to about 85%–90% of the Linux open-file limit. Check the limit with ulimit -a and change it in /etc/security/limits.conf; do not set it too large.
Add at the end of the file:
* soft nofile 65535
* hard nofile 65535
Note: the file-handle limit should not be too large; any value below 1000000 is acceptable, and -1 is generally not used.
Troubleshooting: if a very large limit causes an "unable to load session" message at the next login, fix the file in single-user mode.
Single-user mode:
Press 'a' at boot to enter the selection screen, press 'e' to enter the kernel edit screen, select the second line starting with 'kernel...', press 'e' to edit it, append a space followed by single, press Enter to return to the edit screen, and press 'b' to boot into single-user mode; after booting, restore the file, save, exit, and reboot.
6. io.file.buffer.size: buffer size for reads and writes, default 4096; usually left alone, and recommended to be a multiple of 4096 (a multiple of the physical page size).

HBase tuning
1. Set the regionserver heap size; the default is 1 GB, 4 GB is recommended.
Change HBASE_HEAPSIZE=4g in conf/hbase-env.sh.
2. hbase.regionserver.handler.count: client handler thread count, default 10. Rule of thumb: use a smaller value when puts and scans dominate and a larger value when gets and deletes dominate, to avoid memory problems caused by frequent GC.
3. Customize HBase splitting and compaction. By default a region splits automatically when it reaches hbase.hregion.max.filesize (10 GB); 20–500 regions per regionserver is recommended. Compaction is a very important HBase management mechanism but is very memory- and CPU-intensive, so on heavily loaded machines it is usually better to disable automatic compaction and trigger it manually.
4. hbase.balancer.period: load-balancer period, default 300000 (5 minutes); on heavily loaded clusters this value can be increased.
5. hfile.block.cache.size: fraction of the heap used for the HFile block cache, default 0.4. Increase it on read-heavy systems and decrease it on write-heavy systems, but do not set it to 0.
6. hbase.regionserver.global.memstore.upperLimit: upper bound on the fraction of heap used by memstores, default 0.4; when it is reached, a flush writes the contents to disk.
7. hbase.regionserver.global.memstore.lowerLimit: lower bound on the memstore heap fraction, default 0.38; after a flush the memstore usage must not exceed this value.
8. hbase.hregion.memstore.flush.size: a memstore is flushed when it exceeds this size, default 134217728 (128 MB).
9. hbase.hregion.memstore.block.multiplier: memstore blocking multiplier, default 4; writes are blocked when a memstore exceeds 4 * hbase.hregion.memstore.flush.size, which can ultimately lead to an OOM error.

MapReduce tuning
1. mapreduce.task.io.sort.factor: number of files opened at once during merge sorting, default 10.
2. mapreduce.task.io.sort.mb: memory used for merge sorting and for buffering mapper output, default 100 MB.
3. mapreduce.map.sort.spill.percent: threshold at which the map output buffer is spilled, default 0.80.
4. mapreduce.reduce.shuffle.parallelcopies: number of threads a reducer uses to copy map output, default 5.
5. mapreduce.reduce.shuffle.input.buffer.percent: fraction of the heap used when the reducer copies map output, default 0.70; raising it reduces disk spills of map data and can improve performance.
6. mapreduce.reduce.shuffle.merge.percent: threshold at which the reducer's shuffle starts merging output and spilling to disk, default 0.66; if possible, raising it reduces the number of spills and improves performance. Used together with mapreduce.reduce.shuffle.input.buffer.percent.
7. mapreduce.task.timeout: timeout for task progress reports, default 600000 (10 minutes); 0 disables the check.

4. Code tuning
Code tuning mainly means hoisting objects that are created repeatedly in the mapper and reducer out of the hot path, the same as for any ordinary Java program.

5. Mapper tuning
Mapper tuning has a single goal: reduce the amount of output. This is done by adding a combine phase and by compressing the map output; a short sketch follows.
About combiners:
A custom combiner extends the Reducer class. It takes the map output key/value pairs as both its input and output types, and its purpose is to reduce network output by merging part of the data on the map node.
It is best suited to numeric map output that is easy to aggregate.
Compression:
Enable compression and choose the codec when submitting the job.
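Below is a minimal sketch of setting a combiner and map-output compression at job submission time. The configuration keys and IntSumReducer are standard Hadoop; the class name, job name, and the assumption that Snappy is available on the cluster are illustrative, and the mapper, input, and output settings are omitted for brevity.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MapperTuningExample {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        // compress the intermediate map output to cut shuffle network traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "combine-and-compress-demo");
        job.setJarByClass(MapperTuningExample.class);
        // a combiner only works when the map output can be merged associatively,
        // e.g. numeric counts; IntSumReducer sums per-key counts on the map side
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}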
6. Reducer tuning
Reducer tuning is done through parameter tuning and by choosing the number of reducers.
Choosing the number of reducers:
Requirement: the result with one reducer and with multiple reducers must be identical; adding reducers must not change the output.
Rule of thumb: for a MapReduce job on a Hadoop cluster, once the map phase reaches 100%, the reduce phase should reach about 33% as early as possible; this can be checked with hadoop job -status job_id or on the web UI.
Reason: the number of map processes is defined by the RecordReaders returned by the InputFormat, while the reducer consists of three parts: reading the mapper output, merging all of the output, and the reduce processing itself. The first part depends on the map phase, so when the data volume is large and a single reducer cannot meet the performance requirement, the problem can be solved by increasing the number of reducers.
Advantage: makes full use of the cluster.
Disadvantage: some MapReduce jobs cannot benefit from multiple reducers, for example top-N jobs.

7. Runner tuning
Runner tuning means setting job parameters at submission time, either in code or in the XML configuration files.
Items 1–8 are illustrated in ActiveUserRunner (the before and configure methods), and item 9 in TransformerBaseRunner (the initScans method).

1. mapred.child.java.opts: JVM options for the child task processes, effective for both map and reduce tasks, default -Xmx200m.
2. mapreduce.map.java.opts: JVM options for the map-phase child process; empty by default, in which case mapred.child.java.opts is used.
3. mapreduce.reduce.java.opts: JVM options for the reduce-phase child process; empty by default, in which case mapred.child.java.opts is used.
4. mapreduce.job.reduces: number of reducers, default 1; can be changed with job.setNumReduceTasks.
5. mapreduce.map.speculative: whether speculative execution is enabled for the map phase, default true; in practice false is usually better. Can be set with job.setMapSpeculativeExecution.
6. mapreduce.reduce.speculative: whether speculative execution is enabled for the reduce phase, default true; in practice false is usually better. Can be set with job.setReduceSpeculativeExecution.
7. mapreduce.map.output.compress: whether map output compression is enabled, default false; set it to true when network transfer needs to be reduced.
8. mapreduce.map.output.compress.codec: the map output compression codec, default org.apache.hadoop.io.compress.DefaultCodec; SnappyCodec is recommended (older versions required installing it separately; see http://www.cnblogs.com/chengxin1982/p/3862309.html).
9. HBase parameters
By default HBase returns one row at a time, so a mapper fetches the next row from HBase after processing each row. Setting the scan cache lets each RPC fetch multiple rows and reduces network round trips, as in the sketch below.
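A minimal sketch of setting the scan cache for an HBase-backed MapReduce job. Scan and TableMapReduceUtil are standard HBase APIs; the trivial mapper, the cache value of 500, and the use of the project's eventlog table name are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanCacheExample {

    // trivial mapper that just emits the row key; stands in for the real event-log mapper
    public static class RowKeyMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(key.get()), new Text(""));
        }
    }

    public static void configure(Job job) throws Exception {
        Scan scan = new Scan();
        scan.setCaching(500);        // fetch 500 rows per RPC instead of one at a time
        scan.setCacheBlocks(false);  // avoid polluting the block cache during a full scan
        TableMapReduceUtil.initTableMapperJob("eventlog", scan, RowKeyMapper.class,
                Text.class, Text.class, job);
    }
}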

Chapter 11  Course Design Summary

This is the second time I have worked on this topic, but I still gained a great deal from completing the whole project, because it was my first hands-on project after teaching myself the Hadoop components; it comes from one of the three practical projects in the Beijing 尚学堂 class 1903 big data course. The teaching this semester effectively filled the gaps left from my earlier self-study, and the project made me realize how many shortcomings I still have. I ran into many problems along the way. Although the code was provided by 尚学堂, every piece of it I worked through made clear that the project was not easy: much of the code took me a long time to understand, and I am still not familiar with the coding conventions used in the project, since I usually do not follow a standard style, which made the heavy use of design patterns hard to follow. I also ran into many problems while configuring the environment: some were careless mistakes in the configuration files, and some came from not understanding networking and the principles of these components well enough, so I was not always sure which file on the cluster I was supposed to configure. Integrating several Hadoop components so that they run correctly and share data also caused some trouble. Looking back at this project, I can clearly see my weaknesses: even on a second attempt I still made mistakes I had not made before, all caused by my own carelessness and gaps in knowledge. In my future studies I need to work harder so that my abilities can improve further.

References

[1] Tom White (translated by 周敏奇 et al.). 《Hadoop权威指南》. 清华大学出版社.
[2] 陈嘉恒. 《Hadoop实践》. 机械工业出版社, 2011.
[3] 吴刚山. 《基于Hadoop与Spark的大数据开发实战》. 人民邮电出版社, 2018.
[4] 林子雨, 赖永炫, 陶继平. 《Spark编程基础》. 人民邮电出版社, 2018.
[5] 孟小峰, 慈祥. 大数据管理:概念、技术与挑战[J]. 计算机研究与发展, 2013.
[6] 陈吉荣, 乐嘉锦. 基于Hadoop生态系统的大数据解决方案综述[J]. 计算机工程与科学.