Spark Log Analysis: Log File Preprocessing (Cleaning and Filtering, Data Parsing, Data Integration, Data Correction, JSON Flattening)
迪丽瓦拉
2024-05-31 09:28:29

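Log collection: Flume writes the logs to HDFS, partitioned by day, with one directory per day.
Spark: processes and transforms the log files and writes the result back to HDFS, only into a different directory. The processed files are then mapped to a table in the ODS layer of the data warehouse.
Alternatively:
Hive + UDF: processes and transforms the log files and writes the result back to HDFS, only into a different directory. The processed files are then mapped to a table in the ODS layer of the data warehouse.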

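Spark preprocessing: flatten the JSON with the fastjson library. The event field is converted to a Map; every other attribute becomes a column.
Business rules:
Drop records in which uid, uuid, mac, imei, androidId and imsi are all empty.
Drop records that lack a key field: eventid, sessionId and event are all required.
Distribute the longitude/latitude dictionary (geohash code to province, city, district) as a broadcast variable.
Look up the province, city and district with the Map's get method.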
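The two filter rules above can be sketched in plain Scala without Spark. This is only an illustration: the object name `LogFilterSketch` and the helper `present` are made up for this example and are not part of the original project.

```scala
object LogFilterSketch {
  // True when the string is non-null and contains at least one non-whitespace character.
  private def present(s: String): Boolean = s != null && s.trim.nonEmpty

  // A record is kept when at least one identifier is present and
  // eventid, sessionId and event are all present.
  def isValid(ids: Seq[String],
              eventId: String,
              sessionId: String,
              event: Map[String, String]): Boolean =
    ids.exists(present) && present(eventId) && present(sessionId) && event != null
}
```

Checking each identifier separately avoids concatenating them into one string, where a null identifier would show up as the literal text "null".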

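Hive preprocessing: map the raw files to a table with a single column, line, where each row holds one JSON string.
Flattening:

get_json_object(line,'$.eventid') as eventid,
str_to_map(regexp_replace(get_json_object(line,'$.event'),'\\{|\\}|\\"','')) as event,
get_json_object(line,'$.user.phone.mac') as mac

Business rules: express them as conditions in the WHERE clause.
Longitude/latitude conversion: compute the geohash code with a UDF and join against the geohash dictionary.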
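To make the str_to_map(regexp_replace(...)) expression easier to follow, here is a rough plain-Scala emulation of what it does to one event object. It assumes the default str_to_map delimiters (',' between pairs, ':' between key and value) and assumes each pair is split only at the first ':'. That second point is worth checking on your Hive version. `StrToMapSketch` and `eventToMap` are hypothetical names.

```scala
object StrToMapSketch {
  // Strip braces and double quotes, then split into key/value pairs.
  def eventToMap(eventJson: String): Map[String, String] = {
    val stripped = eventJson.replaceAll("\\{|\\}|\"", "")
    stripped.split(",").filter(_.nonEmpty).map { pair =>
      // Limit 2 keeps any further ':' (for example in a URL) inside the value.
      val kv = pair.split(":", 2)
      if (kv.length == 2) (kv(0), kv(1)) else (kv(0), null: String)
    }.toMap
  }
}
```

The approach relies on plain string splitting, so a value that itself contains a comma would be cut into several pairs. For the sample events used in this post that does not happen.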

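Log preprocessing steps:

Cleaning and filtering:
        1. Remove deprecated fields from the JSON: email, phoneNbr, gender, isLogin, addr, isRegister
        2. Drop records in which uid, uuid, mac, imei, androidId and imsi are all empty
        3. Drop records that lack a key field: eventid, sessionId and event are all required
        4. Drop records whose JSON is malformed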
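Rule 1 can be illustrated on a plain Scala Map. This is a sketch, not the project's code: `DeprecatedFieldsSketch` is a hypothetical name, and the field set is copied from the list above. Note that the sample records later in this post spell the last field isRegistered, so the set may need adjusting to match the real data.

```scala
object DeprecatedFieldsSketch {
  // Field names taken from rule 1 above.
  val Deprecated: Set[String] =
    Set("email", "phoneNbr", "gender", "isLogin", "addr", "isRegister")

  // Return a copy of the user attributes without the deprecated fields.
  def dropDeprecated(user: Map[String, String]): Map[String, String] =
    user.filter { case (key, _) => !Deprecated.contains(key) }
}
```

In the Spark job further down the same effect is achieved implicitly: the deprecated fields are never read from the JSON, so they never reach the output.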

Data parsing:
        Flatten the JSON. The event field is kept as it is and is not flattened.

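Data integration:
        1. Resolve the longitude/latitude in each record to province, city and district (county), to make regional analysis easier
        2. Map IP addresses
        3. Integrate business-district information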

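Data correction:
        1. Backfill the guid
        2. Normalize field names
        3. Normalize units: all timestamps are in seconds
        4. Normalize types: all timestamps are long integers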
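Rules 3 and 4 can be sketched as a small helper that turns a timestamp string into epoch seconds as a Long. The threshold used to tell milliseconds from seconds is an assumption made for this example, and `TimestampSketch` is a hypothetical name.

```scala
object TimestampSketch {
  // Values at or above 10^11 are treated as milliseconds (assumption:
  // as seconds that would be a date thousands of years in the future).
  private val MillisThreshold = 100000000000L

  def toEpochSeconds(raw: String): Long = {
    val value = raw.trim.toLong
    if (value >= MillisThreshold) value / 1000 else value
  }
}
```

The sample records in this post carry 13-digit millisecond timestamps such as "1575526150000", which this helper would reduce to 1575526150.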

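JSON flattening:
        Log file format: JSON with nested objects. In the table, the nested part that is kept (event) is stored as a map column.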
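The flattening idea can be shown on nested Scala Maps: every nested object is lifted to the top level except the keys that should stay as maps. This is a minimal sketch under the assumption that leaf names do not collide (which holds for the sample records in this post). `FlattenSketch` is a hypothetical name.

```scala
object FlattenSketch {
  // Lift nested maps to the top level, keeping the keys listed in `keep` as maps.
  def flatten(record: Map[String, Any],
              keep: Set[String] = Set("event")): Map[String, Any] =
    record.flatMap {
      case (key, nested: Map[_, _]) if !keep.contains(key) =>
        flatten(nested.asInstanceOf[Map[String, Any]], keep)
      case (key, value) =>
        Map(key -> value)
    }
}
```

With a record shaped like the sample logs, user.uid and user.phone.imei end up as the top-level keys uid and imei, while event is still a map.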

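Example:

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>LogTest</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>scala-demo-project</name>
  <url>http://www.example.com</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency><groupId>org.scala-lang</groupId><artifactId>scala-library</artifactId><version>2.11.8</version></dependency>
    <dependency><groupId>org.apache.kafka</groupId><artifactId>kafka_2.11</artifactId><version>0.11.0.2</version></dependency>
    <dependency><groupId>org.apache.commons</groupId><artifactId>commons-dbcp2</artifactId><version>2.1.1</version></dependency>
    <dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>5.1.47</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming-kafka-0-10_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>org.scalikejdbc</groupId><artifactId>scalikejdbc_2.11</artifactId><version>3.1.0</version></dependency>
    <dependency><groupId>org.scalikejdbc</groupId><artifactId>scalikejdbc-config_2.11</artifactId><version>3.1.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-hive_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>org.apache.spark</groupId><artifactId>spark-graphx_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId><version>1.2.69</version></dependency>
    <dependency><groupId>ch.hsr</groupId><artifactId>geohash</artifactId><version>1.3.0</version></dependency>
    <dependency><groupId>org.mongodb.spark</groupId><artifactId>mongo-spark-connector_2.11</artifactId><version>2.2.0</version></dependency>
    <dependency><groupId>com.alibaba</groupId><artifactId>druid</artifactId><version>1.1.10</version></dependency>
    <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId><version>2.9.3</version></dependency>
    <dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-api-scala_2.11</artifactId><version>11.0</version></dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>cn.kgc.kafak.demo.ThreadProducer</mainClass>
                </transformer>
              </transformers>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>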

Data integration:
1. Connect to MySQL and create the lookup table geomap

geomap.sql:
/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.6.50
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;

-- `lag` holds the longitude, `lat` the latitude
create table `geomap` (
    `lag` float,
    `lat` float,
    `province` varchar (120),
    `city` varchar (120),
    `district` varchar (120)
);
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.41','39.9316','北京','北京市','东城区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.36','39.9305','北京','北京市','西城区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.485','39.9484','北京','北京市','朝阳区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.286','39.8585','北京','北京市','丰台区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.223','39.9056','北京','北京市','石景山区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.298','39.9593','北京','北京市','海淀区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.101','39.9404','北京','北京市','门头沟区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.143','39.7479','北京','北京市','房山区');
insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.657','39.9097','北京','北京市','通州区');
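insert into `geomap` (`lag`, `lat`, `province`, `city`, `district`) values('116.654','40.1302','北京','北京市','顺义区');

Connect to MySQL, switch to the database you created, and run the script to import the data:
mysql> source /root/tmp_data/geomap.sql

2. Write the data integration script (GeoMapDemo.scala). Running it writes the dictionary files to the local disk.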

import java.util.Properties

import ch.hsr.geohash.GeoHash
import org.apache.spark.sql.SparkSession

// Builds the geohash lookup dictionary from the geomap table.
object GeoMapDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession.
    // Each longitude/latitude pair is encoded as a 5-character geohash;
    // points in the same cell (e.g. within Dongcheng District) share the same code.
    val spark = SparkSession.builder()
      .appName(this.getClass.getName)
      .master("local[*]")
      .getOrCreate()

    // Read the geomap table from MySQL with Spark
    val props = new Properties()
    props.put("user", "root")
    props.put("password", "123")
    val df = spark.read.jdbc("jdbc:mysql://192.168.58.203:3306/testdb", "geomap", props)
    import spark.implicits._

    // External reference data is usually distributed as a broadcast variable
    /**
     * +---------+---------+----------+-----------+--------------+
     * | lag     | lat     | province | city      | district     |
     * +---------+---------+----------+-----------+--------------+
     * |  116.41 | 39.9316 | 北京     | 北京市    | 东城区       |
     * |  116.36 | 39.9305 | 北京     | 北京市    | 西城区       |
     * | 116.485 | 39.9484 | 北京     | 北京市    | 朝阳区       |
     * | 116.286 | 39.8585 | 北京     | 北京市    | 丰台区       |
     * | 116.223 | 39.9056 | 北京     | 北京市    | 石景山区     |
     * | 116.298 | 39.9593 | 北京     | 北京市    | 海淀区       |
     * | 116.101 | 39.9404 | 北京     | 北京市    | 门头沟区     |
     * | 116.143 | 39.7479 | 北京     | 北京市    | 房山区       |
     * | 116.657 | 39.9097 | 北京     | 北京市    | 通州区       |
     * | 116.654 | 40.1302 | 北京     | 北京市    | 顺义区       |
     * +---------+---------+----------+-----------+--------------+
     */
    val result = df.map(row => {
      val lng = row.getAs[Double]("lag")
      val lat = row.getAs[Double]("lat")
      val province = row.getAs[String]("province")
      val city = row.getAs[String]("city")
      val district = row.getAs[String]("district")
      // Compute the geohash code from latitude and longitude. The last argument is
      // the number of characters: more characters give a smaller, more precise cell.
      val geoCode = GeoHash.geoHashStringWithCharacterPrecision(lat, lng, 5)
      // Return a tuple
      (geoCode, province, city, district)
    }).toDF("geo", "province", "city", "district") // turn it into a dictionary
    // result.show(100, false)

    /**
     * The rows above become the dictionary below:
     * key = geohash, value = province, city, district
     *
     * |  geo|province|city|district|
     * +-----+--------+----+--------+
     * |wx4g0|      北京| 北京市|     东城区|
     * |wx4ep|      北京| 北京市|     西城区|
     * |wx4g6|      北京| 北京市|     朝阳区|
     * |wx4dy|      北京| 北京市|     丰台区|
     * |wx4eh|      北京| 北京市|    石景山区|
     * |wx4eq|      北京| 北京市|     海淀区|
     * |wx4e1|      北京| 北京市|    门头沟区|
     * |wx4d4|      北京| 北京市|     房山区|
     * |wx4gn|      北京| 北京市|     通州区|
     * |wx4uq|      北京| 北京市|     顺义区|
     * +-----+--------+----+--------+
     */
    result.write.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/geo_dict/output") // write to the local disk
    // result.write.parquet("hdfs:///user/data/dict/geo_dict/output") // write to HDFS
    spark.close()
  }
}
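For readers who want to see what the geohash call computes, here is a small standalone encoder written for illustration. It is not the ch.hsr.geohash implementation, and `GeoHashSketch` is a hypothetical name. It follows the standard geohash scheme: bisect the longitude and latitude ranges alternately, starting with longitude, and emit one base-32 character per 5 bits.

```scala
object GeoHashSketch {
  private val Base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  def encode(lat: Double, lng: Double, precision: Int): String = {
    var latLo = -90.0
    var latHi = 90.0
    var lngLo = -180.0
    var lngHi = 180.0
    val sb = new StringBuilder
    var useLng = true // bits alternate, longitude first
    var bitCount = 0
    var current = 0
    while (sb.length < precision) {
      if (useLng) {
        val mid = (lngLo + lngHi) / 2
        if (lng >= mid) { current = (current << 1) | 1; lngLo = mid }
        else { current = current << 1; lngHi = mid }
      } else {
        val mid = (latLo + latHi) / 2
        if (lat >= mid) { current = (current << 1) | 1; latLo = mid }
        else { current = current << 1; latHi = mid }
      }
      useLng = !useLng
      bitCount += 1
      if (bitCount == 5) { // 5 bits make one base-32 character
        sb.append(Base32.charAt(current))
        bitCount = 0
        current = 0
      }
    }
    sb.toString
  }
}
```

For the first geomap row (latitude 39.9316, longitude 116.41) and precision 5 this yields "wx4g0", the same code the dictionary above shows for 东城区. A 5-character cell is roughly 5 km on a side, so a dictionary with a single point per district only matches records that fall into that one cell.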

3. Connect to MySQL and create the idmp table

idmp.sql:
/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.6.50
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;

create table `idmp` (
    `id_hashcode` BIGINT,
    `guid` BIGINT
);
insert into `idmp` (`id_hashcode`, `guid`) values('-1725068780','111');
insert into `idmp` (`id_hashcode`, `guid`) values('1784987109','222');

Connect to MySQL, switch to the database you created, and run the script to import the data:
mysql> source /root/tmp_data/idmp.sql

4. Write the script that prepares the dictionary for the guid backfill in the next step (IdmpDemo.scala). Running it writes the files to the local disk.

import java.util.Properties

import org.apache.spark.sql.SparkSession

object IdmpDemo {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .appName(this.getClass.getName)
      .master("local[*]")
      .getOrCreate()

    // Read the idmp table from MySQL with Spark
    val props = new Properties()
    props.put("user", "root")
    props.put("password", "123")
    val df = spark.read.jdbc("jdbc:mysql://192.168.58.203:3306/testdb", "idmp", props)
    import spark.implicits._

    // External reference data is usually distributed as a broadcast variable
    val result = df.map(row => {
      val id_hashcode = row.getAs[Long]("id_hashcode")
      val guid = row.getAs[Long]("guid")
      (id_hashcode, guid)
    }).toDF("id_hashcode", "guid")

    result.write.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/idmp/output") // write to the local disk
    // result.write.parquet("hdfs:///user/data/dict/geo_dict/output") // write to HDFS
    spark.close()
  }
}
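The idmp dictionary maps the hash code of a device or user identifier to a guid. A minimal sketch of the lookup, assuming the identifiers are tried in a fixed order and the first hit wins. `GuidBackfillSketch` is a hypothetical name, and skipping null or empty identifiers is a choice made for this example.

```scala
object GuidBackfillSketch {
  // Return the guid of the first identifier whose hash code is in the dictionary.
  def backfill(ids: Seq[String], idmp: collection.Map[Long, Long]): Option[Long] =
    ids.iterator
      .filter(id => id != null && id.nonEmpty) // skip missing identifiers
      .map(id => idmp.get(id.hashCode.toLong))
      .collectFirst { case Some(guid) => guid }
}
```

Returning an Option leaves it to the caller to decide what to store when no identifier matches, for example a sentinel such as Long.MinValue.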

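Cleaning and filtering, data parsing, data integration, data correction, JSON flattening:

Create app.log locally and write the following data into it: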

{"eventid":"webStayEvent","event":{"pgid":"790","title":"","url":"http://www.kgcedu.cn/aca/pg790"},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575526150000"}
{"eventid":"pgviewEvent","event":{"pgid":"650","title":"","url":"http://www.kgcedu.cn/acb/pg650","referrer_host":"http://www.sina.com","referrer":"","utm_campaign":"","utm_source":"","utm_medium":"","utm_term":"","utm_content":""},"user":{"uid":"019415","account":"","email":"","phoneNbr":"13444631462","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"2739315947046853","mac":"85-94-a5-10-61-ea-6e","imsi":"3470586308518905","osName":"macos","osVer":"10.0","androidId":"","resolution":"800*600","deviceType":"MEIZU_ML5","deviceId":"IKy8C2","uuid":"uI595WmqORoTB7aE"},"app":{"appid":"cn.kgc.mall","appVer":"2.0.1","release_ch":"柠檬助手","promotion_ch":"01"},"loc":{"areacode":110114004,"longtitude":116.2965574090751,"latitude":40.155567576333496,"carrier":"ISP07","netType":"N","cid_sn":"235782374801","ip":"141.46.13.59"},"sessionId":"sid-a533717b-af14-4ddf-8d2d-38beb904bf82"},"timestamp":"1575526083000"}
{"eventid":"webStayEvent","event":{"pgid":"790","title":"","url":"http://www.kgcedu.cn/aca/pg790"},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575526153000"}
{"eventid":"webStayEvent","event":{"pgid":"790","title":"","url":"http://www.kgcedu.cn/aca/pg790"},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575526154000"}
{"eventid":"viewConentDetailEvent","event":{"pgId":"790","contentType":"","contentID":"00279","contentTile":"","contentChannel":"","contentTag":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575526558000"}
{"eventid":"viewConentDetailEvent","event":{"pgId":"790","contentType":"","contentID":"00279","contentTile":"","contentChannel":"","contentTag":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575526909000"}
{"eventid":"viewConentDetailEvent","event":{"pgId":"790","contentType":"","contentID":"00279","contentTile":"","contentChannel":"","contentTag":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575527404000"}
{"eventid":"adShowEvent","event":{"adId":"1","pgId":"790","adPosition":"","adType":"","adTitle":"","adSource":"","adResourceID":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575529211000"}
{"eventid":"adClickEvent","event":{"adId":"7","pgId":"790","adPosition":"","adType":"","adTitle":"","adSource":"","adResourceID":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575529309000"}
{"eventid":"viewConentDetailEvent","event":{"pgId":"267","contentType":"","contentID":"00583","contentTile":"","contentChannel":"","contentTag":""},"user":{"uid":"609803","account":"","email":"","phoneNbr":"18998320872","birthday":"","isRegistered":"","isLogin":"","addr":"","gender":"","phone":{"imei":"8534282958025218","mac":"88-2c-e6-25-db-74-87","imsi":"8904769954837648","osName":"android","osVer":"10.0","androidId":"28616e716a8a1476","resolution":"1356*768","deviceType":"VIVO_MATE","deviceId":"","uuid":"cAOUyycgbGLjpBjs"},"app":{"appid":"cn.kgc.mall","appVer":"1.9.0","release_ch":"OPPO软件商店","promotion_ch":"12"},"loc":{"areacode":361030203,"longtitude":116.41679639900781,"latitude":26.608591452880322,"carrier":"ISP04","netType":"4G","cid_sn":"551564839366","ip":"183.102.214.100"},"sessionId":"sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617"},"timestamp":"1575530061000"}

You can inspect the data with an online JSON parser, for example SO JSON (online JSON parsing, formatting and validation).

5. Write the case class with the fields to keep (LogDataDemo.scala)

// The fields selected from each log record
case class LogDataDemo(
  var guid: Long,
  eventid: String,
  event: Map[String, String],
  uid: String,
  imei: String,
  mac: String,
  imsi: String,
  osName: String,
  osVer: String,
  androidId: String,
  resolution: String,
  deviceType: String,
  deviceId: String,
  uuid: String,
  appid: String,
  appVer: String,
  release_ch: String,
  promotion_ch: String,
  areacode: String,
  longtitude: Double,
  latitude: Double,
  carrier: String,
  netType: String,
  cid_sn: String,
  ip: String,
  sessionId: String,
  timestamp: Long,
  var province: String = "unkown",
  var city: String = "unkown",
  var district: String = "unkown"
)

6. Write the log processing script (LogDataPreprocessDemo.scala). Running it writes the output files to the local disk.

import java.util

import ch.hsr.geohash.GeoHash
import com.alibaba.fastjson.JSON
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.{Row, SparkSession}

// Log data cleaning. Compile and run it together with LogDataDemo.
object LogDataPreprocessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getName)
      .master("local[*]")
      .getOrCreate()

    // Read the log files to process: the data directory of the IDEA project, which holds app.log
    val df = spark.read.textFile("file:///E:/vmwork/JavaTest/logtest/data/logs/2023-03-01")
    import spark.implicits._

    // External reference data is usually distributed as broadcast variables.
    // Geohash dictionary: read from file and converted to a Map
    /**
     * |wx4dp|北京      |北京市 |大兴区     |
     * |wx4vq|北京      |北京市 |怀柔区     |
     * |wx5k3|北京      |北京市 |平谷区     |
     * |wx5jd|北京      |北京市 |密云县     |
     * |wx4qp|北京      |北京市 |延庆县     |
     */
    val geoMap = spark.read.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/geo_dict/output/")
      .rdd
      .map({
        case Row(geo: String, province: String, city: String, district: String) =>
          (geo, (province, city, district)) // key -> (province, city, district)
      })
      .collectAsMap()
    // Broadcast the map with geohash as key and (province, city, district) as value
    val bcGeo = spark.sparkContext.broadcast(geoMap)

    // Dictionary for the guid backfill
    val idmpMap = spark.read.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/idmp/output")
      .rdd
      .map({
        case Row(id_hashcode: Long, guid: Long) => (id_hashcode, guid)
      })
      .collectAsMap()
    val bcIdMap = spark.sparkContext.broadcast(idmpMap)

    val result = df.map(line => {
      try {
        // Parse one line into a JSON object
        val jsonObject = JSON.parseObject(line)
        val eventId = jsonObject.getString("eventid")

        // java.util.Map -> Scala Map
        import scala.collection.JavaConversions._
        val event: Map[String, String] = jsonObject.getJSONObject("event")
          .getInnerMap()
          .asInstanceOf[util.Map[String, String]]
          .toMap

        val userObject = jsonObject.getJSONObject("user")
        val uid = userObject.getString("uid")

        val phoneObject = userObject.getJSONObject("phone")
        val imei = phoneObject.getString("imei")
        val mac = phoneObject.getString("mac")
        val imsi = phoneObject.getString("imsi")
        val osName = phoneObject.getString("osName")
        val osVer = phoneObject.getString("osVer")
        val androidId = phoneObject.getString("androidId")
        val resolution = phoneObject.getString("resolution")
        val deviceType = phoneObject.getString("deviceType")
        val deviceId = phoneObject.getString("deviceId")
        val uuid = phoneObject.getString("uuid")

        val appObject = userObject.getJSONObject("app")
        val appid = appObject.getString("appid")
        val appVer = appObject.getString("appVer")
        val release_ch = appObject.getString("release_ch")
        val promotion_ch = appObject.getString("promotion_ch")

        val locObject = userObject.getJSONObject("loc")
        val areacode = locObject.getString("areacode")
        val longtitude = locObject.getDouble("longtitude")
        val latitude = locObject.getDouble("latitude")
        val carrier = locObject.getString("carrier")
        val netType = locObject.getString("netType")
        val cid_sn = locObject.getString("cid_sn")
        val ip = locObject.getString("ip")

        val sessionId = userObject.getString("sessionId")
        val timestamp = jsonObject.getString("timestamp").toLong

        // Data cleaning:
        // drop records in which uid, uuid, mac, imei, androidId and imsi are all empty;
        // drop records that lack eventid, sessionId or event.
        // The identifiers are concatenated into one string; a null identifier is
        // appended as the text "null", which is then replaced with "".
        val sb = new StringBuilder()
        val flagFields = sb.append(uid).append(mac).append(imei).append(imsi)
          .append(androidId).append(uuid).toString().replaceAll("null", "")

        var logData: LogDataDemo = null
        // StringUtils.isNotBlank checks that a string is neither null nor blank
        if (StringUtils.isNotBlank(flagFields) && event != null
          && StringUtils.isNotBlank(sessionId) && StringUtils.isNotBlank(eventId)) {
          // The original passed release_ch in the resolution position by mistake,
          // which is why the sample output shows the release channel in the resolution column.
          logData = LogDataDemo(Long.MinValue, eventId, event, uid, imei, mac, imsi, osName, osVer,
            androidId, resolution, deviceType, deviceId, uuid, appid, appVer, release_ch, promotion_ch,
            areacode, longtitude, latitude, carrier, netType, cid_sn, ip, sessionId, timestamp)
        }
        logData
      } catch {
        case e: Exception => null // records that cannot be parsed become null
      }
    }).filter(_ != null) // drop the nulls, keeping only the valid records
      .map(bean => {
        // Fetch the broadcast dictionaries: get(key) ==> value
        val geoDict: collection.Map[String, (String, String, String)] = bcGeo.value
        val idMapDict: collection.Map[Long, Long] = bcIdMap.value

        // Longitude and latitude
        val longtitude = bean.longtitude
        val latitude = bean.latitude
        // Encode them and resolve the code to province, city and district
        val geo = GeoHash.geoHashStringWithCharacterPrecision(latitude, longtitude, 5)
        val maybeTuple = geoDict.get(geo)
        if (maybeTuple.isDefined) {
          val area: (String, String, String) = maybeTuple.get
          bean.province = area._1
          bean.city = area._2
          bean.district = area._3
        }

        // guid backfill: try the six identifiers in order and stop at the first hit
        val ids = Array(bean.imei, bean.imsi, bean.androidId, bean.uuid, bean.mac, bean.uid)
        var find = false
        for (id <- ids if !find && id != null) { // skip null ids to avoid a NullPointerException
          val maybeLong = idMapDict.get(id.hashCode.toLong)
          if (maybeLong.isDefined) {
            bean.guid = maybeLong.get
            find = true
          }
        }
        bean
      }).toDF()

    result.write.parquet("file:///E:/vmwork/JavaTest/logtest/data/applogsout") // final output
    // result.show(100, false)
    // println(result.count())
    spark.close()
  }
}

7. Write a script to inspect the contents of the generated local files (RedFileDemo.scala)

import org.apache.spark.sql.SparkSession

object RedFileDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getName)
      .master("local[*]")
      .getOrCreate()

    // spark.read.text("file:///E:/vmwork/JavaTest/logtest/data/logs/2023-03-01").show(100, 100)
    // spark.read.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/geo_dict/output").show(100, 100)
    // spark.read.parquet("file:///E:/vmwork/JavaTest/logtest/data/dict/idmp/output").show(100, 100)
    spark.read.parquet("file:///E:/vmwork/JavaTest/logtest/data/applogsout").show(100, 100)
    spark.close()
  }
}

Result:

+----+---------------------+----------------------------------------------------------------------------------------------------+------+----------------+--------------------+----------------+-------+-----+----------------+----------+----------+--------+----------------+-----------+------+----------+------------+---------+------------------+------------------+-------+-------+------------+---------------+----------------------------------------+-------------+--------+------+--------+
|guid|              eventid|                                                                                               event|   uid|            imei|                 mac|            imsi| osName|osVer|       androidId|resolution|deviceType|deviceId|            uuid|      appid|appVer|release_ch|promotion_ch| areacode|        longtitude|          latitude|carrier|netType|      cid_sn|             ip|                               sessionId|    timestamp|province|  city|district|
+----+---------------------+----------------------------------------------------------------------------------------------------+------+----------------+--------------------+----------------+-------+-----+----------------+----------+----------+--------+----------------+-----------+------+----------+------------+---------+------------------+------------------+-------+-------+------------+---------------+----------------------------------------+-------------+--------+------+--------+
| 111|         webStayEvent|                                  Map(pgid -> 790, title -> , url -> http://www.kgcedu.cn/aca/pg790)|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575526150000|  unkown|unkown|  unkown|
| 222|          pgviewEvent|Map(utm_source -> , url -> http://www.kgcedu.cn/acb/pg650, referrer_host -> http://www.sina.com, ...|019415|2739315947046853|85-94-a5-10-61-ea-6e|3470586308518905|  macos| 10.0|                |      柠檬助手| MEIZU_ML5|  IKy8C2|uI595WmqORoTB7aE|cn.kgc.mall| 2.0.1|      柠檬助手|          01|110114004| 116.2965574090751|40.155567576333496|  ISP07|      N|235782374801|   141.46.13.59|sid-a533717b-af14-4ddf-8d2d-38beb904bf82|1575526083000|  unkown|unkown|  unkown|
| 111|         webStayEvent|                                  Map(pgid -> 790, title -> , url -> http://www.kgcedu.cn/aca/pg790)|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575526153000|  unkown|unkown|  unkown|
| 111|         webStayEvent|                                  Map(pgid -> 790, title -> , url -> http://www.kgcedu.cn/aca/pg790)|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575526154000|  unkown|unkown|  unkown|
| 111|viewConentDetailEvent|Map(pgId -> 790, contentID -> 00279, contentTag -> , contentType -> , contentTile -> , contentCha...|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575526558000|  unkown|unkown|  unkown|
| 111|viewConentDetailEvent|Map(pgId -> 790, contentID -> 00279, contentTag -> , contentType -> , contentTile -> , contentCha...|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575526909000|  unkown|unkown|  unkown|
| 111|viewConentDetailEvent|Map(pgId -> 790, contentID -> 00279, contentTag -> , contentType -> , contentTile -> , contentCha...|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575527404000|  unkown|unkown|  unkown|
| 111|          adShowEvent|Map(adPosition -> , pgId -> 790, adType -> , adResourceID -> , adTitle -> , adSource -> , adId -> 1)|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575529211000|  unkown|unkown|  unkown|
| 111|         adClickEvent|Map(adPosition -> , pgId -> 790, adType -> , adResourceID -> , adTitle -> , adSource -> , adId -> 7)|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575529309000|  unkown|unkown|  unkown|
| 111|viewConentDetailEvent|Map(pgId -> 267, contentID -> 00583, contentTag -> , contentType -> , contentTile -> , contentCha...|609803|8534282958025218|88-2c-e6-25-db-74-87|8904769954837648|android| 10.0|28616e716a8a1476|  OPPO软件商店| VIVO_MATE|        |cAOUyycgbGLjpBjs|cn.kgc.mall| 1.9.0|  OPPO软件商店|          12|361030203|116.41679639900781|26.608591452880322|  ISP04|     4G|551564839366|183.102.214.100|sid-4e1f2cbe-2b30-45fc-b806-8ea90a4cd617|1575530061000|  unkown|unkown|  unkown|
+----+---------------------+----------------------------------------------------------------------------------------------------+------+----------------+--------------------+----------------+-------+-----+----------------+----------+----------+--------+----------------+-----------+------+----------+------------+---------+------------------+------------------+-------+-------+------------+---------------+----------------------------------------+-------------+--------+------+--------+

Problems encountered:

Error 1: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate
Solution: see the post with the same title on 房石阳明i's CSDN blog.

Error 2: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Could not read footer for file: FileStatus{path=file:/E:/
Solution: see the post with the same title on 房石阳明i's CSDN blog.
Note: this error can have many different causes, so check it against the other error messages. The fix above does not apply in every case.

Error 3: (null) entry in command string: null chmod 0644 <path>
Solution: see the post with the same title on 房石阳明i's CSDN blog.
