HdfsWriter writes TEXTFILE and ORCFILE files to a specified path on an HDFS file system; the written files can be associated with a Hive table. The sample job below reads a local text file with TxtFileReader and writes it to HDFS as an ORC file:
```json
{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/Users/shf/workplace/txtWorkplace/job/dataorcfull.txt"],
                        "encoding": "UTF-8",
                        "column": [
                            { "index": 0,  "type": "long" },
                            { "index": 1,  "type": "long" },
                            { "index": 2,  "type": "long" },
                            { "index": 3,  "type": "long" },
                            { "index": 4,  "type": "double" },
                            { "index": 5,  "type": "double" },
                            { "index": 6,  "type": "string" },
                            { "index": 7,  "type": "string" },
                            { "index": 8,  "type": "string" },
                            { "index": 9,  "type": "boolean" },
                            { "index": 10, "type": "date" },
                            { "index": 11, "type": "date" }
                        ],
                        "fieldDelimiter": "\t"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://xxx:port",
                        "fileType": "orc",
                        "path": "/user/hive/warehouse/writerorc.db/orcfull",
                        "fileName": "xxxx",
                        "column": [
                            { "name": "col1",  "type": "TINYINT" },
                            { "name": "col2",  "type": "SMALLINT" },
                            { "name": "col3",  "type": "INT" },
                            { "name": "col4",  "type": "BIGINT" },
                            { "name": "col5",  "type": "FLOAT" },
                            { "name": "col6",  "type": "DOUBLE" },
                            { "name": "col7",  "type": "STRING" },
                            { "name": "col8",  "type": "VARCHAR" },
                            { "name": "col9",  "type": "CHAR" },
                            { "name": "col10", "type": "BOOLEAN" },
                            { "name": "col11", "type": "DATE" },
                            { "name": "col12", "type": "TIMESTAMP" }
                        ],
                        "writeMode": "append",
                        "fieldDelimiter": "\t",
                        "compress": "NONE"
                    }
                }
            }
        ]
    }
}
```
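With a standard DataX installation, a job file like the one above is launched through DataX's usual entry point, e.g. `python bin/datax.py job/hdfswriter_job.json` (the job file path here is a placeholder). The writer's parameters are described below.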
**defaultFS**

- Description: the NameNode address of the target Hadoop HDFS file system, in the form `hdfs://ip:port`, e.g. `hdfs://127.0.0.1:9000`.
- Required: yes
- Default: none
**fileType**

- Description: the file format to write; currently only `"text"` (TEXTFILE) and `"orc"` (ORCFILE) are supported.
- Required: yes
- Default: none
**path**

- Description: the target directory on HDFS; HdfsWriter writes multiple files under this path according to the concurrency (channel) setting. To associate the output with a Hive table, set this to the table's storage path on HDFS. For example, if the Hive warehouse directory is `/user/hive/warehouse/` and table `hello` lives in database `test`, the path is `/user/hive/warehouse/test.db/hello`.
- Required: yes
- Default: none
**fileName**

- Description: the base file name for writing. At run time a random suffix is appended to this name to form the actual file name written by each thread.
- Required: yes
- Default: none
**column**

- Description: the fields to write. Writing only a subset of columns is not supported; to stay associated with the Hive table, specify the name and type of every column in the table, where `name` is the field name and `type` is the field type. For example:

```json
"column": [
    {
        "name": "userName",
        "type": "string"
    },
    {
        "name": "age",
        "type": "long"
    }
]
```

- Required: yes
- Default: none
**writeMode**

- Description: how HdfsWriter handles existing data under `path` before writing:
  - `append`: no pre-cleaning; files are written directly, with `fileName` plus a random suffix keeping names from conflicting;
  - `nonConflict`: if any file prefixed with `fileName` already exists under the directory, the job fails with an error.
- Required: yes
- Default: none
**fieldDelimiter**

- Description: the field delimiter used when writing. It must match the field delimiter of the Hive table you created, otherwise the data cannot be queried from the Hive table.
- Required: yes
- Default: none
**compress**

- Description: compression codec for the written HDFS files; when omitted, no compression is applied. text files support `gzip` and `bzip2`; orc files support `NONE` and `SNAPPY` (SNAPPY requires SnappyCodec to be installed). See the sketch after this list for a text-plus-gzip fragment.
- Required: no
- Default: no compression
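A minimal fragment, assuming a text-format target; these two keys would sit alongside the other entries of the writer's `parameter` object shown in the sample job above:

```json
{
    "fileType": "text",
    "compress": "gzip"
}
```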
**hadoopConfig**

- Description: advanced Hadoop client settings, such as the HA (NameNode high-availability) configuration below; fill the `rpc-address` values in with your actual NameNode `host:port` pairs:

```json
"hadoopConfig": {
    "dfs.nameservices": "testDfs",
    "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
    "dfs.namenode.rpc-address.testDfs.namenode1": "",
    "dfs.namenode.rpc-address.testDfs.namenode2": "",
    "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
```

- Required: no
- Default: none
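With HA configured this way, `defaultFS` can name the logical nameservice rather than a single NameNode. A sketch of the relevant part of the writer's `parameter` block, assuming the `testDfs` nameservice above and placeholder hosts `nn1-host`/`nn2-host`:

```json
{
    "defaultFS": "hdfs://testDfs",
    "hadoopConfig": {
        "dfs.nameservices": "testDfs",
        "dfs.ha.namenodes.testDfs": "namenode1,namenode2",
        "dfs.namenode.rpc-address.testDfs.namenode1": "nn1-host:8020",
        "dfs.namenode.rpc-address.testDfs.namenode2": "nn2-host:8020",
        "dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
    }
}
```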
**encoding**

- Description: character encoding of the written files.
- Required: no
- Default: `utf-8` (change with caution)
**haveKerberos**

- Description: whether Kerberos authentication is enabled; defaults to `false`. If set to `true`, `kerberosKeytabFilePath` and `kerberosPrincipal` become required.
- Required: no
- Default: false
**kerberosKeytabFilePath**

- Description: absolute path to the keytab file used for Kerberos authentication.
- Required: yes when `haveKerberos` is `true`
- Default: none
**kerberosPrincipal**

- Description: the Kerberos principal, e.g. `xxxx/hadoopclient@xxx.xxx`.
- Required: yes when `haveKerberos` is `true`
- Default: none
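A sketch of the Kerberos-related portion of the writer's `parameter` block; the keytab path and principal below are placeholders:

```json
{
    "haveKerberos": true,
    "kerberosKeytabFilePath": "/etc/security/keytabs/datax.keytab",
    "kerberosPrincipal": "datax/hadoopclient@EXAMPLE.COM"
}
```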
HdfsWriter currently supports most Hive data types; check that every type you use is covered.
The conversion between DataX internal types and Hive data types is listed below:

| DataX internal type | Hive data type |
| --- | --- |
| Long | TINYINT, SMALLINT, INT, BIGINT |
| Double | FLOAT, DOUBLE |
| String | STRING, VARCHAR, CHAR |
| Boolean | BOOLEAN |
| Date | DATE, TIMESTAMP |
Step 1: create the database and table in Hive. The HDFS location where Hive stores its data is configured in `conf/hive-site.xml` under the Hive installation directory and defaults to `/user/hive/warehouse`:

```xml
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
```
For the database/table creation syntax, refer to the Hive manual.
Examples: (1) create a table stored as TEXTFILE:
```sql
create database IF NOT EXISTS hdfswriter;
use hdfswriter;
create table text_table(
    col1  TINYINT,
    col2  SMALLINT,
    col3  INT,
    col4  BIGINT,
    col5  FLOAT,
    col6  DOUBLE,
    col7  STRING,
    col8  VARCHAR(10),
    col9  CHAR(10),
    col10 BOOLEAN,
    col11 DATE,
    col12 TIMESTAMP
)
row format delimited
fields terminated by '\t'
STORED AS TEXTFILE;
```
text_table is stored on HDFS at `/user/hive/warehouse/hdfswriter.db/text_table/`. A matching writer block is sketched below.
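A sketch of an HdfsWriter configuration for loading `text_table` (the `defaultFS` host and `fileName` are placeholders):

```json
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "hdfs://namenode-host:9000",
        "fileType": "text",
        "path": "/user/hive/warehouse/hdfswriter.db/text_table",
        "fileName": "text_table_load",
        "column": [
            { "name": "col1",  "type": "TINYINT" },
            { "name": "col2",  "type": "SMALLINT" },
            { "name": "col3",  "type": "INT" },
            { "name": "col4",  "type": "BIGINT" },
            { "name": "col5",  "type": "FLOAT" },
            { "name": "col6",  "type": "DOUBLE" },
            { "name": "col7",  "type": "STRING" },
            { "name": "col8",  "type": "VARCHAR" },
            { "name": "col9",  "type": "CHAR" },
            { "name": "col10", "type": "BOOLEAN" },
            { "name": "col11", "type": "DATE" },
            { "name": "col12", "type": "TIMESTAMP" }
        ],
        "writeMode": "append",
        "fieldDelimiter": "\t"
    }
}
```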
(2) create a table stored as ORC:
```sql
create database IF NOT EXISTS hdfswriter;
use hdfswriter;
create table orc_table(
    col1  TINYINT,
    col2  SMALLINT,
    col3  INT,
    col4  BIGINT,
    col5  FLOAT,
    col6  DOUBLE,
    col7  STRING,
    col8  VARCHAR(10),
    col9  CHAR(10),
    col10 BOOLEAN,
    col11 DATE,
    col12 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC;
```
orc_table is stored on HDFS at `/user/hive/warehouse/hdfswriter.db/orc_table/`.
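With the table in place, the sample job at the top of this document can be pointed at it; relative to that sample, only the writer's `path` (and, if desired, `fileName`) need to change (the `fileName` below is a placeholder):

```json
{
    "path": "/user/hive/warehouse/hdfswriter.db/orc_table",
    "fileName": "orc_table_load"
}
```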