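The ODPSWriter plugin inserts or updates data in ODPS. It is mainly intended for ETL developers who import business data into ODPS, and it is suited to transfers at the GB to TB scale. For PB-scale transfers, use the dt task tool instead.
Under the hood, ODPSWriter writes to ODPS through DT Tunnel. For more technical details about ODPS, see the ODPS home page https://data.aliyun.com/product/odps and the ODPS product documentation https://help.aliyun.com/product/27797.html
DataX3 currently depends on the following SDK version: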
<dependency>
<groupId>com.aliyun.odps</groupId>
<artifactId>odps-sdk-core-internal</artifactId>
<version>0.13.2</version>
</dependency>
Note: to use the ODPSReader/Writer plugins, you must run JDK 1.6-32 or later. Use java -version to check your Java version.
The following job generates data in memory and imports it into ODPS.
{
"job": {
"setting": {
"speed": {
"byte": 1048576
}
},
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [
{
"value": "DataX",
"type": "string"
},
{
"value": "test",
"type": "bytes"
}
],
"sliceRecordCount": 100000
}
},
"writer": {
"name": "odpswriter",
"parameter": {
"project": "chinan_test",
"table": "odps_write_test00_partitioned",
"partition": "school=SiChuan-School,class=1",
"column": [
"id",
"name"
],
"accessId": "xxx",
"accessKey": "xxxx",
"truncate": true,
"odpsServer": "http://sxxx/api",
"tunnelServer": "http://xxx"
}
}
}
]
}
}
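As a minimal sketch of how such a job file could be generated instead of written by hand, the Python below assembles the same structure and prints it as JSON. The helpers build_partition and build_job are hypothetical names introduced here for illustration; they are not part of DataX, and the field values are the placeholders from the sample above.

```python
import json


def build_partition(spec):
    """Join partition key/value pairs into the 'k1=v1,k2=v2' form used in the sample."""
    return ",".join(f"{key}={value}" for key, value in spec.items())


def build_job(writer_parameter, record_count=100000, byte_limit=1048576):
    """Assemble a streamreader -> odpswriter job dict (hypothetical helper)."""
    reader = {
        "name": "streamreader",
        "parameter": {
            "column": [
                {"value": "DataX", "type": "string"},
                {"value": "test", "type": "bytes"},
            ],
            "sliceRecordCount": record_count,
        },
    }
    writer = {"name": "odpswriter", "parameter": writer_parameter}
    return {
        "job": {
            "setting": {"speed": {"byte": byte_limit}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


job = build_job({
    "project": "chinan_test",
    "table": "odps_write_test00_partitioned",
    "partition": build_partition({"school": "SiChuan-School", "class": 1}),
    "column": ["id", "name"],
    "accessId": "xxx",       # placeholder, as in the sample
    "accessKey": "xxxx",     # placeholder, as in the sample
    "truncate": True,
    "odpsServer": "http://sxxx/api",
    "tunnelServer": "http://xxx",
})

print(json.dumps(job, indent=2))
```

Building the partition string from a dict keeps the key order explicit, which matters for multi-level partitions such as school and class.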
accessId
accessKey
project
table
partition
column
truncate
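Description: Setting "truncate": true makes ODPSWriter writes idempotent. When a job fails and is rerun, ODPSWriter clears the previously written data and imports the new data, so the result is the same after every rerun.
The truncate option is not atomic! ODPS SQL cannot provide atomicity. When multiple jobs clear the same Table/Partition at the same time, race conditions can occur, so take care. To avoid this, we recommend that multiple jobs do not run DDL against the same partition concurrently, or that you create the partition before the concurrent jobs start.
Required: yes
Default: none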
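The idempotency described for truncate can be illustrated with a toy in-memory model. This is only a sketch: ToyTable is a hypothetical stand-in and does not call the ODPS SDK, and the append behavior when truncate is false is an assumption of this model rather than something stated above.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a partitioned table; not the ODPS API."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def write(self, partition, rows, truncate):
        if truncate:
            # Clear whatever an earlier (possibly failed) run left behind.
            self.partitions[partition] = []
        self.partitions[partition].extend(rows)


rows = [(1, "DataX"), (2, "test")]

# First attempt writes only part of the data (simulated failure), then the job is rerun.
with_truncate = ToyTable()
with_truncate.write("pt=1", rows[:1], truncate=True)
with_truncate.write("pt=1", rows, truncate=True)

without_truncate = ToyTable()
without_truncate.write("pt=1", rows[:1], truncate=False)
without_truncate.write("pt=1", rows, truncate=False)

print(len(with_truncate.partitions["pt=1"]))     # 2: the rerun replaced the partial data
print(len(without_truncate.partitions["pt=1"]))  # 3: the partial data was kept and duplicated
```

The model also makes the concurrency warning concrete: clearing and writing are two separate steps, so two jobs interleaving them on the same partition can erase each other's rows.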
odpsServer
tunnelServer
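Like ODPSReader, ODPSWriter currently supports most ODPS types, but a few types are not supported, so check your column types.
The type conversions that ODPSWriter applies for ODPS types are listed below:
DataX internal type | ODPS data type |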
---|---|
Long | bigint |
Double | double |
String | string |
Date | datetime |
Boolean | bool |
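One way to use the conversion table is to check a target schema before submitting a job. The sketch below copies the mapping from the table and reports columns whose type is not listed. The function unsupported_columns and the example schema are hypothetical; a type being absent from the table only means it is not listed here, and type-name aliases are not handled.

```python
# Mapping copied from the conversion table above.
DATAX_TO_ODPS = {
    "Long": "bigint",
    "Double": "double",
    "String": "string",
    "Date": "datetime",
    "Boolean": "bool",
}

# Reverse lookup: ODPS type name -> DataX internal type.
ODPS_TO_DATAX = {odps: datax for datax, odps in DATAX_TO_ODPS.items()}


def unsupported_columns(schema):
    """Return (column, odps_type) pairs whose type is not in the table."""
    return [
        (name, odps_type)
        for name, odps_type in schema
        if odps_type.lower() not in ODPS_TO_DATAX
    ]


# Hypothetical schema used only for illustration.
example_schema = [("id", "bigint"), ("name", "string"), ("price", "decimal")]

print(unsupported_columns(example_schema))
```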
Table creation statement:
use cdo_datasync;
create table datax3_odpswriter_perf_10column_1kb_00(
s_0 string,
bool_1 boolean,
bi_2 bigint,
dt_3 datetime,
db_4 double,
s_5 string,
s_6 string,
s_7 string,
s_8 string,
s_9 string
)PARTITIONED by (pt string,year string);
A single row looks like this:
s_0 : 485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*
bool_1 : true
bi_2 : 1696248667889
dt_3 : 2013-07-06 00:00:00
db_4 : 3.141592653578
s_5 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209
s_6 : 100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209
s_7 : 100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209
s_8 : 100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209
s_9 : 12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209
The machine running DataX has the following specifications:
The job configuration is:
{
"job": {
"setting": {
"speed": {
"channel": "1,2,4,5,6,8,16,32,64"
}
},
"content": [
{
"reader": {
"name": "odpsreader",
"parameter": {
"accessId": "******************************",
"accessKey": "*****************************",
"column": [
"*"
],
"partition": [
"pt=20141010000000,year=2014"
],
"odpsServer": "http://service.odps.aliyun.com/api",
"project": "cdo_datasync",
"table": "datax3_odpswriter_perf_10column_1kb_00",
"tunnelServer": "http://dt.odps.aliyun.com"
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"print": false,
"column": [
{
"value": "485924f6ab7f272af361cd3f7f2d23e0d764942351#$%^&fdafdasfdas%%^(*&^^&*"
},
{
"value": "true",
"type": "bool"
},
{
"value": "1696248667889",
"type": "long"
},
{
"type": "date",
"value": "2013-07-06 00:00:00",
"dateFormat": "yyyy-MM-dd HH:mm:ss"
},
{
"value": "3.141592653578",
"type": "double"
},
{
"value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209"
},
{
"value": "100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11fdsafdsfdsa209"
},
{
"value": "100DAFDSAFDSAHOFJDPSAWIFDISHAF;dsadsafdsahfdsajf;dsfdsa;FJDSAL;11209"
},
{
"value": "100dafdsafdsahofjdpsawifdishaf;DSADSAFDSAHFDSAJF;dsfdsa;fjdsal;11209"
},
{
"value": "12~!2345100dafdsafdsahofjdpsawifdishaf;dsadsafdsahfdsajf;dsfdsa;fjdsal;11209"
}
]
}
}
}
]
}
}
Concurrent tasks | blockSizeInMB | DataX speed (Rec/s) | DataX throughput (MB/s) | NIC throughput (MB/s) | DataX load |
---|---|---|---|---|---|
1 | 32 | 30303 | 13.03 | 14.5 | 0.12 |
1 | 64 | 38461 | 16.54 | 16.5 | 0.44 |
1 | 128 | 46454 | 20.55 | 26.7 | 0.47 |
1 | 256 | 52631 | 22.64 | 26.7 | 0.47 |
1 | 512 | 58823 | 25.30 | 28.7 | 0.44 |
4 | 32 | 114816 | 49.38 | 55.3 | 0.75 |
4 | 64 | 147577 | 63.47 | 71.3 | 0.82 |
4 | 128 | 177744 | 76.45 | 83.2 | 0.97 |
4 | 256 | 173913 | 74.80 | 80.1 | 1.01 |
4 | 512 | 200000 | 86.02 | 95.1 | 1.41 |
8 | 32 | 204480 | 87.95 | 92.7 | 1.16 |
8 | 64 | 294224 | 126.55 | 135.3 | 1.65 |
8 | 128 | 365475 | 157.19 | 163.7 | 2.89 |
8 | 256 | 394713 | 169.83 | 176.7 | 2.72 |
8 | 512 | 241691 | 103.95 | 125.7 | 2.29 |
16 | 32 | 420838 | 181.01 | 198.0 | 2.56 |
16 | 64 | 458144 | 197.05 | 217.4 | 2.85 |
16 | 128 | 443219 | 190.63 | 210.5 | 3.29 |
16 | 256 | 315235 | 135.58 | 140.0 | 0.95 |
16 | 512 | OOM | | | |
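To read the scaling behavior out of the table, the record rates at blockSizeInMB = 32 can be compared against the single-task run. The sketch below does that arithmetic. The figures are copied from the table; the function name scaling and the notion of "efficiency" (speedup divided by task count) are introduced here only for illustration.

```python
# Rec/s at blockSizeInMB = 32, copied from the results table.
REC_PER_SEC = {1: 30303, 4: 114816, 8: 204480, 16: 420838}


def scaling(rec_per_sec):
    """Return {tasks: (speedup, efficiency)} relative to the single-task run."""
    baseline = rec_per_sec[1]
    result = {}
    for tasks, speed in rec_per_sec.items():
        speedup = speed / baseline
        result[tasks] = (speedup, speedup / tasks)
    return result


for tasks, (speedup, efficiency) in scaling(REC_PER_SEC).items():
    print(f"{tasks:>2} tasks: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

At this block size the speedup stays reasonably close to the task count, which suggests that adding concurrent tasks is an effective way to raise throughput in this test setup.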
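Notes:
With blockSizeInMB set to 32 or 64, throughput is fairly stable. An overly large blockSizeInMB can cause throughput fluctuations and out-of-memory (OOM) errors.
Solution: ask the owner of the ODPS project to grant permissions to the user's cloud account. Grant statement:
grant Describe,Select,Alter,Update on table [tableName] to user XXX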
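Importing data into ODPS through a view is currently not supported. A view is a non-materialized data storage object in ODPS, so it is technically impossible to load data into a view.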