Reading and Writing BOS with DataX
Last updated: 2024-08-15
DataX
DataX is an offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between a wide range of sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
Configuration
- Download and unpack DataX;
- Download BOS-HDFS, unpack it, and copy the jar into plugin/reader/hdfsreader/libs/ and plugin/writer/hdfswriter/libs/ under the DataX installation directory;
- Open the bin/datax.py script under the DataX installation directory and change the CLASS_PATH variable to:
Python
CLASS_PATH = ("%s/lib/*:%s/plugin/reader/hdfsreader/libs/*:%s/plugin/writer/hdfswriter/libs/*:.") % (DATAX_HOME, DATAX_HOME, DATAX_HOME)
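The plugin layout from the steps above can be sketched as shell commands. DATAX_HOME and the jar file name here are illustrative placeholders, and `touch` stands in for the actual BOS-HDFS download:

```shell
# Lay out the BOS-HDFS jar inside the DataX plugin tree.
# DATAX_HOME and the jar name are placeholders for this sketch;
# "touch" stands in for the real download step.
DATAX_HOME=./datax
touch bos-hdfs-sdk.jar
for side in reader/hdfsreader writer/hdfswriter; do
    mkdir -p "$DATAX_HOME/plugin/$side/libs"
    cp bos-hdfs-sdk.jar "$DATAX_HOME/plugin/$side/libs/"
done
```

The jar must sit under both plugin directories because the reader and the writer each load their classpath independently.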
Getting Started
Example
Read the testfile object from the {your bucket} bucket and write it to the {your other bucket} bucket.
testfile:
Bash
1 hello
2 bos
3 world
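The sample file can be created locally as shown below; upload it to the source bucket (for example with the BOS CLI or console) before running the job:

```shell
# Write the three sample rows: the first column is numeric (read as "long"
# by the reader), the second is a string, separated by a single space.
printf '1 hello\n2 bos\n3 world\n' > testfile
cat testfile
```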
bos2bos.json:
JSON
{
    "job": {
        "setting": {
            "speed": {
                "channel": 1
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/testfile",
                    "defaultFS": "bos://{your bucket}/",
                    "column": [
                        {
                            "index": 0,
                            "type": "long"
                        },
                        {
                            "index": 1,
                            "type": "string"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.bos.endpoint": "bj.bcebos.com",
                        "fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
                        "fs.bos.access.key": "{your ak}",
                        "fs.bos.secret.access.key": "{your sk}"
                    },
                    "fieldDelimiter": " "
                }
            },
            "writer": {
                "name": "hdfswriter",
                "parameter": {
                    "path": "/testtmp",
                    "fileName": "testfile.new",
                    "defaultFS": "bos://{your other bucket}/",
                    "column": [
                        {
                            "name": "col1",
                            "type": "string"
                        },
                        {
                            "name": "col2",
                            "type": "string"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "hadoopConfig": {
                        "fs.defaultFS": "",
                        "fs.bos.endpoint": "bj.bcebos.com",
                        "fs.bos.impl": "org.apache.hadoop.fs.bos.BaiduBosFileSystem",
                        "fs.bos.access.key": "{your ak}",
                        "fs.bos.secret.access.key": "{your sk}"
                    },
                    "fieldDelimiter": " ",
                    "writeMode": "append"
                }
            }
        }]
    }
}
Replace {your bucket}, the endpoint, {your sk}, and the other placeholders in the config as needed;
Either the reader alone or the writer alone may be configured to use BOS.
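Before submitting, it is worth checking the job file's syntax: DataX fails at startup on malformed JSON, and a missing comma inside hadoopConfig is an easy slip. The skeleton file below is only a minimal stand-in; run the same check against your real bos2bos.json:

```shell
# Validate a job file with Python's built-in json.tool; it reports the line
# and column of any syntax error. job-skeleton.json is a minimal stand-in.
cat > job-skeleton.json <<'EOF'
{"job": {"setting": {"speed": {"channel": 1}}, "content": []}}
EOF
python3 -m json.tool job-skeleton.json > /dev/null && echo "valid JSON"
```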
Result
Bash
python bin/datax.py bos2bos.json
After a successful run, it returns: