Iceberg Catalog
Quick Start with PALO & Iceberg
Usage Limitations
- Supports Iceberg V1/V2 table formats.
- Supports Position Delete.
- Supports Equality Delete since version 2.1.3.
- Supports the Parquet file format.
- Supports the ORC file format since version 2.1.3.
Creating a Catalog
Creating a Catalog Based on Hive Metastore
This is largely identical to a Hive Catalog, so only a simple example is given here.
CREATE CATALOG iceberg PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
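Once the catalog is created, you can switch to it and query its tables like any internal table. A minimal usage sketch follows; the database and table names are placeholders:

-- Switch to the new catalog and browse its databases
SWITCH iceberg;
SHOW DATABASES;
-- Query a hypothetical Iceberg table
SELECT * FROM db1.table1 LIMIT 10;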
Creating a Catalog Based on the Iceberg API
When accessing metadata via the Iceberg API, services such as Hadoop File System, Hive, REST, Glue, and DLF can serve as the Iceberg catalog.
Hadoop Catalog
Note: the warehouse path must point to the level above the Database path. For example, if your table path is s3://bucket/path/to/db1/table1, then warehouse should be s3://bucket/path/to/.
-- Hadoop Catalog on HDFS
CREATE CATALOG iceberg_hadoop PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-host:8020/dir/key'
);

-- Hadoop Catalog on HDFS with high availability enabled
CREATE CATALOG iceberg_hadoop_ha PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 'hdfs://your-nameservice/dir/key',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);

-- Hadoop Catalog on S3
CREATE CATALOG iceberg_s3 PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type' = 'hadoop',
    'warehouse' = 's3://bucket/dir/key',
    's3.endpoint' = 's3.us-east-1.amazonaws.com',
    's3.access_key' = 'ak',
    's3.secret_key' = 'sk'
);
Hive Metastore
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
    'hadoop.username' = 'hive',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
AWS Glue
When connecting to Glue from a non-EC2 environment, copy the ~/.aws directory from the EC2 environment into the current environment. Alternatively, you can download and configure the AWS CLI tool, which also creates a .aws directory under the current user's home directory. Please upgrade to PALO 2.1.7 or 3.0.3 or later to use this feature.
-- Using access key and secret key
CREATE CATALOG glue2 PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com/",
    "client.credentials-provider" = "com.amazonaws.glue.catalog.credentials.ConfigAWSProvider",
    "client.credentials-provider.glue.access_key" = "ak",
    "client.credentials-provider.glue.secret_key" = "sk"
);
- For details on Iceberg properties, see Iceberg Glue Catalog.
- If client.credentials-provider is not specified, PALO uses the default DefaultAWSCredentialsProviderChain, which reads credentials from system environment variables or from the InstanceProfile; see the sketch after this list.
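In that case the catalog can be created without explicit credentials. A minimal sketch, assuming the default credential chain is already configured in the environment (the catalog name glue_default is a placeholder):

-- No client.credentials-provider given: PALO falls back to DefaultAWSCredentialsProviderChain
CREATE CATALOG glue_default PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type" = "glue",
    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com/"
);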
Alibaba Cloud DLF
See the Alibaba Cloud DLF Catalog configuration.
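For orientation, a DLF-backed Iceberg catalog is configured through dlf.* properties. The following is only a rough sketch; the endpoint, region, and uid values are placeholders, and the linked page is the authoritative reference:

CREATE CATALOG iceberg_dlf PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='dlf',
    -- Placeholder values; take the real ones from your DLF console
    'dlf.proxy.mode' = 'DLF_ONLY',
    'dlf.endpoint' = 'datalake.cn-beijing.aliyuncs.com',
    'dlf.region' = 'cn-beijing',
    'dlf.uid' = 'your-dlf-uid',
    'dlf.access_key' = 'ak',
    'dlf.secret_key' = 'sk'
);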
REST Catalog
This method requires a REST service to be available in advance, and users must implement the REST interface that serves Iceberg metadata.
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181'
);
If the data is stored on HDFS with high availability enabled, the HDFS HA configuration must also be added to the catalog:
CREATE CATALOG iceberg PROPERTIES (
    'type'='iceberg',
    'iceberg.catalog.type'='rest',
    'uri' = 'http://172.21.0.1:8181',
    'dfs.nameservices'='your-nameservice',
    'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
    'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.1:8020',
    'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.2:8020',
    'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Google Dataproc Metastore
CREATE CATALOG iceberg PROPERTIES (
    "type"="iceberg",
    "iceberg.catalog.type"="hms",
    "hive.metastore.uris" = "thrift://172.21.0.1:9083",
    "gs.endpoint" = "https://storage.googleapis.com",
    "gs.region" = "us-east-1",
    "gs.access_key" = "ak",
    "gs.secret_key" = "sk",
    "use_path_style" = "true"
);
hive.metastore.uris: the endpoint exposed by the Dataproc Metastore service; obtain it from the Metastore management page (Dataproc Metastore Services).
Iceberg On Object Storage
If the data is stored on S3, the following parameters can be used in the properties:

"s3.access_key" = "ak"
"s3.secret_key" = "sk"
"s3.endpoint" = "s3.us-east-1.amazonaws.com"
"s3.region" = "us-east-1"

If the data is stored on Alibaba Cloud OSS:

"oss.access_key" = "ak"
"oss.secret_key" = "sk"
"oss.endpoint" = "oss-cn-beijing-internal.aliyuncs.com"
"oss.region" = "oss-cn-beijing"

If the data is stored on Tencent Cloud COS:

"cos.access_key" = "ak"
"cos.secret_key" = "sk"
"cos.endpoint" = "cos.ap-beijing.myqcloud.com"
"cos.region" = "ap-beijing"

If the data is stored on Huawei Cloud OBS:

"obs.access_key" = "ak"
"obs.secret_key" = "sk"
"obs.endpoint" = "obs.cn-north-4.myhuaweicloud.com"
"obs.region" = "cn-north-4"
Example
-- MinIO & REST Catalog
CREATE CATALOG `iceberg` PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "uri" = "http://10.0.0.1:8181",
    "warehouse" = "s3://bucket",
    "token" = "token123456",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "s3.endpoint" = "http://10.0.0.1:9000",
    "s3.region" = "us-east-1"
);
Column Type Mapping
Iceberg Type | PALO Type
--- | ---
boolean | boolean
int | int
long | bigint
float | float
double | double
decimal(p,s) | decimal(p,s)
date | date
uuid | string
timestamp (timestamp without timezone) | datetime(6)
timestamptz (timestamp with timezone) | datetime(6)
string | string
fixed(L) | char(L)
binary | string
struct | struct (supported since 2.1.3)
map | map (supported since 2.1.3)
list | array
time | unsupported
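To check how the columns of a particular Iceberg table were mapped, you can describe it from PALO; the catalog, database, and table names here are placeholders:

-- Shows each column with its mapped PALO type
DESC iceberg.db1.table1;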
Time Travel
Reading a specified snapshot of an Iceberg table is supported. Every write operation on an Iceberg table creates a new snapshot, and by default read requests only read the latest snapshot. Using the FOR TIME AS OF and FOR VERSION AS OF clauses, you can read historical data based on a snapshot ID or on the time the snapshot was created. Examples:
SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";
SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038966572;
In addition, the iceberg_meta table function can be used to query snapshot information for a given table.
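For example, a query like the following lists a table's snapshots, whose ids can then be fed to FOR VERSION AS OF; the fully qualified table name is a placeholder:

-- List all snapshots of an Iceberg table
SELECT * FROM iceberg_meta(
    "table" = "iceberg.db1.table1",
    "query_type" = "snapshots"
);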