ETL Review (1)

ETL, Part 1

Background

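I've been job hunting lately and want to apply for some big data positions. The trouble is that big data was never the main part of my previous jobs, so my experience looks thin on paper. It goes back to 2015 with Spark + Hive, then Storm + HBase in 2017, and most recently Flink + ClickHouse (CK). I figure that with some extra effort I might just be able to reach a big-data role.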

Preparing the base environment

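I'm quietly borrowing the PC I built for my wife to play LOL and using it as a small server. It has far more power than LOL needs, so this won't get in her way.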

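I had already installed a virtual machine on that PC while tinkering with other tech, so there was nothing more to fuss over: install Ubuntu in it, then Docker plus Docker Compose, and that's about it.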

Docker registry mirror

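This deserves its own note. By default Docker pulls from a registry hosted outside China, so right after installation it is practically unusable here; you need to configure a mirror inside China.
Verify the mirror before relying on it, for example with curl: check that it is reachable and that it allows access without authentication. I lost the better part of half an hour to the Alibaba Cloud image accelerator. I followed the official configuration and kept getting 403, and only later found out that Alibaba had updated its terms a few months earlier. Roughly, the accelerator no longer supports direct external use and only serves Alibaba Cloud's own products.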

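# 1. Verify the mirror (replace https://xxxx with the mirror URL)
curl -I https://xxxx
# 2. Add the mirror
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
  "registry-mirrors": [
    "https://xxxx"
  ]
}
EOF

# 3. Apply the change
sudo systemctl daemon-reload
sudo systemctl restart docker

# 4. Check that the mirror was picked up (look for "Registry Mirrors" in the output)
docker info

# 5. Pull an image to verify
docker pull xxx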
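
The 403 I ran into can be caught before touching daemon.json. Here is a minimal sketch of the check in step 1, assuming the mirror exposes the standard registry API under /v2/. The helper names `classify_status` and `check_mirror` are made up for this post, and the verdicts are a rough rule of thumb rather than anything Docker defines.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Docker): turn the HTTP status returned by
# a mirror's /v2/ endpoint into a rough verdict.
classify_status() {
  case "$1" in
    200|401) echo "reachable" ;;    # 401 usually just means "fetch a token first"
    403)     echo "blocked" ;;      # what the Alibaba accelerator returned for me
    000)     echo "unreachable" ;;  # curl reports 000 when no HTTP response arrived
    *)       echo "unknown" ;;
  esac
}

# Probe one mirror URL and print its status code together with the verdict.
check_mirror() {
  local url="${1%/}" code
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url/v2/")
  echo "$url -> $code ($(classify_status "$code"))"
}

# Usage (https://xxxx is the same placeholder as above):
#   check_mirror https://xxxx
```

A "reachable" verdict is only a first filter; the real proof is still the docker pull in step 5.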

Installing ClickHouse (CK)

# 1. Pull the ClickHouse image
docker pull clickhouse/clickhouse-server
# 2. Create the directories ClickHouse needs
mkdir -p /data/clickhouse/data /data/clickhouse/config /data/clickhouse/logs

# 3. ClickHouse server configuration
cat > /data/clickhouse/config/config.xml << EOF
<?xml version="1.0"?>
<yandex>
    <logger>
        <level>information</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
    </logger>

    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <interserver_http_port>9009</interserver_http_port>

    <listen_host>0.0.0.0</listen_host>

    <max_connections>4096</max_connections>
    <keep_alive_timeout>10</keep_alive_timeout>
    <max_concurrent_queries>100</max_concurrent_queries>
    <uncompressed_cache_size>8589934592</uncompressed_cache_size>
    <mark_cache_size>5368709120</mark_cache_size>

    <path>/var/lib/clickhouse/</path>
    <tmp_path>/var/lib/clickhouse/tmp/</tmp_path>

    <user_directories>
        <users_xml>
            <path>/etc/clickhouse-server/users.xml</path>
        </users_xml>
    </user_directories>

    <timezone>UTC</timezone>
</yandex>
EOF

# 4. ClickHouse user management
cat > /data/clickhouse/config/users.xml << EOF
<?xml version="1.0"?>
<yandex>
    <users>
        <default>
            <password>yourpassword</password>
            <networks>
                <ip>::/0</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>
    </users>

    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>
    </profiles>

    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</yandex>
EOF

# 5. Run the container

docker run -d \
  --name clickhouse-server \
  --ulimit nofile=262144:262144 \
  -p 8123:8123 \
  -p 9000:9000 \
  -p 9009:9009 \
  -v /data/clickhouse/data:/var/lib/clickhouse \
  -v /data/clickhouse/config/config.xml:/etc/clickhouse-server/config.xml \
  -v /data/clickhouse/config/users.xml:/etc/clickhouse-server/users.xml \
  -v /data/clickhouse/logs:/var/log/clickhouse-server \
  --restart=always \
  clickhouse/clickhouse-server:latest

# 6. Test that it works (from inside the container)
docker exec -it clickhouse-server clickhouse-client --password yourpassword

# 7. Make it reachable from outside. I don't want to type that long run command every time,
#    and it will make managing other containers easier later, so install docker compose
apt update
apt install -y docker-compose
# 8. Write the compose file; don't forget the environment section
nano /data/clickhouse/docker-compose.yml

# Contents of docker-compose.yml:
version: '3'
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    container_name: clickhouse-server
    restart: always
    ports:
      - "8123:8123"
      - "9000:9000"
      - "9009:9009"
    volumes:
      - /data/clickhouse/data:/var/lib/clickhouse
      - /data/clickhouse/config/config.xml:/etc/clickhouse-server/config.xml
      - /data/clickhouse/config/users.xml:/etc/clickhouse-server/users.xml
      - /data/clickhouse/logs:/var/log/clickhouse-server
    environment:
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_PASSWORD=xxxx
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

# Remove the container started in step 5 (it uses the same name), then start it again via compose
docker rm -f clickhouse-server
cd /data/clickhouse
docker-compose up -d
# 9. Check that it responds; a working server answers with 1
curl "http://xx:8123/?user=default&password=xx&query=SELECT%201"
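
The one-off curl in step 9 works, but it puts the password in the URL and the query has to be URL-encoded by hand. Here is a minimal sketch of a friendlier check, assuming the HTTP interface on port 8123 as configured above. The helper names `ch_wait` and `ch_query` and the `CH_*` variables are made up for this post; `/ping` and the `X-ClickHouse-User` / `X-ClickHouse-Key` headers come from ClickHouse's HTTP interface.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; host, user and password are placeholders to override.
CH_HOST="${CH_HOST:-localhost}"
CH_USER="${CH_USER:-default}"
CH_PASSWORD="${CH_PASSWORD:-yourpassword}"

# Poll /ping until the server answers "Ok." or the attempts run out.
# Arguments: host (default $CH_HOST), number of attempts (default 30).
ch_wait() {
  local host="${1:-$CH_HOST}" tries="${2:-30}" i
  for ((i = 0; i < tries; i++)); do
    if [ "$(curl -s -m 2 "http://$host:8123/ping")" = "Ok." ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Send SQL in the POST body and the credentials in headers,
# so neither of them has to be URL-encoded.
ch_query() {
  curl -sS "http://$CH_HOST:8123/" \
    -H "X-ClickHouse-User: $CH_USER" \
    -H "X-ClickHouse-Key: $CH_PASSWORD" \
    --data-binary "$1"
}

# Usage:
#   ch_wait && ch_query 'SELECT version()'
```

Waiting on /ping first is handy right after docker-compose up -d, since the container can be running a moment before the server accepts queries.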


To be continued…

References

https://www.coderjia.cn/archives/dba3f94c-a021-468a-8ac6-e840f85867ea
https://hub.docker.com/r/clickhouse/clickhouse-server/