Commit 74b2a93 ("rewrite guide", 1 parent ca3605d)

1 file changed: +151 -118 lines

docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md
@@ -18,12 +18,15 @@ import PartnerBadge from '@theme/badges/PartnerBadge';

<PartnerBadge/>

Being able to analyze your logs in real time is critical for production applications.
ClickHouse excels at storing and analyzing log data due to its excellent compression (up to [170x](https://clickhouse.com/blog/log-compression-170x) for logs)
and its ability to aggregate large amounts of data quickly.

This guide shows you how to use the popular data pipeline [Vector](https://vector.dev/docs/about/what-is-vector/) to tail an Nginx log file and send it to ClickHouse.
The steps below are similar for tailing any type of log file.

**Prerequisites:**
- You already have ClickHouse up and running
- You have Vector installed

<VerticalStepper headerLevel="h2">
@@ -47,15 +50,16 @@ ENGINE = MergeTree()
ORDER BY tuple()
```

:::note
**ORDER BY** is set to **tuple()** (an empty tuple) as there is no need for a primary key yet.
:::

## Configure Nginx {#2--configure-nginx}

This step shows you how to configure Nginx logging.

1. The following `access_log` property sends logs to `/var/log/nginx/my_access.log` in the **combined** format.
This value goes in the `http` section of your `nginx.conf` file:

```bash
http {
@@ -70,125 +74,154 @@ http {

2. Be sure to restart Nginx if you had to modify `nginx.conf`.

3. Generate some log events in the access log by visiting pages on your web server.
Logs in the **combined** format look as follows:

```bash
192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET /favicon.ico HTTP/1.1" 404 555 "http://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
192.168.208.1 - - [12/Oct/2021:03:31:49 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
```
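Before wiring anything into a pipeline, it can help to confirm what the **combined** format contains. The following is a small Python sketch (not part of the guide's pipeline; the field names are my own) that parses one of the sample lines with a regular expression:

```python
import re

# Regex for Nginx's "combined" log format; group names are illustrative.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) (?P<client>\S+) (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d+) (?P<bytes_sent>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615 '
        '"-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')

# Extract the named fields into a dict.
fields = COMBINED.match(line).groupdict()
```

Each field of the sample line lands in its own key, which is exactly the structure the materialized view later recovers inside ClickHouse.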

## Configure Vector {#3-configure-vector}

Vector collects, transforms, and routes logs, metrics, and traces (referred to as **sources**) to many different vendors (referred to as **sinks**), including out-of-the-box compatibility with ClickHouse.
Sources and sinks are defined in a configuration file named **vector.toml**.

1. The following **vector.toml** file defines a **source** of type **file** that tails the end of **my_access.log**, and it also defines a **sink** as the **access_logs** table defined above:

```toml
[sources.nginx_logs]
type = "file"
include = [ "/var/log/nginx/my_access.log" ]
read_from = "end"

[sinks.clickhouse]
type = "clickhouse"
inputs = ["nginx_logs"]
endpoint = "http://clickhouse-server:8123"
database = "nginxdb"
table = "access_logs"
skip_unknown_fields = true
```

2. Start Vector using the configuration above. Visit the Vector [documentation](https://vector.dev/docs/) for more details on defining sources and sinks.

3. Verify that the access logs are being inserted into ClickHouse by running the following query. You should see the access logs in your table:

```sql
SELECT * FROM nginxdb.access_logs
```

<Image img={vector01} size="lg" border alt="View ClickHouse logs in table format" />
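Under the hood, Vector's ClickHouse sink delivers batches over ClickHouse's HTTP interface using the `JSONEachRow` format: one JSON object per line. The following self-contained Python sketch illustrates only the payload shape; it is an assumption-laden illustration, not Vector's actual implementation (real Vector events carry extra metadata fields, and the endpoint/database/table names simply mirror the config above):

```python
import json
from urllib.parse import quote

# Hypothetical events, shaped like the file source's output for illustration.
events = [
    {"message": '192.168.208.1 - - [12/Oct/2021:03:31:44 +0000] "GET / HTTP/1.1" 200 615'},
    {"message": '192.168.208.1 - - [12/Oct/2021:03:31:49 +0000] "GET / HTTP/1.1" 304 0'},
]

# JSONEachRow: one JSON object per line in the request body.
payload = "\n".join(json.dumps(e) for e in events)

# The INSERT is issued against the HTTP endpoint from vector.toml.
url = ("http://clickhouse-server:8123/?query="
       + quote("INSERT INTO nginxdb.access_logs FORMAT JSONEachRow"))
```

Because the table has a single `message` column and `skip_unknown_fields = true` is set, any extra keys in each JSON object would simply be ignored on insert.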

## Parse the Logs {#4-parse-the-logs}

Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis.
We'll next look at how to parse the log events using a [materialized view](/materialized-view/incremental-materialized-view).

A **materialized view** functions similarly to an insert trigger in SQL. When rows of data are inserted into a source table, the materialized view applies a transformation to those rows and inserts the results into a target table.
The materialized view can be configured to produce a parsed representation of the log events in **access_logs**.
An example of one such log event is shown below:

```bash
192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
```
There are various functions in ClickHouse to parse the above string. The [`splitByWhitespace`](/sql-reference/functions/splitting-merging-functions#splitByWhitespace) function parses a string by whitespace and returns each token in an array.
To demonstrate, run the following command:

```sql title="Query"
SELECT splitByWhitespace('192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')
```

```text title="Response"
["192.168.208.1","-","-","[12/Oct/2021:15:32:43","+0000]","\"GET","/","HTTP/1.1\"","304","0","\"-\"","\"Mozilla/5.0","(Macintosh;","Intel","Mac","OS","X","10_15_7)","AppleWebKit/537.36","(KHTML,","like","Gecko)","Chrome/93.0.4577.63","Safari/537.36\""]
```

A few of the tokens have extra characters, and the user agent (the browser details) should not have been split up at all, but the resulting array is close to what is needed.
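If you want to see the same tokenization outside ClickHouse, Python's `str.split()` (which splits on runs of whitespace) behaves like `splitByWhitespace` for this line. A quick sketch:

```python
line = ('192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')

# Split on runs of whitespace, like splitByWhitespace.
tokens = line.split()
```

Note the stray characters that the later steps clean up: `tokens[5]` is `"GET` with a leading quote, and the user agent is scattered across the last 13 tokens.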
Similar to `splitByWhitespace`, the [`splitByRegexp`](/sql-reference/functions/splitting-merging-functions#splitByRegexp) function splits a string into an array based on a regular expression.
Run the following command, which returns two strings:

```sql
SELECT splitByRegexp('\S \d+ "([^"]*)"', '192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')
```

Notice that the second string returned is the user agent successfully parsed from the log:

```text
["192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] \"GET / HTTP/1.1\" 30"," \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36\""]
```
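The same split can be mimicked in Python with `re.split`. One difference to note: Python's `re.split` also returns the text of any capturing group, so this sketch drops the parentheses to get the same two-element result, then strips the surrounding whitespace and quotes from the second piece:

```python
import re

line = ('192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')

# Same delimiter as the splitByRegexp call, but without the capturing group.
parts = re.split(r'\S \d+ "[^"]*"', line)

# The second piece is the user agent, wrapped in a leading space and quotes.
user_agent = parts[1].strip().strip('"')
```

The delimiter consumes the `4 0 "-"` fragment (the last byte of the status code, the bytes sent, and the referer), which is why the first piece ends with `30` rather than `304`.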
Before looking at the final `CREATE MATERIALIZED VIEW` command, let's view a couple more functions used to clean up the data.
For example, the value of `RequestMethod` is `"GET`, which contains an unwanted double quote.
You can use the [`trim`](/sql-reference/functions/string-functions#trim) function to remove the double quote:

```sql
SELECT trim(LEADING '"' FROM '"GET')
```
The time string has a leading square bracket, and is also not in a format that ClickHouse can parse into a date.
However, if we replace the first colon (**:**), the one between the date and the time, with a space, then the parsing works great:

```sql
SELECT parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM '[12/Oct/2021:15:32:43'), ':', ' '))
```
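The same rewrite can be checked in Python: strip the bracket, swap only the first colon for a space, and the result parses cleanly as a timestamp. This is a sketch of the string transformation, not of `parseDateTimeBestEffort` itself:

```python
from datetime import datetime

raw = '[12/Oct/2021:15:32:43'

# trim(LEADING '[' ...) then replaceOne(..., ':', ' '): only the first colon changes.
cleaned = raw.lstrip('[').replace(':', ' ', 1)

# The cleaned string now matches a conventional day/month/year time layout.
parsed = datetime.strptime(cleaned, '%d/%b/%Y %H:%M:%S')
```

Using `replace(..., 1)` matters: replacing every colon would also mangle the `15:32:43` time portion.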
We are now ready to define the materialized view.
The definition below includes `POPULATE`, which means the existing rows in **access_logs** will be processed and inserted right away.
Run the following SQL statement:

```sql
CREATE MATERIALIZED VIEW nginxdb.access_logs_view
(
    RemoteAddr String,
    Client String,
    RemoteUser String,
    TimeLocal DateTime,
    RequestMethod String,
    Request String,
    HttpVersion String,
    Status Int32,
    BytesSent Int64,
    UserAgent String
)
ENGINE = MergeTree()
ORDER BY RemoteAddr
POPULATE AS
WITH
    splitByWhitespace(message) AS split,
    splitByRegexp('\S \d+ "([^"]*)"', message) AS referer
SELECT
    split[1] AS RemoteAddr,
    split[2] AS Client,
    split[3] AS RemoteUser,
    parseDateTimeBestEffort(replaceOne(trim(LEADING '[' FROM split[4]), ':', ' ')) AS TimeLocal,
    trim(LEADING '"' FROM split[6]) AS RequestMethod,
    split[7] AS Request,
    trim(TRAILING '"' FROM split[8]) AS HttpVersion,
    split[9] AS Status,
    split[10] AS BytesSent,
    trim(BOTH '"' FROM referer[2]) AS UserAgent
FROM
    (SELECT message FROM nginxdb.access_logs)
```
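As a sanity check of the view's logic, here is a rough Python equivalent that applies the same splits and trims to a single log line. ClickHouse's 1-based array indexes become 0-based here, the `int()` casts stand in for the `Int32`/`Int64` columns, and the user-agent cleanup also strips the stray leading space that the SQL `trim(BOTH '"' ...)` would leave behind, so treat it as a sketch rather than an exact replica:

```python
import re
from datetime import datetime

def parse_access_log(message):
    split = message.split()                          # like splitByWhitespace
    referer = re.split(r'\S \d+ "[^"]*"', message)   # like splitByRegexp (capture group dropped)
    return {
        "RemoteAddr":    split[0],
        "Client":        split[1],
        "RemoteUser":    split[2],
        "TimeLocal":     datetime.strptime(
            split[3].lstrip('[').replace(':', ' ', 1), '%d/%b/%Y %H:%M:%S'),
        "RequestMethod": split[5].lstrip('"'),
        "Request":       split[6],
        "HttpVersion":   split[7].rstrip('"'),
        "Status":        int(split[8]),
        "BytesSent":     int(split[9]),
        "UserAgent":     referer[1].strip().strip('"'),
    }

row = parse_access_log(
    '192.168.208.1 - - [12/Oct/2021:15:32:43 +0000] "GET / HTTP/1.1" 304 0 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"')
```

Running this against the sample line yields the same columns the materialized view produces, which makes it easy to spot-check the SQL expressions one at a time.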
Now verify it worked.
You should see the access logs nicely parsed into columns:

```sql
SELECT * FROM nginxdb.access_logs_view
```

<Image img={vector02} size="lg" border alt="View parsed ClickHouse logs in table format" />

:::note
The lesson above stored the data in two tables, but you could change the initial `nginxdb.access_logs` table to use the [`Null`](/engines/table-engines/special/null) table engine.
The parsed data will still end up in the `nginxdb.access_logs_view` table, but the raw data will not be stored in a table.
:::

</VerticalStepper>
194227
