diff --git a/queries/cdmq/README.md b/queries/cdmq/README.md index 215d442f..9c5d034c 100644 --- a/queries/cdmq/README.md +++ b/queries/cdmq/README.md @@ -1,26 +1,32 @@ # cdmq ## Introduction -The contents of this directory contain a collection of scripts in Javascript intended to be executed with [node.js](https://nodejs.org). These scripts get data from an OpenSearch instance. The data must be in Common Data Format. which is documented in this project under [templates](../templates). The scripts here are meant to help inspect, compare, and export data from benchmarks and performance & resource-utilization tools, in order to report and investigate performance. -In order to generate this data, you must run a benchmark via automation framework which uses the Common Data Format and index that data into OpenSearch. One of those automation frameworks is the [crucible](https://github.com/perftool-incubator/crucible) project. A subproject of crucible, [crucible-examples](https://github.com/perftool-incubator/crucible-examples), includes scenarios to run some of these benchmarks. +This directory contains a collection of scripts in JavaScript intended to be executed with [node.js](https://nodejs.org). These scripts get data from an OpenSearch instance. The data must be in the Common Data Format, which is documented in this project under [templates](../templates). The scripts here are meant to help inspect, compare, and export data from benchmarks and performance & resource-utilization tools, in order to report and investigate performance. + +In order to generate this data, you must run a benchmark via an automation framework which uses the Common Data Format and indexes that data into OpenSearch. One of those automation frameworks is the [crucible](https://github.com/perftool-incubator/crucible) project. A subproject of crucible, [crucible-examples](https://github.com/perftool-incubator/crucible-examples), includes scenarios to run some of these benchmarks.
## Terms -Many of the scripts refer to different terms we associate with either running a benchmark or examining the resulting data, and these terms are not always universally known or agreed upon for specific benchmarks (like uperf and fio), or even benchmark automation frameworks, Nevertheless, the CommonDataModel project has adopted the following terms, which originate from the crucible project. You will need to become familiar with these terms in order to use these scripts: + +Many of the scripts refer to different terms we associate with either running a benchmark or examining the resulting data, and these terms are not always universally known or agreed upon for specific benchmarks (like uperf and fio) or even benchmark automation frameworks. Nevertheless, the CommonDataModel project has adopted the following terms, which originate from the crucible project. You will need to become familiar with these terms in order to use these scripts: - benchmark: A specific benchmark, like fio, uperf, trafficgen, or oslat. -- run: An invocation of a command, like `crucible run`, which facilitates the execution of a benchmark, often running the benchmark many times. In the context `cdmq`, this usually refers to the data that was generated for that run. -- iteration: A set of parameters to execute a benchmark, for example, for uperf: `test-type: stream, wsize: 256, nthreads: 16, duration: 90`. One or more unique iterations typically make up a run -- parameter: An option used for the underlying benchmark. Most parameters are unique to a specific benchmark. -- sample: An actual execution of an iteration. Often there will be multiple samples for an iteration, in order to provide an average and standard-deviation. -- period: A time-period for a sample. When a sample is executed, there may be one or more periods which represent a certain phase for the benchmark, like warmup, measurement, etc.
+- run: An invocation of a command, like `crucible run`, which facilitates the execution of a benchmark, often running the benchmark many times. In the context of `cdmq`, this usually refers to the data that was generated for that run. +- iteration: A set of parameters to execute a benchmark, for example, for uperf: `test-type: stream, wsize: 256, nthreads: 16, duration: 90`. One or more unique iterations typically make up a run. +- parameter: An option used for the underlying benchmark. Most parameters are unique to a specific benchmark. +- sample: An actual execution of an iteration. Often there will be multiple samples for an iteration, in order to provide an average and standard-deviation. +- period: A time-period for a sample. When a sample is executed, there may be one or more periods which represent a certain phase for the benchmark, like warmup, measurement, etc. - primary-period: A period where a benchmark's primary metric is measured -- primary-metric: A benchmark's most common metric, like `Gbps`, or `IOPS`. Each iteration has a primary-metric, but different iterations (different combinations of parameters) might have a different primary-metric. For example uperf samples with `test-type: stream` have a primary metric of `Gbps`, while `test-type: rr` uses `transactions-sec` +- primary-metric: A benchmark's most common metric, like `Gbps` or `IOPS`. Each iteration has a primary-metric, but different iterations (different combinations of parameters) might have a different primary-metric. For example, uperf samples with `test-type: stream` have a primary metric of `Gbps`, while `test-type: rr` uses `transactions-sec`. - metric: Some unit of measure, either a measure of throughput (work/time) or a "count" (elapsed-time, latency, level, occupancy, etc), or a simple "pass/fail" + ## Scripts -Below are documented most common scripts used for this project.
All of these scripts can be run via `node ./script-name.js`, and some have wrapper scripts `script-name.sh` which provide the casual user a more convenient invocation. If you are using [crucible](https://github.com/perftool-incubator/crucible), it may provide an alternative way to use this script (documented in each script subsection below). + +Below are documented the most common scripts used for this project. All of these scripts can be run via `node ./script-name.js`, and some have wrapper scripts `script-name.sh` which provide the casual user a more convenient invocation. If you are using [crucible](https://github.com/perftool-incubator/crucible), it may provide an alternative way to use these scripts (documented in each script subsection below). + ### get-result-summary.js -This script produces a summary of a single run., including tags, metrics present, as well as all the iterations and their samples. To run this script, you must specify a run-id: `node ./get-result-summary.js --run 0bda53c3-f0b2-416a-be54-cee738b75010`. If you are using the crucible project, you will likely be using the crucible command-line `crucible get result --run 0bda53c3-f0b2-416a-be54-cee738b75010`. In this example, the following output is produced: + +This script produces a summary of a single run, including its tags and the metrics present, as well as all the iterations and their samples. To run this script, you must specify a run-id: `node ./get-result-summary.js --run 0bda53c3-f0b2-416a-be54-cee738b75010`. If you are using the crucible project, you will likely be using the crucible command-line `crucible get result --run 0bda53c3-f0b2-416a-be54-cee738b75010`.
In this example, the following output is produced: run-id: 0bda53c3-f0b2-416a-be54-cee738b75010 tags: datapath=ovn-k-tc irq=bal kernel=4.18.0-305.34.2.el8_4.x86_64 mtu=1400 offload=False osruntime=chroot pods-per-worker=16 proto=tcp rcos=410.84.202202110840-0 scale_out_factor=1 sdn=OVNKubernetes test=stream topo=internode userenv=stream8 @@ -101,9 +107,12 @@ This script produces a summary of a single run., including tags, metrics present primary period-id: 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 period range: begin: 1648111546729 end: 1648111635267 result: (Gbps) samples: 6.37 6.26 6.30 mean: 6.31 min: 6.26 max: 6.37 stddev: 0.05 stddevpct: 0.87 -When investigating performance, users often start with a get-result-summary, and then drill-down to a specific instance to view various metrics. Note that all timestamps are millisecond epoch-time. + +When investigating performance, users often start with a get-result-summary, and then drill-down to a specific instance to view various metrics. Note that all timestamps are millisecond epoch-time. + ### get-metric-result.js -This script is used to dig deeper into the metrics (tool or benchmark data) found in a run or period. To find out which metrics are available for a run, look at the `metrics:` section from a get-result-summary.js output: + +This script is used to dig deeper into the metrics (tool or benchmark data) found in a run or period. To find out which metrics are available for a run, look at the `metrics:` section from a get-result-summary.js output: metrics: source: procstat @@ -125,11 +134,11 @@ This script is used to dig deeper into the metrics (tool or benchmark data) foun source: sar-tasks types: Context-switches-sec Processes-created-sec -This script requires either a `--period` option or a combination of `--run`, `--begin`, and `--end`, plus a `--source` and `--type`. 
In the following example, a query for uperf for Gbps is used: +This script requires either a `--period` option or a combination of `--run`, `--begin`, and `--end`, plus a `--source` and `--type`. In the following example, a query for uperf Gbps is used: # node ./get-metric-data.js --period 52FB1F1E-AD33-11EC-B16C-ADE96E3275F7 --source uperf --type Gbps This produces a JSON output for this metric: - + { "name": "uperf", "type": "Gbps", @@ -175,9 +184,9 @@ The same query can be used for tool data, such as sar: ] } -Note that the value for sar-net, L2-Gbps is quite different than what is reported for Uperf, Gbps. This can be the case for many reasons, but in this case let's show how the two are actually can be similar. First, one must understand that while these both report Gbps, the information comes from different sources. One is measuring Gbps as reported by the client program in uperf, and another is total network throughput for all systems which were running sar. +Note that the value for sar-net, L2-Gbps is quite different from what is reported for uperf, Gbps. This can be the case for many reasons, but in this case let's show how the two can actually be similar. First, one must understand that while these both report Gbps, the information comes from different sources. One is measuring Gbps as reported by the client program in uperf, and the other is the total network throughput for all systems which were running sar. -To help explain the difference, let's use the `breakout` function of the get-metric-data.js script. By default, the query is reporting this metric from all sources of `sar-net` and type `Gbps`. Depending on where the sar tool was used, it may be collecting this information from multiple hosts, and on those hosts from multiple network type (and specific interfaces, and a direction for each, and so on). These queries, by default, do not assume how the user wants to breakout and filter this metric.
The user can, however, choose to include any available breakout, which are found in the output: +To help explain the difference, let's use the `breakout` function of the get-metric-data.js script. By default, the query is reporting this metric from all sources of `sar-net` and type `Gbps`. Depending on where the sar tool was used, it may be collecting this information from multiple hosts, and on those hosts from multiple network types (and specific interfaces, a direction for each, and so on). These queries, by default, do not assume how the user wants to break out and filter this metric. The user can, however, choose to include any of the available breakouts, which are listed in the output: "breakouts": [ "csid", @@ -187,9 +196,8 @@ To help explain the difference, let's use the `breakout` function of the get-met "type" ] -A breakout will divide the metric into multiple metrics, one for each value of that breakout. For example, if metric data that was collected has a `csid` of 1 and 2, a breakout of csid will include two metrics. In the example below, a breakout for csid and cstype are used: +A breakout divides the metric into multiple metrics, one for each value of that breakout. For example, if the collected metric data has a `csid` of 1 and 2, a breakout of csid will include two metrics. In the example below, breakouts for csid and cstype are used: - # node ./get-metric-data.js --period 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 --source sar-net --type L2-Gbps --breakout csid,cstype { "name": "sar-net", @@ -217,13 +225,15 @@ A breakout will divide the metric into multiple metrics, one for each value of t "type" ] } -Now we can see that the metric is broken-out by `cs-type` and `cs-id`. These are terms used to describe some type of physical component in your test environment. These tests happen to be from Openshift, so the nodes where these benchmarks are run are `worker` (for cs-type) nodes with `1` and `2` (for cs-id).
However, this breakout is not enough to get close to the uperf metric, but we also have more breakouts available: + +Now we can see that the metric is broken out by `cs-type` and `cs-id`. These are terms used to describe some type of physical component in your test environment. These tests happen to be from OpenShift, so the nodes where these benchmarks are run are `worker` (for cs-type) nodes with `1` and `2` (for cs-id). This breakout alone is not enough to get close to the uperf metric, but more breakouts are available: "breakouts": [ "dev", "direction", "type" ] + So, let's use `type`, which breaks out the Gbps by virtual and physical interfaces: # node ./get-metric-data.js --period 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 --source sar-net --type L2-Gbps --breakout csid,cstype,type @@ -275,7 +285,8 @@ So, let's use `type`, which breaks out the Gbps by virtual and physical interfac "direction" ] } -We are one step closer, in that multiple metrics, such as `<1>--` and `<2>--` show a Gbps value which is close to what uperf reports. Uperf, however, reports the data transfer for the client, and when using the `stream` test-type, this is the writes that the client is doing, which would be Tx out the client and Rx into the server. To make this more clear where this is happening, let's use another breakout available to sar-net. Gbps, `direction`. + +We are one step closer, in that multiple metrics, such as `<1>--` and `<2>--`, show a Gbps value which is close to what uperf reports. Uperf, however, reports the data transfer for the client, and when using the `stream` test-type, this is the writes that the client is doing, which would be Tx out of the client and Rx into the server. To make it clearer where this is happening, let's use another breakout available to sar-net L2-Gbps: `direction`.
{ "name": "sar-net", @@ -357,10 +368,11 @@ We are one step closer, in that multiple metrics, such as `<1>-----` has 6.786 Gbps, not quite the same sas uperf, but uperf reports Gbps for the messages in the program, and not the Gbps for the additional headers for TCP, IP, and Ethernet. This still does not show the Gbps for a specific interface, but that can be done with another breakout. However, as shown above, more breakouts generally produces more output, some of which you may want to filter. This can be accomplished in two ways, by limiting the value for a breakout and limiting the metrics based on the metric-value. The following uses both of these methods: + +When evaluating these breakouts, we can see that `<1>---` has 6.786 Gbps, not quite the same as uperf, but uperf reports Gbps for the messages in the program, and not the Gbps for the additional headers for TCP, IP, and Ethernet. This still does not show the Gbps for a specific interface, but that can be done with another breakout. However, as shown above, more breakouts generally produce more output, some of which you may want to filter. This can be accomplished in two ways: by limiting the values for a breakout and by limiting the metrics based on the metric-value. The following uses both of these methods: # node ./get-metric-data.js --period 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 --source sar-net --type L2-Gbps --breakout csid=1,cstype=worker,type=physical,direction,dev --filter gt:0.01 - + { "name": "sar-net", "type": "L2-Gbps", @@ -383,7 +395,81 @@ When evaluating these breakouts, we can see that `<1>---` }, "breakouts": [] } -So far all of the metrics have been represented as a single value for a specific time period. When `--period` is used, the script finds the begin and end times for this period, which in most cases, has a duration equal to the measurement time in the benchmark itself (around 90 seconds in these examples).
One can also specify `--run`, `--begin`, and `--end` instead of `--period`, should they need to focus on a different period of time. However, for benchmark metrics (such as uperf), it is important to limit the begin and end to within the actual measurement period for that sample. Conversely, tool metrics can use a begin and end spanning any time period within the run, as the tool collection tends to run continuously for any particular run. Whatever time period is used, one can also use `--resolution` to divide this time period into multiple data-samples, in order to generate things like line graphs: + +#### Specifying Multiple Values for a Breakout + +In addition to filtering a breakout to a single value (e.g., `csid=1`), you can now specify multiple values for a breakout field using comma-separated values. This will return separate metrics for each specified value. + +For example, to get metrics for both worker nodes 1 and 2: + + # node ./get-metric-data.js --period 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 --source sar-net --type L2-Gbps --breakout csid=1,2,cstype=worker,type=physical + +This will return two separate metrics: one for `csid=1` and one for `csid=2`, without including metrics for any other csid values that might exist in the data. 
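The cdm.js changes later in this diff implement this by switching from a `term` filter to a `terms` (plural) filter when more than one value is given. A minimal standalone sketch of that translation (the helper name `breakoutToFilter` is illustrative, not code from the patch):

```javascript
// Sketch: translate a breakout field plus its value(s) into an
// OpenSearch filter clause. A single value becomes a "term" filter;
// a comma-separated list becomes a "terms" (plural) filter.
function breakoutToFilter(field, value) {
  const values = value.split(',');
  if (values.length > 1) {
    return { terms: { ['metric_desc.names.' + field]: values } };
  }
  return { term: { ['metric_desc.names.' + field]: value } };
}

console.log(JSON.stringify(breakoutToFilter('csid', '1,2')));
// {"terms":{"metric_desc.names.csid":["1","2"]}}
```

Each of the two resulting metrics is then labeled with its own csid value, which is why the output contains separate entries rather than one aggregate.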
+ +**Important**: The comma separator has different meanings depending on context: + +- Between different breakout fields: `csid,cstype` means break out by both csid AND cstype +- Within a value list: `csid=1,2` means break out by csid, but only include values 1 and 2 +- Mixed usage: `csid=1,2,cstype=worker` means break out by csid (only values 1,2) and cstype (only value worker) + +This feature is particularly useful when: + +- You want to compare specific hosts or components without seeing all possible values +- You need to reduce output by focusing on a subset of values +- You want to query multiple specific values in a single command instead of running separate queries + +**Note**: Each comma-separated value in a breakout filter (e.g., `csid=1,2`) will produce separate metrics in the output, not an aggregated metric. To aggregate specific values into a single metric, use regex alternation with uppercase `R`: `hostname=R/worker-1|worker-2|worker-3/` (see regex section below). + +#### Using Regular Expressions in Breakouts + +In addition to specifying exact values or lists of values, you can use regular expressions to match multiple values dynamically. This is particularly useful when you want to match values that follow a pattern without knowing all possible values in advance. + +**Syntax**: Use `r/pattern/` for separate metrics (one per match) or `R/pattern/` for aggregated metrics (all matches combined). 
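The `r/…/` versus `R/…/` distinction can be recognized with the same regular expression the cdm.js patch below uses (`/^([rR])(.)(.+)\2$/`). A small standalone sketch (the function name `parseRegexBreakout` is illustrative):

```javascript
// Sketch: detect an r/pattern/ or R/pattern/ breakout value.
// Group 1: 'r' (separate metrics) or 'R' (aggregate all matches),
// group 2: the delimiter character, group 3: the pattern itself;
// the \2 backreference requires the closing delimiter to match.
function parseRegexBreakout(value) {
  const m = /^([rR])(.)(.+)\2$/.exec(value);
  if (m === null) return null; // not a regex breakout; treat as literal value(s)
  return { aggregate: m[1] === 'R', pattern: m[3] };
}

console.log(parseRegexBreakout('r/worker-.*/'));
// { aggregate: false, pattern: 'worker-.*' }
console.log(parseRegexBreakout('R|/dev/sd.*|'));
// { aggregate: true, pattern: '/dev/sd.*' }
```

Because the delimiter is whatever character follows `r`/`R`, a pattern containing slashes can simply use a different delimiter, as the second call shows.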
+ +- **Lowercase `r`**: Each value matching the pattern gets its own metric (similar to `hostname=a,b,c`) +- **Uppercase `R`**: All values matching the pattern are aggregated into a single metric + +**Examples:** + +```bash +# Get separate metrics for all worker nodes matching the pattern +node ./get-metric-data.js --period <period-id> --source mpstat --type Busy-CPU --breakout 'hostname=r/worker-.*/' + +# Get a single aggregated metric for all client nodes +node ./get-metric-data.js --period <period-id> --source sar-net --type L2-Gbps --breakout 'hostname=R/client-.*/' + +# Aggregate specific values using alternation (|) +node ./get-metric-data.js --period <period-id> --source mpstat --type Busy-CPU --breakout 'hostname=R/worker-1|worker-2|worker-3/' + +# Mix regex with other filters +node ./get-metric-data.js --period <period-id> --source mpstat --type Busy-CPU --breakout 'hostname=r/worker-[0-9]+/,cstype=physical' + +# Use a different delimiter if the pattern contains slashes +node ./get-metric-data.js --period <period-id> --source iostat --type kB-sec --breakout 'dev=r|/dev/sd.*|' +``` + +In the examples above, `<period-id>` is a placeholder for an actual period UUID, and the breakout arguments are quoted so the shell does not interpret `|` or glob characters. + +**Custom Delimiter**: The character immediately after `r` or `R` is used as the delimiter. While `/` is conventional, you can use any character (like `|`, `#`, `@`, `~`) if your pattern contains forward slashes. + +**Regular Expression Syntax**: The patterns use OpenSearch regex syntax, which is similar to standard regex but with some differences.
Common patterns include: + +- `.*` - Match any characters (zero or more) +- `.+` - Match any characters (one or more) +- `[0-9]` - Match any digit +- `[a-z]` - Match any lowercase letter +- `(a|b)` - Match 'a' or 'b' + +Note that OpenSearch (Lucene) regex patterns are implicitly anchored: a pattern must match the entire field value, so the `^` and `$` anchors are neither needed nor supported. + +**Use Cases:** + +- Match all nodes of a certain type: `hostname=r/worker-.*/` +- Match numbered resources: `cpu=r/[0-9]+/` +- Match specific patterns: `device=r/eth[0-9].*/` +- Exclude certain patterns: Lucene regex does not support negative lookahead; the `~` (complement) operator can be used instead + +**Performance Note**: Regex patterns are evaluated by OpenSearch and may be slower than exact value matches for very large datasets. Use them when the flexibility is needed. + +So far all of the metrics have been represented as a single value for a specific time period. When `--period` is used, the script finds the begin and end times for this period, which, in most cases, has a duration equal to the measurement time in the benchmark itself (around 90 seconds in these examples). One can also specify `--run`, `--begin`, and `--end` instead of `--period`, should they need to focus on a different period of time. However, for benchmark metrics (such as uperf), it is important to limit the begin and end to within the actual measurement period for that sample. Conversely, tool metrics can use a begin and end spanning any time period within the run, as the tool collection tends to run continuously for any particular run.
Whatever time period is used, one can also use `--resolution` to divide this time period into multiple data-samples, in order to generate things like line graphs: # node ./get-metric-data.js --period 4F1014D6-AD33-11EC-94E3-ADE96E3275F7 --source sar-net --type L2-Gbps --breakout csid=1,cstype=worker,type=physical,direction=tx,dev --filter gt:0.01 --resolution 10 Checking for httpd...appears to be running @@ -448,24 +534,27 @@ So far all of the metrics have been represented as a single specific }, "breakouts": [] } + ### compare-results.js -This script is used to generate comparisons across one or more runs and provides to the ability to tailor how iterations are grouped when comparing them. This script is particularly useful when you run a benchmark with different settings in your test-bed. For example, if you were to test a MTU of 1500 and then 9000, you could use this script to generate output that compares the two runs. You are, however, not limited to two runs, and you are not actually required to specify the run IDs at all. - -`compare-results.js` has two primary purposes. The first is to assemble the iterations to want to compare. This is done with options to the script: - * `--filter-by-params` - * `--filter-by-tags` - * `--filter-by-age` - * `--add-runs` - * `--add-iterations` - - When using the `--filter-by-*` options, iterations are queried from all three filters and then intersected. Users can focus on specific benchmark params and test-bed configurations, for example: - - `node ./compare-results.js --filter-by-params test-type:stream --filter-by-tags study:protocols --dont-breakout-params protocol` + +This script is used to generate comparisons across one or more runs and provides the ability to tailor how iterations are grouped when comparing them. This script is particularly useful when you run a benchmark with different settings in your test-bed.
For example, if you were to test a MTU of 1500 and then 9000, you could use this script to generate output that compares the two runs. You are, however, not limited to two runs, and you are not actually required to specify the run IDs at all. + +`compare-results.js` has two primary purposes. The first is to assemble the iterations you want to compare. This is done with options to the script: + +- `--filter-by-params` +- `--filter-by-tags` +- `--filter-by-age` +- `--add-runs` +- `--add-iterations` + +When using the `--filter-by-*` options, iterations are queried from all three filters and then intersected. Users can focus on specific benchmark params and test-bed configurations, for example: + +`node ./compare-results.js --filter-by-params test-type:stream --filter-by-tags study:protocols --dont-breakout-params protocol` All common tags: tuned:throughput-performance dir:forward study:protocols All common params: test-type:stream nthreads:1 duration:120 - - + + label mean stddevpct iter-id nthreads:1 wsize:256 @@ -498,16 +587,16 @@ This script is used to generate comparisons across one or more runs and provides wsize:32768 protocol:tcp 122.5500 41.2100 CFEFB0A2-B9EA-11EC-A682-01EC7B3275F7 -In the output above `--dont-breakout-params protocol` forces the `protocol` param to be pushed to the label instead of broken-out on the left. In most cases, the user will choose at least one param and/or one tag to not breakout, in order to create a "cluster" of results with labels (which can later be used to form a clustered bar chart). +In the output above, `--dont-breakout-params protocol` forces the `protocol` param to be pushed to the label instead of being broken out on the left. In most cases, the user will choose at least one param and/or one tag not to break out, in order to create a "cluster" of results with labels (which can later be used to form a clustered bar chart).
Users can control both what gets pushed to the label, as well as the order of the breakout with `--breakout-order-params`: # node ./compare-results.js --filter-by-params test-type:stream --filter-by-tags study:protocols --dont-breakout-params wsize --breakout-order-params protocol,threads - + All common tags: study:protocols All common params: test-type:stream duration:120 - - + + label mean stddevpct iter-id protocol:tcp nthreads:1 @@ -536,4 +625,4 @@ Users can control both what gets pushed to the label, as well as the order of th Also, `--breakout-order-tags` and `--dont-breakout-tags` are also available with similar functions. -Note that that while not required, `--filter-by-age` has a default of `0-30`, which filters iterations between 0 to 30 *days* old. This default is used so that queries do not unnecessarily query very old run data (unless you select a different age range). +Note that while not required, `--filter-by-age` has a default of `0-30`, which filters iterations between 0 and 30 _days_ old. This default is used so that queries do not unnecessarily query very old run data (unless you select a different age range).
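The intersection step described above ("iterations are queried from all three filters and then intersected") can be sketched as follows; this is an illustrative helper written for this README, not the actual compare-results.js implementation:

```javascript
// Sketch: intersect the iteration-id lists returned by each
// --filter-by-* query; only iterations matched by every filter survive.
function intersectIterationSets(sets) {
  if (sets.length === 0) return [];
  return sets.reduce((acc, set) => acc.filter((id) => set.includes(id)));
}

const byParams = ['iterA', 'iterB', 'iterC']; // from --filter-by-params
const byTags = ['iterB', 'iterC', 'iterD'];   // from --filter-by-tags
const byAge = ['iterC', 'iterB'];             // from --filter-by-age
console.log(intersectIterationSets([byParams, byTags, byAge]));
// [ 'iterB', 'iterC' ]
```

This is why narrowing any one filter (for example, a smaller `--filter-by-age` range) can only shrink the final set of compared iterations.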
diff --git a/queries/cdmq/cdm.js b/queries/cdmq/cdm.js index 703b77ac..bf5af706 100644 --- a/queries/cdmq/cdm.js +++ b/queries/cdmq/cdm.js @@ -2425,21 +2425,37 @@ getBreakoutAggregation = function (source, type, breakout) { breakout.forEach((field) => { //if (/([^\=]+)\=([^\=]+)/.exec(field)) { var matches = regExp.exec(field); + var shouldAggregate = true; // default: include in aggregation + if (matches) { //field = $1; - field = matches[1]; + var fieldName = matches[1]; + var value = matches[2]; + + // Check if this is an aggregated regex pattern (R/pattern/) + // If uppercase R, we should NOT add this field to the aggregation + // (all matches will be combined into a single metric) + if (/^R./.test(value)) { + shouldAggregate = false; + } + + field = fieldName; + } + + // Only add to aggregation if shouldAggregate is true + if (shouldAggregate) { + agg_str += + ',"aggs": { "metric_desc.names.' + + field + + '": { "terms": ' + + '{ "show_term_doc_count_error": true, "size": ' + + bigQuerySize + + ',' + + '"field": "metric_desc.names.' + + field + + '" }'; + field_count++; } - agg_str += - ',"aggs": { "metric_desc.names.' + - field + - '": { "terms": ' + - '{ "show_term_doc_count_error": true, "size": ' + - bigQuerySize + - ',' + - '"field": "metric_desc.names.' 
+ - field + - '" }'; - field_count++; }); while (field_count > 0) { agg_str += '}}'; @@ -2493,6 +2509,7 @@ mgetMetricIdsFromTerms = async function (instance, termsSets, yearDotMonth) { for (i = 0; i < termsSets.length; i++) { var periId = termsSets[i].period; var runId = termsSets[i].run; + var regexpFilters = termsSets[i].regexpFilters || []; Object.keys(termsSets[i].termsByLabel) .sort() .forEach((label) => { @@ -2513,6 +2530,12 @@ mgetMetricIdsFromTerms = async function (instance, termsSets, yearDotMonth) { if (runId != null) { q.query.bool.filter.push(JSON.parse('{"term": {"run.run-uuid": "' + runId + '"}}')); } + // Apply any regexp filters that were excluded from aggregation + regexpFilters.forEach((rf) => { + q.query.bool.filter.push( + JSON.parse('{"regexp": {"metric_desc.names.' + rf.field + '": ' + JSON.stringify(rf.pattern) + '}}') + ); + }); jsonArr.push('{}'); jsonArr.push(JSON.stringify(q)); totalReqs++; @@ -2609,14 +2632,48 @@ getMetricGroupsFromBreakouts = async function (instance, sets, yearDotMonth) { q.query.bool.filter.push(JSON.parse('{"term": {"run.run-uuid": "' + set.run + '"}}')); } // If the breakout contains a match requirement (something like "host=myhost"), then we must add a term filter for it. - // Eventually it would be nice to have something other than a match, like a regex: host=/^client/. + // Multiple values can be specified with commas: "host=a,b,c" which will match any of those values. + // Regex patterns can be specified with r/pattern/ (separate metrics) or R/pattern/ (aggregated metric). var regExp = /([^\=]+)\=([^\=]+)/; set.breakout.forEach((field) => { var matches = regExp.exec(field); if (matches) { field = matches[1]; value = matches[2]; - q.query.bool.filter.push(JSON.parse('{"term": {"metric_desc.names.' 
+ field + '": "' + value + '"}}'));
+
+      // Check if it's a regex pattern: r/pattern/ or R/pattern/
+      //   Group 1: r or R (lowercase = separate metrics, uppercase = aggregated)
+      //   Group 2: delimiter character (usually /, but can be any char)
+      //   Group 3: the actual regex pattern
+      //   \2: backreference to ensure matching closing delimiter
+      var regexMatch = /^([rR])(.)(.+)\2$/.exec(value);
+
+      if (regexMatch) {
+        // It's a regex pattern
+        var isAggregated = regexMatch[1] === 'R';
+        var delimiter = regexMatch[2]; // captured only for the \2 backreference
+        var pattern = regexMatch[3];
+
+        // Add regexp filter to OpenSearch query
+        // Both r/pattern/ and R/pattern/ use the same filter,
+        // the difference is in the aggregation (handled in getBreakoutAggregation)
+        q.query.bool.filter.push(
+          JSON.parse('{"regexp": {"metric_desc.names.' + field + '": ' + JSON.stringify(pattern) + '}}')
+        );
+      } else {
+        // Not a regex pattern, handle as literal value(s)
+        // Check if the value contains multiple comma-separated values
+        var values = value.split(',');
+        if (values.length > 1) {
+          // Multiple values: use "terms" query (note the plural)
+          q.query.bool.filter.push(
+            JSON.parse('{"terms": {"metric_desc.names.' + field + '": ' + JSON.stringify(values) + '}}')
+          );
+        } else {
+          // Single value: use "term" query (singular)
+          q.query.bool.filter.push(JSON.parse('{"term": {"metric_desc.names.' +
+            field + '": "' + value + '"}}'));
+        }
+      }
     }
   });
   q.aggs = aggs;
@@ -2635,10 +2692,33 @@ getMetricGroupsFromBreakouts = async function (instance, sets, yearDotMonth) {
       var metricGroupTerms = getMetricGroupTermsFromAgg(responses[idx].aggregations);
       // Derive the label from each group and organize into a dict, key = label, value = the filter terms
       var metricGroupTermsByLabel = getMetricGroupTermsByLabel(metricGroupTerms);
+
+      // Extract regexp filters that were excluded from aggregation (R/pattern/)
+      // These need to be preserved when querying for metric IDs
+      var regexpFilters = [];
+      var regExp = /([^\=]+)\=([^\=]+)/;
+      sets[idx].breakout.forEach((field) => {
+        var matches = regExp.exec(field);
+        if (matches) {
+          var fieldName = matches[1];
+          var value = matches[2];
+          var regexMatch = /^([rR])(.)(.+)\2$/.exec(value);
+          if (regexMatch) {
+            var isAggregated = regexMatch[1] === 'R';
+            var pattern = regexMatch[3];
+            if (isAggregated) {
+              // This field was excluded from aggregation, need to preserve the regexp filter
+              regexpFilters.push({ field: fieldName, pattern: pattern });
+            }
+          }
+        }
+      });
+
       var thisLabelSet = {
         run: sets[idx].run,
         period: sets[idx].period,
-        termsByLabel: metricGroupTermsByLabel
+        termsByLabel: metricGroupTermsByLabel,
+        regexpFilters: regexpFilters
       };
       termsSets.push(thisLabelSet);
     }
@@ -3189,6 +3269,46 @@ getMetricDataSets = async function (instance, sets, yearDotMonth) {
     return { 'ret-code': retCode, 'ret-msg': retMsg };
   }
   var metricGroupIdsByLabelSets = resp['metric-id-sets'];
+
+  // Check if any regex filters resulted in zero matches
+  for (var idx = 0; idx < metricGroupIdsByLabelSets.length; idx++) {
+    if (Object.keys(metricGroupIdsByLabelSets[idx]).length === 0) {
+      // This set has no metric groups - check if it was due to a regex filter
+      var regexFilters = [];
+      var regExp = /([^\=]+)\=([^\=]+)/;
+      sets[idx].breakout.forEach((field) => {
+        var matches = regExp.exec(field);
+        if (matches) {
+          var fieldName = matches[1];
+          var value = matches[2];
+          // Check for the full r/pattern/ syntax so literal values starting with "r" are not misreported
+          if (/^([rR])(.)(.+)\2$/.test(value)) {
+            regexFilters.push({ field: fieldName, pattern: value });
+          }
+        }
+      });
+
+      if (regexFilters.length > 0) {
+        // Build helpful error message
+        retMsg =
+          'No metrics found matching the specified filter(s) for source=' +
+          sets[idx].source +
+          ', type=' +
+          sets[idx].type;
+        regexFilters.forEach((rf) => {
+          retMsg += '\n  Regex filter ' + rf.field + '=' + rf.pattern + ' did not match any values.';
+        });
+        retMsg += '\nPlease verify:';
+        retMsg += '\n  1. The regex pattern is correct';
+        retMsg += '\n  2. Metrics exist for this source/type with the specified field';
+        retMsg += '\n  3. The field values match the pattern';
+        retCode = 1;
+        return { 'ret-code': retCode, 'ret-msg': retMsg };
+      }
+      // If no regex filters, continue with existing error handling
+    }
+  }
+
   var dataSets = await getMetricDataFromIdsSets(instance, sets, metricGroupIdsByLabelSets, yearDotMonth);
 
   if (dataSets.length != sets.length) {
diff --git a/queries/cdmq/get-metric-data.js b/queries/cdmq/get-metric-data.js
index 38373eab..0f658e90 100644
--- a/queries/cdmq/get-metric-data.js
+++ b/queries/cdmq/get-metric-data.js
@@ -15,7 +15,53 @@ var sprintf = require('sprintf-js').sprintf;
 var instances = []; // opensearch instances
 
 function list(val) {
-  return val.split(',');
+  // Parse breakout strings to handle:
+  //   - Simple breakouts: "hostname,cpu" -> ["hostname", "cpu"]
+  //   - Breakouts with values: "hostname=a,cpu=x" -> ["hostname=a", "cpu=x"]
+  //   - Breakouts with multiple values: "hostname=a,b,cpu=x,y" -> ["hostname=a,b", "cpu=x,y"]
+  //
+  // The key insight: a comma separates breakout fields UNLESS we're currently
+  // parsing a value list (after '=' and before the next field with '=')
+
+  var result = [];
+  var current = '';
+  var inValueList = false;
+  var parts = val.split(',');
+
+  for (var i = 0; i < parts.length; i++) {
+    var part = parts[i];
+    var hasEqual = part.includes('=');
+
+    if (inValueList && !hasEqual) {
+      // We're in a value list and this part doesn't have '=', so it's another value
+      current += ',' + part;
+    } else if (inValueList && hasEqual) {
+      // We were in a value list, but now we hit a new key=value pair
+      result.push(current);
+      current = part;
+      inValueList = true;
+    } else if (!inValueList && hasEqual) {
+      // Starting a new key=value pair
+      if (current !== '') {
+        result.push(current);
+      }
+      current = part;
+      inValueList = true;
+    } else {
+      // !inValueList && !hasEqual - simple breakout field without value filter
+      if (current !== '') {
+        result.push(current);
+      }
+      current = part;
+      inValueList = false;
+    }
+  }
+
+  if (current !== '') {
+    result.push(current);
+  }
+
+  return result;
 }
 
 function save_host(host) {