@trevorwhitney
Collaborator

What this PR does / why we need it:
This PR refactors the parse operations (logfmt and json) so that each is implemented as an operation expression on an expand projection, similar to unwrap, rather than as a custom pipeline as before. To support this, the PR introduces a new operation type, FunctionOp, which accepts any number of Values/Expressions as arguments. All arguments are evaluated before being passed to the registered function, which is registered on the op type alone (not on the argument types, since the number of arguments varies).
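
As a rough illustration of the dispatch described above, here is a minimal sketch in which every argument expression is evaluated first and the function is then looked up by op alone. All types here (Expression, ColumnRef, Literal, the registry map) are simplified stand-ins invented for this example, not the PR's actual definitions.

```go
package main

import "fmt"

// FunctionOp identifies a variadic function. Hypothetical stand-in type.
type FunctionOp int

const (
	FunctionOpParseLogfmt FunctionOp = iota
	FunctionOpParseJSON
)

// Expression is anything that can be evaluated against an input row.
type Expression interface {
	Eval(row map[string]string) (any, error)
}

// ColumnRef reads a column from the input row.
type ColumnRef struct{ Name string }

func (c ColumnRef) Eval(row map[string]string) (any, error) {
	v, ok := row[c.Name]
	if !ok {
		return nil, fmt.Errorf("column %q not found", c.Name)
	}
	return v, nil
}

// Literal evaluates to a fixed value.
type Literal struct{ Value any }

func (l Literal) Eval(map[string]string) (any, error) { return l.Value, nil }

// Function receives arguments that have already been evaluated.
type Function func(args []any) (any, error)

// registry is keyed on the op only, because the argument list is variable.
var registry = map[FunctionOp]Function{}

// FunctionExpr is an op plus any number of argument expressions.
type FunctionExpr struct {
	Op          FunctionOp
	Expressions []Expression
}

// Eval evaluates every argument, then dispatches on the op.
func (e FunctionExpr) Eval(row map[string]string) (any, error) {
	args := make([]any, len(e.Expressions))
	for i, arg := range e.Expressions {
		v, err := arg.Eval(row)
		if err != nil {
			return nil, err
		}
		args[i] = v
	}
	fn, ok := registry[e.Op]
	if !ok {
		return nil, fmt.Errorf("no function registered for op %d", e.Op)
	}
	return fn(args)
}

func main() {
	// Toy stand-in for a parser: reports how many arguments it received.
	registry[FunctionOpParseLogfmt] = func(args []any) (any, error) {
		return len(args), nil
	}
	expr := FunctionExpr{
		Op:          FunctionOpParseLogfmt,
		Expressions: []Expression{ColumnRef{Name: "message"}, Literal{Value: []string{"level"}}},
	}
	out, err := expr.Eval(map[string]string{"message": "level=info"})
	fmt.Println(out, err)
}
```

Keying the lookup on the op alone keeps registration simple, at the cost of each function having to validate its own argument count and types.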

I also introduced a NamedLiteralExpr, which is just a literal with a name. I thought this made adding the requested-keys optimization a bit cleaner, but since it's just a literal under the hood, I'm happy to remove it if we think it's unnecessary.

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@trevorwhitney trevorwhitney requested a review from a team as a code owner October 23, 2025 17:28

case *physical.NamedLiteralExpr:
return &Scalar{
value: expr.Literal,
Collaborator Author

The literal being used here in the case of parse is actually an array, which I'm aware is technically not a scalar. The value being passed here, though, is a pointer to that array, which one could argue is a scalar. That being said, let me know if you'd prefer another type. My thought was to avoid adding one so we don't need to type check the incoming literal.

Collaborator Author

these were moved to the executor tests above

return ExprTypeUnary
}

type NamedLiteralExpr struct {
Collaborator Author

I'm open to pushback in favor of just using a Literal instead, but I added this because I think it makes the optimize code cleaner: we can look specifically for the requestedKeys literal when pushing down projections, rather than looking for any literal of the right type.

Member

Named literals do feel a little weird to me, especially just for an optimize pass, and when the position of the arguments does matter (the second argument must be the requested keys, not any argument that's a NamedLiteral of requestedKeys).

I think if we want to make the optimize pass cleaner, we could unpack argument slices into a struct instead:

type parseArguments struct {
  columnToParse Expression 
  requestedKeys Expression 
}

// Unpack unpacks the expression from src into args. Unpack returns 
// an error if there are not exactly 1 or 2 arguments:
//
//  - parse(columnToParse)
//  - parse(columnToParse, requestedKeys)
func (args *parseArguments) Unpack(src []Expression) error { ... } 

// Pack packs args into a dst slice. Returns a new slice if dst isn't 
// large enough. 
func (args *parseArguments) Pack(dst []Expression) []Expression { ... }

Then your optimization pass could use this:

func (r *projectionPushdown) handleParse(expr *FunctionExpr, ...) ([]ColumnExpression, bool) {
  var args parseArguments 
  if err := args.Unpack(expr.Expressions); err != nil {
    // Panic, I guess? 
  }

  if args.requestedKeys == nil {
    // Initialize args.requestedKeys  
  } 
  
  existingKeys, ok := args.requestedKeys.(types.StringListLiteral) 
  ...

  // Copy back over into the FunctionExpr. 
  expr.Expressions = args.Pack(expr.Expressions)
}
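
To make the suggestion concrete, here is one possible way the elided Unpack and Pack bodies could look. This is only a sketch: the Expression interface and the ident type are hypothetical placeholders, and the real planner types may call for different handling.

```go
package main

import "fmt"

// Expression is a placeholder for the planner's expression interface.
type Expression interface{ String() string }

// ident is a hypothetical expression used only for this demonstration.
type ident string

func (i ident) String() string { return string(i) }

type parseArguments struct {
	columnToParse Expression
	requestedKeys Expression // nil when the caller did not supply it
}

// Unpack copies positional arguments from src into args. It returns an
// error unless src has exactly 1 or 2 elements.
func (args *parseArguments) Unpack(src []Expression) error {
	switch len(src) {
	case 1:
		args.columnToParse, args.requestedKeys = src[0], nil
	case 2:
		args.columnToParse, args.requestedKeys = src[0], src[1]
	default:
		return fmt.Errorf("parse expects 1 or 2 arguments, got %d", len(src))
	}
	return nil
}

// Pack writes args back into dst in positional order. append reuses dst's
// backing array when it is large enough and allocates a new one otherwise.
func (args *parseArguments) Pack(dst []Expression) []Expression {
	dst = append(dst[:0], args.columnToParse)
	if args.requestedKeys != nil {
		dst = append(dst, args.requestedKeys)
	}
	return dst
}

func main() {
	exprs := []Expression{ident("builtin.message")}

	var args parseArguments
	if err := args.Unpack(exprs); err != nil {
		panic(err)
	}
	if args.requestedKeys == nil {
		// Stand-in for initializing an empty requested-keys literal.
		args.requestedKeys = ident("[]")
	}
	exprs = args.Pack(exprs)
	fmt.Println(len(exprs), exprs[0], exprs[1])
}
```

Because the struct fixes each argument's position, the optimizer can address requestedKeys by field name without needing a named literal to find it.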

@trevorwhitney trevorwhitney force-pushed the twhitney/refactor-parse branch from f024161 to 40db5ef Compare October 23, 2025 18:01
Comment on lines 52 to 53
case *UnaryOp:
return b.processUnaryOp(value)
Member

Did you mean to remove UnaryOp here?

Collaborator Author

no, I did not, thank you

if u.reg == nil {
u.reg = make(map[types.FunctionOp]Function)
}
// TODO(twhitney): Should the function panic when duplicate keys are registered?
Member

Yeah probably, plus since it'd panic in the init we'd catch it immediately in unit tests rather than being confused about why we're not using the implementation of a function we expected.
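
A minimal sketch of what panicking on duplicate registration might look like. The functionRegistry type and the Function signature are simplified assumptions for this example, not the PR's definitions.

```go
package main

import "fmt"

// FunctionOp and Function are simplified stand-ins for the PR's types.
type FunctionOp int

type Function func(args []string) string

type functionRegistry struct {
	reg map[FunctionOp]Function
}

// register stores fn under op and panics if op is already registered, so a
// conflicting registration fails loudly at init time instead of silently
// replacing the earlier implementation.
func (r *functionRegistry) register(op FunctionOp, fn Function) {
	if r.reg == nil {
		r.reg = make(map[FunctionOp]Function)
	}
	if _, exists := r.reg[op]; exists {
		panic(fmt.Sprintf("duplicate function registration for op %d", op))
	}
	r.reg[op] = fn
}

func main() {
	var r functionRegistry
	r.register(1, func([]string) string { return "first" })

	// The second registration for the same op panics; recover only to
	// show the message in this demo.
	defer func() { fmt.Println("recovered:", recover()) }()
	r.register(1, func([]string) string { return "second" })
}
```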

}

if sourceColVec == nil {
return nil, nil, fmt.Errorf("parse function arguments did no include a source ColumnVector to parse")
Member

Suggested change
return nil, nil, fmt.Errorf("parse function arguments did no include a source ColumnVector to parse")
return nil, nil, fmt.Errorf("parse function arguments did not include a source ColumnVector to parse")

}, input)
var requestedKeys []string
if requestedKeysColVec != nil {
reqKeysValue := requestedKeysColVec.Value(0)
Member

Can we assert that the requestedKeysColVec must be a scalar? Otherwise I think the behaviour will be a little confusing if you happen to pass in an actual vector but only the first row gets used.

Collaborator Author

@trevorwhitney trevorwhitney Oct 24, 2025

With #19549 we no longer have scalars, so I might need to check that it's a StringListLiteral instead? I'll rebase on @chaudum's changes and investigate.
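
One way such a check could look, as a sketch only. The Literal interface and both literal types below are hypothetical stand-ins defined for this example; the real types.StringListLiteral and the post-#19549 column vector API may differ.

```go
package main

import "fmt"

// Literal is a hypothetical stand-in for the planner's literal interface.
type Literal interface{ Any() any }

// StringLiteral is a single string value.
type StringLiteral string

func (l StringLiteral) Any() any { return string(l) }

// StringListLiteral is a list of strings, such as the requested keys.
type StringListLiteral []string

func (l StringListLiteral) Any() any { return []string(l) }

// requestedKeysFrom returns the keys when arg is a string list literal.
// A nil arg means no keys were requested; any other literal type is
// rejected rather than silently using only part of the value.
func requestedKeysFrom(arg Literal) ([]string, error) {
	if arg == nil {
		return nil, nil
	}
	list, ok := arg.(StringListLiteral)
	if !ok {
		return nil, fmt.Errorf("requested keys must be a string list literal, got %T", arg)
	}
	return []string(list), nil
}

func main() {
	keys, err := requestedKeysFrom(StringListLiteral{"bar", "request_duration"})
	fmt.Println(keys, err)

	_, err = requestedKeysFrom(StringLiteral("bar"))
	fmt.Println(err)
}
```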


// Clone returns a copy of the [FunctionExpr].
func (e *FunctionExpr) Clone() Expression {
params := make([]Expression, len(e.Expressions))
Member

You can use cloneExpressions(e.Expressions) here
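
For readers unfamiliar with the helper, a deep-copy function of this kind could look roughly like the sketch below. The Expression interface and columnExpr type are assumptions for this example, and the repository's actual cloneExpressions may have a different signature.

```go
package main

import "fmt"

// Expression is a simplified stand-in for the planner's interface.
type Expression interface {
	Clone() Expression
	String() string
}

// columnExpr is a hypothetical expression used only for this demo.
type columnExpr struct{ name string }

func (c *columnExpr) Clone() Expression { cp := *c; return &cp }
func (c *columnExpr) String() string    { return c.name }

// cloneExpressions returns a new slice holding a clone of each element,
// so mutating the originals does not affect the copies.
func cloneExpressions(exprs []Expression) []Expression {
	out := make([]Expression, len(exprs))
	for i, e := range exprs {
		out[i] = e.Clone()
	}
	return out
}

// FunctionExpr holds the argument expressions of a function call.
type FunctionExpr struct {
	Expressions []Expression
}

// Clone delegates the per-element copying to cloneExpressions.
func (e *FunctionExpr) Clone() *FunctionExpr {
	return &FunctionExpr{Expressions: cloneExpressions(e.Expressions)}
}

func main() {
	orig := &FunctionExpr{Expressions: []Expression{&columnExpr{name: "message"}}}
	cp := orig.Clone()

	// Mutating the original leaves the clone untouched.
	orig.Expressions[0].(*columnExpr).name = "changed"
	fmt.Println(orig.Expressions[0], cp.Expressions[0])
}
```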

}, nil

case *physical.NamedLiteralExpr:
return &Scalar{
Contributor

I merged #19549 earlier today, so Scalar won't be available any more. Use NewScalar(expr.Literal, input.NumRows()) instead.

GetForSignature(types.FunctionOp) (Function, error)
}

type Function interface {
Contributor

nit: Should we call it VariadicFunction?

Collaborator Author

Haha, I had that in one iteration. Naming was hard and I went through a few options, but I'm happy to use Variadic.

Comment on lines 158 to 165
args := make([]ColumnVector, len(expr.Expressions))
for i, arg := range expr.Expressions {
p, err := e.eval(arg, input)
if err != nil {
return nil, err
}
args[i] = p
}
Contributor

I think we need to find a way to optimize this at some point.
Evaluating the argument expressions again for every batch adds a lot of overhead, especially because these are always string literals (aren't they?) and therefore have a single value across all rows.

Contributor

Ah wait, the function argument is the message column.

Contributor

At least for now. Later, when we support | logfmt foo,bar this may become a problem.
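
One possible direction for the optimization discussed in this thread, sketched under simplified assumptions: arguments that report themselves as constant are evaluated once and cached, while column arguments are still evaluated per batch. The Batch, Expression, and argEvaluator types are invented for this illustration and are not part of the PR.

```go
package main

import "fmt"

// Batch is a toy stand-in for a record batch: one string per row.
type Batch []string

// Expression is a simplified interface. Constant reports whether the
// value is the same for every batch.
type Expression interface {
	Eval(b Batch) any
	Constant() bool
}

// literal counts its evaluations so the caching effect is observable.
type literal struct {
	value any
	evals int
}

func (l *literal) Eval(Batch) any { l.evals++; return l.value }
func (l *literal) Constant() bool { return true }

// column returns the batch contents and must be evaluated every time.
type column struct{}

func (column) Eval(b Batch) any { return []string(b) }
func (column) Constant() bool   { return false }

// argEvaluator caches the results of constant arguments across batches.
type argEvaluator struct {
	exprs  []Expression
	cached []any
	ready  []bool
}

func newArgEvaluator(exprs []Expression) *argEvaluator {
	return &argEvaluator{
		exprs:  exprs,
		cached: make([]any, len(exprs)),
		ready:  make([]bool, len(exprs)),
	}
}

// eval returns one value per argument, reusing cached constants.
func (a *argEvaluator) eval(b Batch) []any {
	out := make([]any, len(a.exprs))
	for i, e := range a.exprs {
		if e.Constant() {
			if !a.ready[i] {
				a.cached[i], a.ready[i] = e.Eval(b), true
			}
			out[i] = a.cached[i]
			continue
		}
		out[i] = e.Eval(b)
	}
	return out
}

func main() {
	keys := &literal{value: []string{"foo", "bar"}}
	ev := newArgEvaluator([]Expression{column{}, keys})
	for _, b := range []Batch{{"a=1"}, {"b=2"}, {"c=3"}} {
		ev.eval(b)
	}
	fmt.Println("literal evaluations:", keys.evals)
}
```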

FLOAT64 = Type(arrow.FLOAT64)
TIMESTAMP = Type(arrow.TIMESTAMP)
STRUCT = Type(arrow.STRUCT)
LIST = Type(arrow.LIST)
Contributor

Please also add it to the Type.String() function.
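
For illustration, the requested addition might resemble the sketch below. The constants here are local placeholders rather than the arrow-backed values in the diff, and the returned strings are assumptions, not the names the repository actually uses.

```go
package main

import "fmt"

// Type is a simplified stand-in. In the diff the constants are derived
// from arrow type IDs; local values are used here to stay self-contained.
type Type uint8

const (
	FLOAT64 Type = iota + 1
	TIMESTAMP
	STRUCT
	LIST
)

// String returns a readable name for t. The LIST case is the addition the
// review asks for; without it, LIST would hit the default branch.
func (t Type) String() string {
	switch t {
	case FLOAT64:
		return "FLOAT64"
	case TIMESTAMP:
		return "TIMESTAMP"
	case STRUCT:
		return "STRUCT"
	case LIST:
		return "LIST"
	default:
		return fmt.Sprintf("UNKNOWN(%d)", uint8(t))
	}
}

func main() {
	fmt.Println(LIST, Type(42))
}
```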

return tStruct{arrowType: arrowType}
}

type tList struct {
Contributor

Please also add it to the Loki->Arrow type mapping below.

case FunctionOpParseJSON:
return "PARSE_JSON"
default:
panic(fmt.Sprintf("unknown unary operator %d", t))
Contributor

Suggested change
panic(fmt.Sprintf("unknown unary operator %d", t))
panic(fmt.Sprintf("unknown variadic function operator %d", t))

commit 4e5f95f
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 13:47:59 2025 -0600

    test: fix planner tests

commit dfbdcb7
Merge: c997112 68df3ef
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 13:26:39 2025 -0600

    Merge branch 'main' into twhitney/refactor-parse

commit c997112
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 13:24:03 2025 -0600

    chore: fix linting errors

commit 037e337
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 12:54:27 2025 -0600

    test: fix field names in expression test

commit ad6b101
Merge: 79f2cea d4c53e9
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 12:46:34 2025 -0600

    Merge branch 'main' into twhitney/refactor-parse

commit 79f2cea
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 12:44:12 2025 -0600

    test: fix workflow planner test

commit 40db5ef
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 11:38:25 2025 -0600

    chore: clean up a few comments

commit ad91fda
Author: Trevor Whitney <trevorjwhitney@gmail.com>
Date:   Thu Oct 23 11:23:19 2025 -0600

    refactor: implement parse as a projection
@trevorwhitney trevorwhitney force-pushed the twhitney/refactor-parse branch from 4e5f95f to e7ead00 Compare October 24, 2025 20:04
Comment on lines +240 to +241
└── Projection all=true expand=(PARSE_JSON(builtin.message, []))
└── Projection all=true expand=(PARSE_LOGFMT(builtin.message, []))
Collaborator Author

Do we want to merge these projections? Maybe in a later PR?

Comment on lines +225 to +226
└── Projection all=true expand=(PARSE_LOGFMT(builtin.message, [bar, request_duration]))
└── Compat src=metadata dst=metadata collision=label
Collaborator Author

@chaudum does the Compat layer need to come before the Projection?
