[SIP-117] Improve SQL parsing #26786

@betodealmeida

Motivation

The current status of SQL parsing in Superset is not ideal. A few reasons:

  1. We currently use 3 libraries for parsing:
    a. sqlparse is the original library used for SQL parsing. It's non-validating and non-dialect specific, which brings a few challenges.
    b. sqloxide was introduced as an optional dependency, for code paths where performance was critical.
    c. sqlglot was introduced in #26476 ("fix(sqlparse): improve table parsing") to fix security issues.
  2. Superset has an interface to SQL parsing in superset.sql_parse.ParsedQuery, but many places in the code call sqlparse directly. This includes complex operations, like injecting RLS rules into SQL queries.
  3. ParsedQuery is used sometimes for a single statement and sometimes for multi-statement queries. Because some of its methods assume a single statement, this can lead to subtle bugs.
  4. In many functions a query will be repeatedly parsed and converted back to a string, which is inefficient.
  5. Parsing is done using a generic parser that is not dialect specific.

Superset needs robust and reliable SQL parsing, so it can understand and modify queries being run in programmatic ways. Today, many of the manipulations to SQL are done using string functions, because the sqlparse AST is too low-level. When queries are manipulated programmatically, like in RLS injection, the code is complex and hard to follow, for the same reason.

Proposed Change

My proposal is to create 2 new classes that should provide a clean interface to SQL parsing:

from __future__ import annotations  # lets SQLScript reference SQLStatement before it is defined

from typing import List


class SQLScript:
    """
    Class representing a SQL script (or block) with multiple statements.

    Examples:

        "SELECT 1; SELECT 2"  # 2 statements
        "CREATE TABLE a (b INT)"  # single statement
        ""  # 0 statements

    """

    statements: List[SQLStatement]

    def __init__(self, query: str, engine: str):
        ...


class SQLStatement:
    """
    Class representing a single SQL statement.

    Optionally in the future we might want to have derived classes for
    more specific statements, like DML or query.
    """

    def __init__(self, statement: str, engine: str):
        ...

(This is a simplified version, see #26767 for more details.)

These classes will provide:

  1. A clear differentiation between a parsed query (with multiple statements), and a single statement.
  2. Dialect-specific parsing, via a required engine argument that identifies the engine of the DB engine spec.
  3. All SQL-parsing related functionality needed by Superset, so that the Superset code will only use these classes for anything related to SQL parsing, introspection, and manipulation. No SQL parsing library should be imported anywhere else in Superset.
  4. Wrapped exceptions, so that no 3rd-party specific exceptions are bubbled up.

Initially these classes will be implemented using sqlglot, since it's fast, easy to install (pure Python), and has support for several dialects. The interface should be agnostic enough that the classes can be rewritten on top of a different parsing library in the future, if we ever need to.

New or Changed Public Interfaces

No public interfaces will be changed, but:

  1. sqlparse will be removed as a dependency.
  2. ParsedQuery will be removed, and replaced by SQLScript/SQLStatement.
  3. Indented SQL ("pretty-printed") will look different, since sqlglot formats SQL differently than sqlparse.

New dependencies

None.

Migration Plan and Compatibility

None.

Rejected Alternatives

None.

Metadata

Labels: sip (Superset Improvement Proposal)

Status: Implemented / Done