[SIP-117] Improve SQL parsing
Motivation
The current status of SQL parsing in Superset is not ideal. A few reasons:
- We currently use 3 libraries for parsing:
  a. sqlparse is the original library used for SQL parsing. It's non-validating and non-dialect specific, which brings a few challenges.
  b. sqloxide was introduced as an optional dependency, where performance was critical.
  c. sqlglot was introduced in fix(sqlparse): improve table parsing #26476 to fix security issues.
- Superset has an interface to SQL parsing in superset.sql_parse.ParsedQuery, but many places in the code call sqlparse directly. This includes complex operations, like injecting RLS rules into SQL queries.
- ParsedQuery is sometimes used for single statements, sometimes for multi-statement queries, which could lead to subtle bugs, since it has methods that assume a single statement.
- In many functions a query will be repeatedly parsed and converted back to a string, which is inefficient.
- Parsing is done using a generic parser that is not dialect specific.
Superset needs robust and reliable SQL parsing, so it can understand and modify queries being run in programmatic ways. Today, many of the manipulations to SQL are done using string functions, because the sqlparse AST is too low-level. When queries are manipulated programmatically, like in RLS injection, the code is complex and hard to follow, for the same reason.
Proposed Change
My proposal is to create 2 new classes that should provide a clean interface to SQL parsing:
    # Postpone annotation evaluation so SQLScript can reference SQLStatement
    # before that class is defined.
    from __future__ import annotations

    from typing import List


    class SQLScript:
        """
        Class representing a SQL script (or block) with multiple statements.

        Examples:

            "SELECT 1; SELECT 2"      # 2 statements
            "CREATE TABLE a (b INT)"  # single statement
            ""                        # 0 statements
        """

        statements: List[SQLStatement]

        def __init__(self, query: str, engine: str):
            ...


    class SQLStatement:
        """
        Class representing a single SQL statement.

        Optionally in the future we might want to have derived classes for
        more specific statements, like DML or query.
        """

        def __init__(self, statement: str, engine: str):
            ...

(This is a simplified version, see #26767 for more details.)
These classes will provide:
- A clear differentiation between a parsed query (with multiple statements), and a single statement.
- Dialect-specific parsing, via a required attribute indicating the DB engine spec engine.
- All SQL-parsing related functionality needed by Superset, so that the Superset code will only use these classes for anything related to SQL parsing, introspection, and manipulation. No SQL parsing library should be imported anywhere else in Superset.
- Wrapped exceptions, so that no 3rd-party specific exceptions are bubbled up.
Initially these classes will be implemented using sqlglot, since it's fast, easy to install (pure Python), and has support for several dialects. The interface should be agnostic enough that the classes can be rewritten on top of a different parsing library in the future, if we ever need to.
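To make the "wrapped exceptions" and "library-agnostic interface" points concrete, here is a minimal sketch. It is not the implementation in #26767. The names SQLParsingError, Backend, and naive_backend, as well as the extra backend argument, are hypothetical and exist only for this illustration. The toy backend splits on semicolons and ignores the engine; a real backend would delegate to a dialect-aware parser.

```python
from dataclasses import dataclass
from typing import Callable, List


class SQLParsingError(Exception):
    """Hypothetical Superset-level error, so callers never see backend exceptions."""


# Hypothetical backend contract: (script, engine) -> list of statement strings.
Backend = Callable[[str, str], List[str]]


def naive_backend(script: str, engine: str) -> List[str]:
    """Toy stand-in for a real parser.

    Splits on ';' and ignores the engine. It does NOT handle semicolons
    inside string literals or comments, so it is for illustration only.
    """
    if script.count("'") % 2:
        raise ValueError("unterminated string literal")
    return [part.strip() for part in script.split(";") if part.strip()]


@dataclass(frozen=True)
class SQLStatement:
    """A single statement, always tied to the engine it was parsed for."""

    sql: str
    engine: str


class SQLScript:
    """Zero or more statements parsed from one script."""

    def __init__(self, query: str, engine: str, backend: Backend = naive_backend):
        try:
            parts = backend(query, engine)
        except Exception as ex:
            # Wrap whatever the backend raised; the original error is kept
            # as __cause__ for debugging.
            raise SQLParsingError(f"Unable to parse script for {engine!r}") from ex
        self.statements: List[SQLStatement] = [
            SQLStatement(part, engine) for part in parts
        ]
```

With this shape, code elsewhere would catch only SQLParsingError and iterate over script.statements, so swapping the parsing library would mean replacing the backend callable rather than touching every call site.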
New or Changed Public Interfaces
No public interfaces will be changed, but:
- sqlparse will be removed as a dependency.
- ParsedQuery will be removed, and replaced by SQLScript/SQLStatement.
- Indented SQL ("pretty-printed") will look different, since sqlglot formats SQL differently than sqlparse.
New dependencies
None.
Migration Plan and Compatibility
None.
Rejected Alternatives
None.