[PoC]: Database level Data Integration#1420

Draft
BobdenOs wants to merge 1 commit into main from feat/data-integration


Concept

Wouldn't it be nice if your database could connect to other systems and use their data? With SAP HANA Cloud this is possible through a virtual table. It functions like a view on top of a table/view in a remote source. Under the hood, a remote source is backed by a specific ODBC driver. One of the more interesting drivers is the OData driver, which doesn't use a proprietary database connection, but instead relies on HTTP to fetch the data out of the remote system.

With the current INSERT and UPSERT implementations being JSON based, the obvious next step was to check whether Postgres and SQLite could trigger an HTTP request themselves. For SQLite, whose custom functions are plain JavaScript functions, this is clearly possible. Postgres has a large extension ecosystem, so of course pgsql-http exists.

Proof of Concept

The first thing we need is the "other" system. Simply run cds export on the catalog service of the bookshop.

cd test/bookshop
cds export ./srv/cat-service.cds

Next, consume the data product in the test model.

using {CatalogService} from '../../test/bookshop/apis/CatalogService';

service integration {
  entity Genres as projection on CatalogService.Genres;
  entity Books  as projection on CatalogService.ListOfBooks;
}

Using the @data.product annotation it is possible to identify the entities that are located in the "other" system. These entities are additionally annotated with cds.persistence.exists, so the compiler expects them to already be deployed. The deployer then creates a view which uses the HTTP function of the database to download the data.
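As a rough sketch, assuming CSN's convention of storing annotations as @-prefixed properties on each definition, picking out the data-product entities could look like this (the model literal and entity names are illustrative, not the actual bookshop model):

```javascript
// Sketch: collect entities flagged with @data.product from a CSN model.
// The inline model below is a stand-in for a real compiled CSN.
const csn = {
  definitions: {
    'integration.Books':  { kind: 'entity', '@data.product': true, '@cds.persistence.exists': true },
    'integration.Genres': { kind: 'entity', '@data.product': true, '@cds.persistence.exists': true },
    'integration.Local':  { kind: 'entity' },
  },
}

// CSN stores annotations as plain properties prefixed with '@'
const dataProducts = Object.entries(csn.definitions)
  .filter(([, def]) => def.kind === 'entity' && def['@data.product'])
  .map(([name]) => name)

console.log(dataProducts) // → ['integration.Books', 'integration.Genres']
```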

SELECT * FROM http('http://localhost:4004/browse/ListOfBooks')->>'$.value'

If only it were so simple. All the JSON functions produce string values, as types have to be consistent. This is where the INSERT logic comes into play: by looking at the elements of the entity it is possible to use the input converters to create the correct SQL data type out of the JSON string values, allowing the database to process the data as if it came from a native view.

SELECT cast(value->>'$.ID' as Integer) AS ID, ... FROM json_each(http('http://localhost:4004/browse/ListOfBooks')->>'$.value')
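The conversion step can be sketched in plain JavaScript: given the entity's elements, emit one cast per column. The cds-to-SQL type map and helper name below are simplifying assumptions, not the actual @cap-js/cds-dbs input converters:

```javascript
// Sketch: derive the typed SELECT list from an entity's elements.
// sqlType is a simplified assumption of the cds-to-SQL type mapping.
const sqlType = {
  'cds.Integer': 'Integer',
  'cds.String':  'Text',
  'cds.Decimal': 'Decimal',
  'cds.Boolean': 'Boolean',
}

const typedSelect = (elements, source) => {
  const cols = Object.entries(elements)
    .map(([name, { type }]) => `cast(value->>'$.${name}' as ${sqlType[type]}) AS ${name}`)
    .join(', ')
  return `SELECT ${cols} FROM json_each(${source})`
}

const sql = typedSelect(
  { ID: { type: 'cds.Integer' }, title: { type: 'cds.String' } },
  `http('http://localhost:4004/browse/ListOfBooks')->>'$.value'`,
)
console.log(sql)
// SELECT cast(value->>'$.ID' as Integer) AS ID, cast(value->>'$.title' as Text) AS title FROM json_each(http('http://localhost:4004/browse/ListOfBooks')->>'$.value')
```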

You might at this point have started to ask "why?", as there is also service level data integration which does pretty much the same thing. Well, @cap-js/cds-dbs supports a lot of advanced features which are non-trivial to re-implement in the JavaScript layer. Additionally, any JavaScript implementation of these features wouldn't perform well: it would cost a lot of CPU cycles and a lot of memory to achieve the same results. HANA, Postgres and SQLite are all written in C/C++ with a primary focus on optimizing relational data manipulation. So when a query has to do a where, join, group by, order by, path expression or expand, the databases have all the optimized tools available in native implementations.

// postgres / sqlite
await cds.ql`SELECT FROM ${Books} { * }`
// [odata] - GET /browse/ListOfBooks
await cds.ql`SELECT FROM ${Books} { *, genre { * } }`
// [odata] - GET /browse/ListOfBooks
// [odata] - GET /browse/Genres

// SAP HANA Cloud OData remote source behavior
await cds.ql`SELECT FROM ${Books} { * }`
// [odata] - GET /browse/ListOfBooks
await cds.ql`SELECT FROM ${Books} { ID }`
// [odata] - GET /browse/ListOfBooks?$select=ID
await cds.ql`SELECT FROM ${Books} { *, genre { * } }`
// [odata] - GET /browse/ListOfBooks?$expand=genre

So while this is a very simple solution, it provides a lot of power. Additionally, in the case of Postgres a more robust implementation is possible using foreign data wrappers, which come with the same benefits (and drawbacks) as SAP HANA Cloud remote sources.

FYI

In the case of SQLite the http function is defined as deterministic. Depending on the interpretation of the word deterministic the function could be called only a single time, but in reality the function will be called once per query. Therefore it is safe to define it as deterministic, with the additional benefit that it will only be called once for an expand instead of once for each row.
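The practical effect of the deterministic flag can be illustrated by memoizing on the argument, which is roughly what the engine's once-per-query reuse amounts to. The counter and stubbed response here are for illustration only; they stand in for the real HTTP call:

```javascript
// Sketch: a deterministic function may be evaluated once per distinct
// argument within a query, so memoize and count the actual calls.
let calls = 0
const fetchStub = url => { calls++; return '{"value":[]}' } // stands in for the real HTTP request

const cache = new Map()
const http = url => {
  if (!cache.has(url)) cache.set(url, fetchStub(url))
  return cache.get(url)
}

// An expand touching many rows would invoke http() once per row...
for (let row = 0; row < 100; row++) http('http://localhost:4004/browse/Genres')

// ...but the underlying request fires only once
console.log(calls) // → 1
```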
