Archetype relational mapping - a practical openEHR persistence solution

Bert Verhees bert.verhees at rosa.nl
Fri Feb 12 14:49:07 EST 2016


On 12-02-16 18:26, Erik Sundvall wrote:
> if you are experimenting with open source native XML DBs for openEHR, 
> it performed well for "clinical" patient-specific querying even though 
> all xml databases we tested were not suitable for ad hoc 
> epidemiological population queries (without query specific indexing)


A very interesting paper. I have some initial opinions on it, but first 
I need to explain how I think about the matter.
I have not prepared the story below, so there may be things I write too 
hastily. See it as a provisional view, not as a hard opinion.
---------
There are relational database configurations for OLAP and for OLTP. The 
combination is hard to find, and there are reasons for that; it is a 
classic problem.

You need specific indexes for data mining (OLAP), and for every extra 
data-mining query you need more indexes, especially if you don't want to 
wait a night for the result. Those extra indexes stand in the way of 
transactional processing (OLTP) because they need to be updated on every 
write, which is an unnecessary burden for the OLTP processes; the longer 
the database exists, the heavier that burden becomes.
That is why OLAP and OLTP are rarely combined in one configuration.

That is why many professional databases have extra features for OLAP. 
Years ago I worked with Oracle for a big Dutch television company, and 
my main job was to create indexes for marketing purposes.
We ran those unusual queries during the night and stored the results in 
special tables, which Oracle calls "materialized views".
The next day those views were processed in analysis software such as 
SPSS, and afterwards thrown away.
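
A minimal sketch of that overnight pattern, assuming PostgreSQL (which 
has the same materialized-view feature as Oracle) and psycopg2; all 
table and column names are invented for illustration:

    import psycopg2

    # Run the expensive analytical query once, overnight, and store
    # the result as a table that analysts can read the next day.
    conn = psycopg2.connect("dbname=marketing")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS opera_prospects AS
            SELECT p.person_id
            FROM persons p
            JOIN interests i ON i.person_id = p.person_id
            WHERE i.topic = 'opera' AND p.estimated_income > 50000
        """)
        cur.execute("REFRESH MATERIALIZED VIEW opera_prospects")  # nightly job
    conn.close()

The OLTP side never pays for this query, except during the nightly 
refresh.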

It was a database with 900,000 persons in it, and every person had a 
long history: web activity, personal interests, etc.
"How much interest does this person have in opera, can he afford opera 
tickets, and is it worth calling him with a ticket offer? We cannot 
call 900,000 persons."
These were complex queries based on what the customer bought, which 
television programs he was interested in, and his web activities.
That was the kind of thing they did with the database.

So this compares well with a medical database in terms of size and 
complexity.

The same difficulties apply to XML databases. That is why XML databases 
also have features for creating extra indexes.
Oracle, by the way, when it knows the structure of the XML (via an 
XSD), shreds the XML behind the scenes into relational data and stores 
it in a relational database. It also converts XQuery to SQL.
In this way it has the weaknesses and advantages of a relational 
database, and it needs the extra indexes for unusual queries, but from 
the developer's point of view it is XML.
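
In miniature, the idea looks like this (my own illustration, not 
Oracle's actual mechanism; the element and table names are invented):

    import sqlite3
    import xml.etree.ElementTree as ET

    # XML whose structure is known in advance (here hard-coded, in
    # Oracle's case derived from the XSD) is "shredded" into typed
    # relational rows.
    doc = ET.fromstring(
        "<observation><code>systolic</code><magnitude>120</magnitude></observation>"
    )

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE observation (code TEXT, magnitude REAL)")
    conn.execute(
        "INSERT INTO observation VALUES (?, ?)",
        (doc.findtext("code"), float(doc.findtext("magnitude"))),
    )
    # An XQuery like //observation[magnitude > 100] becomes plain SQL:
    print(conn.execute(
        "SELECT code FROM observation WHERE magnitude > 100"
    ).fetchall())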
-------
Comparing XML and relational storage for openEHR, I favour XML, because 
it can easily reflect the structures which need to be stored. It makes 
the data-processing layer less complex. There is a lot of tooling 
around XML: XML Schema to make the database structure known to Oracle, 
Schematron to validate against archetypes. This is very mature 
software, so the complexity was solved years ago and is well tested. It 
is hidden complexity, and mature hidden complexity is no problem.
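
To give an idea of how mature that tooling is: validating an instance 
against a Schematron rule takes a few lines with lxml. The rule below 
is a made-up constraint of the kind an archetype might impose, not a 
real openEHR archetype rule:

    from lxml import etree
    from lxml.isoschematron import Schematron

    # A made-up Schematron rule: the magnitude must lie in 0..300.
    SCH = etree.XML(b"""
    <schema xmlns="http://purl.oclc.org/dsdl/schematron">
      <pattern>
        <rule context="observation/magnitude">
          <assert test="number(.) &gt;= 0 and number(.) &lt;= 300">
            magnitude out of range
          </assert>
        </rule>
      </pattern>
    </schema>
    """)

    doc = etree.XML(b"<observation><magnitude>120</magnitude></observation>")
    print(Schematron(SCH).validate(doc))  # True: 120 is within range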
--------
And if you want to do data mining, such as epidemiological research, 
and you have the time to plan the research, then the classical 
database, XML or relational, is OK.

In my opinion there is rarely a real need for ad hoc data-mining 
(epidemiological research) queries with results available in a few 
minutes. Such studies are always planned, and creating the indexes and 
storing the results in "materialized views" is part of the work one has 
to do for data-mining research on data.
So I don't think there is a real need for this.
--------
Regarding XML databases: Oracle has a solution which can perform well 
if it is professionally maintained.
That is often the sticking point, because professional Oracle 
maintenance for advanced use is very expensive.
Another vendor is MarkLogic. It is said that MarkLogic is better, but I 
don't know that from my own experience.
Both are free to use for developers.
Think of figures up to 35,000 euro a year for licenses, which is not 
very much for a big hospital, but a lot for a small health service.
The open source XML database eXist-db is not very good for data mining, 
in my personal experience.

So we must ask ourselves: are we solving a problem that no one experiences?
--------
openEHR has a few advantages here. Data are immutable: never changed, 
never deleted. This makes a few difficult steps unnecessary.
The Dewey numbering concepts look very attractive, although they were 
also designed with deleting and changing data in mind.
That is very important for normal company use.

But, as said, we don't have that in openEHR. Medical data only grow; it 
is always new data. An event that has passed will never change.
The only things that change (in openEHR they are versioned, but from 
the user's perspective they change) are demographic data. And one can 
live with that: create extra provisions for that demographic section of 
the database, which is only a small part of the complete database.
Often the demographic database is external anyway.

So, my thought: maybe Dewey is too good, and path-value storage (paths 
to leaf nodes) is enough.
The paths, combined with IDs, form the keys, and they look much like 
XPath, so it is easy to store XML in a path-value database.
And querying is also easy, because all queries are path based. A sketch 
follows below.
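
A minimal sketch of that flattening step, assuming the composition 
arrives as XML; the sample document and IDs are invented:

    import xml.etree.ElementTree as ET

    def flatten(elem, prefix=""):
        """Yield (path, text) pairs for every leaf node under elem."""
        path = f"{prefix}/{elem.tag}"
        children = list(elem)
        if not children:
            yield path, (elem.text or "").strip()
        for child in children:
            yield from flatten(child, path)

    doc = ET.fromstring(
        "<composition><observation>"
        "<code>systolic</code><magnitude>120</magnitude>"
        "</observation></composition>"
    )
    for path, value in flatten(doc):
        print("comp-001", path, value)
    # comp-001 /composition/observation/code systolic
    # comp-001 /composition/observation/magnitude 120

A real implementation would also need positional indexes in the paths 
(like XPath's [1], [2]) to keep repeated sibling nodes apart.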

I think this is the fastest solution for openEHR. But maybe I am 
overlooking something.
Maybe I am saying something stupid. I am not offended if you say so 
(maybe in other words ;-).
We all want and need to learn.

What we still need to do to build this solution is handle the XQuery 
grammar and make it run on the path-based database.
This is not very easy, but also not very hard. Maybe the algorithms can 
already be found.
Like the Dewey algorithm, this can run on any database, including free 
open source ones.
I think you would get excellent performance on Postgres.

A path-value database is easy to index; it only needs a few indexes, so 
inserting will always stay very fast. See the sketch below.
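
Something like this, using SQLite here only to keep the sketch 
self-contained; the same schema and indexes carry over to Postgres, and 
all names are invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE path_value (
            ehr_id         TEXT,
            composition_id TEXT,
            path           TEXT,
            value          TEXT
        );
        -- one index for population queries (path first) ...
        CREATE INDEX idx_path_value ON path_value (path, value);
        -- ... and one for per-patient queries (ehr_id first)
        CREATE INDEX idx_ehr_path ON path_value (ehr_id, path);
    """)
    conn.execute(
        "INSERT INTO path_value VALUES (?, ?, ?, ?)",
        ("ehr-42", "comp-001", "/composition/observation/magnitude", "120"),
    )
    # A path-based query: which patients have a magnitude above 100?
    print(conn.execute(
        "SELECT DISTINCT ehr_id FROM path_value"
        " WHERE path = '/composition/observation/magnitude'"
        " AND CAST(value AS REAL) > 100"
    ).fetchall())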

Let's do some calculations, for fun.
How many path-value combinations do you have to store for an average 
composition? Maybe 30? How many compositions does an average patient 
have? 10,000?
So every patient needs 300,000 path-values, and you can store 10,000 
patients in 3 billion records.
That is not much for Postgres, with the few simple indexes needed.
When you need to store 900,000 patients, you need 90 such tables, each 
holding 10,000 patients.
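
Spelled out (all inputs are the rough estimates above, not 
measurements):

    # Back-of-the-envelope sizing with the estimates from the text.
    paths_per_composition = 30
    compositions_per_patient = 10_000
    rows_per_patient = paths_per_composition * compositions_per_patient
    print(rows_per_patient)                       # 300,000 per patient

    patients_per_table = 10_000
    print(rows_per_patient * patients_per_table)  # 3,000,000,000 rows per table

    total_patients = 900_000
    print(total_patients // patients_per_table)   # 90 tables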

Very cheap, very fast, also for ad hoc queries, and easy to accomplish, 
I think.

I am very interested in opinions on this.

Thanks
Bert





