The Truth About XML was: openEHR Subversion => Github move progress
thomas.beale at oceaninformatics.com
Fri Mar 29 12:14:00 EDT 2013
On 29/03/2013 14:15, Tim Cook wrote:
> Hi Tom,
> I have amended the Subject Line since the thread has diverged a bit.
> [comments inline]
> On Thu, Mar 28, 2013 at 9:55 AM, Thomas Beale
> <thomas.beale at oceaninformatics.com> wrote:
>> one of the problems with LinkEHR (which does have many good features) is
>> that it is driven off XSD. In principle, XSD is already a deformation of any
>> but the most trivial object model, due to its non-OO semantics. As time goes
>> on, it is clear that the XSD expression of data models like openEHR, 13606
>> etc will be more and more heavily optimised for XML data. This guarantees
>> such XSDs will be a further deformation of the original object model - the
>> view that programmers use.
> I agree with you that you cannot represent an object model, fully, in
> XML Schema language.
> However, you seem to promote the idea that object oriented modelling
> is the only information modelling approach.
> This is a critical failure. There are many ways to engineer software
> using many different modelling approaches.
> So abstract information modelling, as you have noted, does not
> necessarily fit all possible software modelling approaches and it is
> unrealistic to think that it does. In designing the openEHR model you
> chose to use object oriented modelling. The openEHR reference
> implementation uses a rather obscure, though quite pure,
> implementation language, Eiffel. I think that history has shown that
> this has caused some issues in development in other object oriented
> languages.
I don't see any problem here. The extant open 'reference implementation'
of openEHR has been in Java for years now, and secondarily in Ruby
(openEHR.jp <http://openehr.jp/>) and C# (codeplex.com
<http://openehr.codeplex.com/>). The original Eiffel prototype was from
nearly 10 years ago and was simply how I prototyped things from the GEHR
project, while other OO languages matured.
I am not sure that we have suffered any critical failure - can you point
to a specific example?
>> So now if you build archetypes based on the XSD,
>> you are not defining models of object data that software can use (apart from
>> the low layer that deals with XML data conversion). I am unclear how any
>> tool based on XSD can be used for modelling object data (and that's nearly
>> all domain data in the world today, due to the use of object-oriented
>> programming languages).
> I think that if you look, you will find that "nearly all of the domain
> data in the world" exists in SQL models, not object oriented models.
> So this is a rather biased statement designed to fit your message.
> Not a representation of reality.
ok, so I'll clarify what I meant a bit: most domain (i.e. industry
vertical) applications are being written in object languages these days
- Java, Python, C#, C++, Ruby, etc. The software developer's view of
the data is normally via the 'class' construct of those languages. You
are right of course that the vast majority of the data physically
resides in some RDBMS or other. However, the table view isn't the
primary 'model' of the data for, I would guess, the majority of software
systems development these days. There are of course major exceptions -
systems written totally or mainly in SQL stored procedures or whatever,
but new developments don't tend to go this route. In terms of sheer
amount of data, these latter systems are probably still in the majority
- since tax databases, military systems, legacy bank systems etc. are
written this way, but in terms of numbers of software projects, I am
pretty sure the balance is heavily in the other direction.
> That said, the abstract concept of multi-level modelling, where there
> is the separation of a generic reference model from the domain concept
> models is very crucial. Another crucial factor is implementability; as
> promoted by the openEHR Foundation mantra, "implementation,
> implementation, implementation".
> The last and possibly most crucial issue relates to implementability,
> which is the availability of a talent pool and tooling. In order to
> attract more than a handful of users to a technology there needs to
> exist some level of talent as well as robust and commonly available
> tooling.
> The two previous paragraphs are the reasons that the Multi-Level
> Healthcare Information Modelling (MLHIM) project exists.
well, since the primary openEHR projects are in Java, Ruby, C#, PHP,
etc, I don't see where the disconnect between the projects and the
talent pool is. I think if you look at the 'who is using it' pages
<http://www.openehr.org/who_is_using_openehr/>, and also the openEHR
Github projects <https://github.com/openEHR>, you won't find much that
doesn't connect to the mainstream.
> MLHIM is modeled from the ground up around the W3C XML Schema Language 1.1.
> The reason for this is that the family of XML technologies are the
> most ubiquitous tools throughout the global information processing
> domain today. There is a significant number of open source and
> proprietary tools from parser/validators to various levels of editors,
> readily available. While serious XML development is not taught in all
> university computer science programs, every student does get
> introduced to XML in some manner.
> The relationship of XML with emerging knowledge modelling tools like
> Protégé in languages such as OWL and vocabularies expressed in
> RDF/XML is an obvious advantage. There is an enormous skills pool
> available for using XML data with REST APIs and in translating XML to
> JSON for over-the-wire communications. There are thousands of websites
> with information on how to do these things. It is irrelevant which
> programming language you choose to use; Java, Eiffel, Ruby, Lua,
> Python, etc. there are XML binding tools and access to XML validators.
> There are tried and true methods of storing XML data in SQL databases,
> XML databases and NoSQL databases. XQuery and XPath are very robust
> and well known. Another big advantage is having the ability to do data
> validation using commonly available tools in a complete path; from the
> instance data to concept model to the reference model to the W3C XML
> Schema specification to the W3C XML specification.
<NB: in the below I am talking about the industry standard XSD 1.0, not
the 9-month old XML Schema 1.1 spec>
well I don't really have anything to add to any of that. For the moment,
industry (including openEHR, which has published XSDs for all its models
for years now) is still using XML, although one has to wonder how long
that will go on.
But XML schema as an /information modelling/ language has been of no
serious use, primarily because its inheritance model is utterly broken.
There are two competing notions of specialisation - restriction and
extension. Restriction is not a tool you can use in object-land, because
there the semantics are additive down the inheritance hierarchy, but you
can of course try to use it for constraint modelling. However, it is
generally too weak for anything serious, and most projects I have seen
going this route eventually give in and build tools to interpolate
Schematron statements to do the job properly. Now you have two
languages, plus you are mixing object (additive) and constraint
(subtractive) semantics.
Add to this the fact that the inheritance rules for XML attributes and
elements are different, and you have a modelling disaster area.
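The clash between the two notions of specialisation can be sketched in ordinary OO code (a hedged illustration in Python; the class names are invented, not from any openEHR model). Extension adds to a type and preserves substitutability; XSD-style restriction narrows what the "subtype" accepts, which is exactly what breaks when you read a schema as an object model:

```python
from dataclasses import dataclass

# OO-style extension: the subtype ADDS a field; anything written
# against Quantity still works on a PreciseQuantity.
@dataclass
class Quantity:
    value: float
    units: str

@dataclass
class PreciseQuantity(Quantity):
    precision: int = 0

# XSD-style restriction: the 'subtype' NARROWS the value space, so
# data that is valid for the parent type is rejected by the child -
# the opposite of additive OO inheritance semantics.
@dataclass
class PositiveQuantity(Quantity):
    def __post_init__(self):
        if self.value <= 0:
            raise ValueError("restriction violated: value must be > 0")

PreciseQuantity(120.0, "mm[Hg]", 0)      # extension: substitutable
try:
    PositiveQuantity(-1.0, "mm[Hg]")     # restriction: valid parent data fails
except ValueError as e:
    print(e)
```

A consumer holding a `Quantity` reference can safely be handed the extended type, but not the restricted one - which is why restriction only makes sense as a constraint formalism, not as type inheritance.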
James Clark, designer of Relax NG, sees inheritance in XML as a design
flaw (from http://www.thaiopensource.com/relaxng/design.html#section:15 ):
... The support for inheritance in W3C XML Schema is
probably the major contributor to the considerable complexity of W3C XML
Schema Part 1. Yet, the inheritance mechanisms in W3C XML Schema do not
allow W3C XML Schema to express any constraints that cannot be expressed
in RELAX NG. Although W3C XML Schema has a very complex type system with
two type hierarchies, one for elements (called substitution groups) and
one for complex types, it supports only single inheritance. However,
modern object-oriented languages, such as Java and C#, support multiple
inheritance (at least for interfaces). Thus, in general the inheritance
structure of a class hierarchy cannot be represented in a schema.
Inheritance has proven to be very useful in modeling languages such as
UML. However, I would argue that trying to make an XML schema language
also be a modeling language is not a good idea. An XML schema language
has to be concerned with syntactic details, such as whether to use
elements or attributes, which are irrelevant to the conceptual model.
Instead, I believe it is better to use a standard modeling language such
as UML, which provides full multiple inheritance, to do conceptual
modeling, and then generate schemas and class definitions from the model.
Difficulties in using type restriction (i.e. subtyping) in XSD seem to
be well-known.
Not to mention the inability to deal with generic types of any kind,
e.g. Interval<Date>, necessitating the creation of numerous fake types.
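What "fake types" means in practice: an OO language expresses Interval<Date>, Interval<Quantity> and so on with one parameterised class, whereas XSD, lacking parametric types, forces a separate hand-written complexType per instantiation. A minimal sketch (Python; the names are illustrative only):

```python
from dataclasses import dataclass
from datetime import date
from typing import Generic, TypeVar

T = TypeVar("T")

# One generic class covers every Interval<X> the object model needs.
@dataclass
class Interval(Generic[T]):
    lower: T
    upper: T

    def contains(self, v: T) -> bool:
        return self.lower <= v <= self.upper

span = Interval(date(2013, 1, 1), date(2013, 12, 31))
print(span.contains(date(2013, 3, 29)))   # True

# In XSD the same model needs one standalone complexType per parameter,
# e.g. (hypothetical names):
#   <xs:complexType name="DATE_INTERVAL"> ... </xs:complexType>
#   <xs:complexType name="QUANTITY_INTERVAL"> ... </xs:complexType>
# i.e. N duplicated 'fake' types for one generic concept.
```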
And of course, the underlying inefficiency and messiness of the data are
serious problems as well. Google and Facebook (and I think Amazon) don't
move data around internally in XML for this reason.
None of this is to say that XML or XML-schema can't be 'used' - I don't
know of any product or project in openEHR space that doesn't use it
somewhere, and of course it's completely ubiquitous in the general IT
world. What I am saying here is that the minute you try to express your
information model primarily in XSD, you are in a world of pain.
My lessons from projects using XSD are:
* XSDs are good for one thing: describing the contents of XML
documents. That's it.
o but what we need are models that can describe data, software,
documents, documentation, interfaces, etc
* get imported data out of XML as soon as possible, and into a
tractable computational formalism
* treat XSDs as interface specifications, to be generated from the
underlying primary information models, not as any kind of primary
expression in their own right
* define XSDs with as little inheritance as possible; avoid subtyping,
i.e. define types as standalone, regardless of the duplication
* maximise the space optimisation of the data, no matter what it
takes. This usually requires all kinds of tricks - heavy use of XML
attributes, structure flattening from the object model and so on. If
you don't do this, any XML data storage or processing will cost twice
what it should, and web services using XML will be horribly slow.
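The "get imported data out of XML as soon as possible" point amounts to a thin boundary layer. A hedged sketch (the element and attribute names are invented, not from any openEHR schema): incoming XML is converted at the edge into plain objects, and nothing past that boundary ever sees XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Plain object the rest of the system computes on - XML never
# travels past this boundary.
@dataclass
class Observation:
    code: str
    value: float
    units: str

def parse_observation(xml_text: str) -> Observation:
    root = ET.fromstring(xml_text)
    return Observation(
        code=root.get("code"),               # XML attribute
        value=float(root.findtext("value")), # child elements
        units=root.findtext("units"),
    )

obs = parse_observation(
    '<observation code="8480-6"><value>120</value><units>mm[Hg]</units></observation>'
)
print(obs)   # Observation(code='8480-6', value=120.0, units='mm[Hg]')
```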
I know there are all kinds of tricks to mitigate these problems - I've
seen a lot of them. The fact that there is a mini-tech sector around XSD
problem mitigation / optimisation testifies to the difficulty of this
area.
XML Schema 1.1 introduces useful things that may reduce some of the
above problems (good overview here); however, as far as I can tell, its
inheritance model is not much better than XSD 1.0's (although you can
now inherit attributes properly, which is an improvement).
> With XML Schema Language 1.1, we have the ability to build complex
> structures using substitution groups and do very intricate data
> analysis and validation, across models, using XPath in assert
> statements. All without having to resort to RelaxNG or Schematron.
> There are also tools and experience in using XML Schemas to
> automatically generate generic XForms for presentation and data entry.
> So maybe we had to make concessions in deciding to use XML technology
> in MLHIM. However, I cannot think of anything that is missing at this
> point.
well I guess the main thing is seamlessness between your information
model and your programming model view. I am not saying it's the only
way, but the approach in openEHR was oriented towards making sure that
expressions of the information model, including all its semantics, are
as close as possible to the software developer's programming model. If
we had done the primary specifications in XML, there would always be a
significant disconnect between the models and the software (actually,
the specs would have been nearly impossible to write). Not to mention
that life would be hard working with all the other data formats now in
use, including JSON and various binary formats.
An approach that has emerged in industrial openEHR systems in the last
few years is to /generate/ message XSDs from templates - one XSD per
template, and write a generic XML <=> canonical data conversion gateway.
This means we can do all modelling in powerful formalisms like UML 2,
EMF/Ecore (for the information models) and all constraint modelling in
ADL / AOM 1.5, and treat XML as one possible data transport.
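A generic XML <=> canonical-data gateway of the kind described can be very small. This sketch (invented structure, standard library only - not any product's actual gateway) turns arbitrary incoming XML into a neutral nested-dict form that downstream code or a JSON serialiser consumes, so XML really is just one transport among several:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_canonical(elem):
    """Recursively convert an Element into a transport-neutral dict."""
    node = dict(elem.attrib)                      # attributes become keys
    for child in elem:
        node.setdefault(child.tag, []).append(xml_to_canonical(child))
    text = (elem.text or "").strip()
    if text:
        node["_text"] = text                      # keep element text
    return node

doc = ET.fromstring('<entry id="e1"><item>a</item><item>b</item></entry>')
canonical = xml_to_canonical(doc)
print(json.dumps(canonical))
# {"id": "e1", "item": [{"_text": "a"}, {"_text": "b"}]}
```

The same canonical form can then be emitted as JSON, a binary format, or re-serialised to XML against a template-generated XSD.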
From what I can see, the major direction in information modelling for
the future will be Eclipse Modelling Framework, using Ecore-based
models. This is where I think the computational expression of openEHR's
Reference Model will move to. The OHT Model Driven Health Tools (MDHT)
project is already showing the way on this, at the same time adopting
ADL 1.5 concepts for constraint modelling.
I have no experience with XSD 1.1, and I think it will be years before
mainstream industry catches up with it. But it may be that it does what
you say.
> I want to close by saying that I am grateful for the work done in the
> openEHR community. In my more than ten years of involvement with five
> years on the ARB, I learned a lot! I learned how to do things right
> as well as what can go wrong. MLHIM represents those lessons learned.
we'll obviously differ on our analysis of what is the best modelling
formalism. The above are the conclusions I have come to over the years.
Others may have other, better ideas, and it may be that an XSD 1.1
modelling effort in openEHR could make sense.
I think the key thing would have been to ensure that the archetypes
could be shared across openEHR and MLHIM. Archetypes are pretty widely
used these days, and there are many projects now creating them. I don't
know if this is still possible; if not, it presents clinicians with a
dilemma: model in ADL/AOM, or model in MLHIM? Replicated models aren't
fun to maintain...
More information about the openEHR-clinical mailing list