Monday, September 03, 2007

Around the globe with composable data services

Ayende and Alex have been having an interesting conversation on the subject of data access layer componentization, in the light of some new features that are appearing in Microsoft's Entity Framework and some previous work by Alex on Base4.NET.

You can find some of the most relevant posts of the conversation here, here and here.

I read Ayende's answer last night and Alex's answer this morning. I was about to write a comment, but it grew too long. Alex lives in New Zealand and Ayende, I think, lives in Israel. I am writing this at almost 3:00 PM (GMT-4, Caribbean time). I hope they are sleeping right now, so I will have time to do the usual editing after publishing!

I agree with Alex that most of Ayende's concerns could be addressed by the composable EFx data services Alex envisions (he actually prefers to use the term "dataservers", but I think it is opportune to borrow some jargon from Astoria).

Note: In this case we have used the terms "composable" and "componentization" in the sense that the service can aggregate information from multiple backends under a single conceptual model. Maybe we should find a more explicit term to avoid overlaps with the use of "composability" elsewhere.

Looking at Ayende's diagram, I agree he did not get the complete picture Alex was painting. In his favor, one must admit that the composable data services Alex talks about are still not even "vaporware".

I really just want to add two elements to the conversation:

First, if you added caching of read-only data as a feature of the data service, you would get a better substitute for the ETL process that Ayende mentions (Note To Alex: You can consider this a feature request!).
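
To make that feature request a little more concrete, here is a minimal sketch of what I mean by read-only caching inside the data service. All the names are hypothetical; none of this is an actual EFx or Astoria API:

```python
import time

class ReadOnlyCache:
    """Minimal read-through cache for data the service treats as read-only.

    `fetch` is any callable that pulls the data from the backing store;
    the names here are made up for illustration only.
    """

    def __init__(self, fetch, ttl_seconds=3600):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (timestamp, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and time.time() - entry[0] < self._ttl:
            return entry[1]          # serve from the cache, no round trip
        value = self._fetch(key)     # refresh from the source system
        self._entries[key] = (time.time(), value)
        return value

# The data service would keep one cache like this per read-only entity set,
# instead of every consumer maintaining its own ETL-fed copy.
products = ReadOnlyCache(lambda key: {"id": key, "name": "..."}, ttl_seconds=600)
print(products.get(42))
```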

Second, while Ayende’s preferred solution may look very good and like the simplest thing to do in some scenarios, IMHO its main weakness is that it does not scale. Let me try to explain with an extreme example:

Suppose an enterprise has 5 major systems serving 5 departments, each of them with its own data silo. One day, each department contracts a consultant to help them do some data integration with the other systems (not that this should ever happen in real life!).

A few weeks later, each consultant comes up with a solution very much like the one Ayende explains: Each one contains its own schema for the data coming from the other 4 databases, each of these new schemas is fed by a separate ETL process, etc.

Now that the five consultants have taken their money, let’s analyze what the customer actually got:

ONE-TIME COST: Contained in each of the original 5 systems there is a subset of the data that needs to be shared. But instead of sharing it, the consultants simultaneously decided that the easiest path for each of them was to duplicate this data. So, in the end, each of the 5 systems holds copies of the shared subset owned by the other 4, which means up to 5 x 4 = 20 redundant copies of data that already existed! This will not only cost hardware: the schemas for this subset have been reinvented up to 20 times too, and 5 different ETL processes had to be designed, implemented and tested.
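
Just to spell out the arithmetic behind that figure (nothing below goes beyond the numbers already in the example):

```python
# The duplication arithmetic from the example, spelled out.
systems = 5                                  # departmental silos
copies_added_per_silo = systems - 1          # each silo copies the shared data of the other 4
extra_copies_total = systems * copies_added_per_silo   # 5 * 4 = 20 redundant copies

print(f"Each shared datum now exists in {1 + copies_added_per_silo} places")
print(f"Redundant copies across the enterprise: {extra_copies_total}")
```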

RELIABILITY: For simplicity's sake, we will only consider uptime, which measures the ability of the system not to "go down", and not its ability to maintain data consistency. If you do the math, I think you will see that in theory the customer's infrastructure is now more tolerant to failure. In "reality", however, the infrastructure is now much more complex, and hence much of this advantage is "lost to entropy" (every time something goes wrong, fixing it is harder). You could instead have invested the same money in redundancy for each of the 5 original systems. While two-fold redundancy buys less reliability than five-fold redundancy, most failover solutions won’t add nearly as much complexity.
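
For the "do the math" part, here is a back-of-the-envelope sketch. The 99% per-server availability is a number I made up purely for illustration, and I am assuming failures are independent, which in practice they rarely are:

```python
# Back-of-the-envelope uptime math under assumed, illustrative numbers.
per_server = 0.99   # assumed availability of any single server

# A given piece of shared data now lives in 5 silos: it is unreachable
# only if all 5 copies are down at the same time.
five_copies = 1 - (1 - per_server) ** 5

# The alternative: spend the money on a two-node failover pair for the
# single system that owns the data.
failover_pair = 1 - (1 - per_server) ** 2

print(f"5 replicated silos: {five_copies:.10f}")
print(f"1 failover pair:    {failover_pair:.6f}")
# Both are already far better than a single server; the extra nines bought
# by five copies are easily eaten by the added operational complexity.
```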

MAINTAINABILITY: I don’t quite understand Ayende’s points regarding maintainability, because in the event of a single schema modification he still needs to at least revise his ETL code. Although the system could surely keep running for hours on outdated data (improving uptime, not maintainability), eventually he would need to adjust it. In my extreme example, any single schema change can potentially affect all 5 systems! In contrast, if you could create a single compound EFx data service, you would probably just compensate for the changes by adjusting the mapping, and only once. UPDATE: I see I was assuming here a "static" definition of maintainability that is completely orthogonal to uptime. I may reconsider this argument, but it doesn't affect the main point.

SECURITY: I don’t clearly see Ayende’s point regarding security either. I think you need some means to perform authentication and flexible authorization, and to protect critical data, whether you are exposing it as a data service or making it available to an ETL process and then to users. In any case, we still don't know exactly what shape security will take in EFx and Astoria.

PERFORMANCE: How the new system will actually perform is impossible to predict (too many factors are left out of the example). However, we can easily see a lot of overhead in moving the same data among several servers. Once you have 5 copies of the data, you will probably see some performance improvement because of locality and parallelism. But the same effect could be achieved in a data service by using caching and conventional scale-out measures. In that case, schemas would not be unnecessarily complicated and consistency would be easier to maintain.

My point is that this data duplication approach, while simple at first, is a path an organization should not take many times. Once you have, say, three of these processes in operation, adding another one will probably be too painful.

This is only how things happen in a fictitious example. And Oren only talked about one system doing this. However, my thesis is that this scenario is not too detached from how things would go in real life.

I think the consultants would probably not talk much to each other, and they would probably never come up with an integrated solution. Why?

1. Business reasons: Simply put, each consultant is set to do what is best for his project and revenue in the short term, not what is good for their customer in the long term. They will optimize locally, not globally.

2. Most important of all, a technical reason: Unfortunately, there is currently no simple way of accomplishing the integration that the consultants could agree upon. This is precisely the need that composable EFx data services could address.

To satisfy the data integration needs of a company like the one in the example, a new kind of data access technology is needed: one that lets you easily build data services that are composable, that can extract data from virtually any source, that expose a very high-level (conceptual) data interface, that support flexible mapping, and that everyone can talk to using standard protocols.
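
To illustrate what I mean by that (and only to illustrate it; every name below is hypothetical and none of it reflects the actual EFx or Astoria APIs), here is a toy sketch of a single conceptual model whose entity sets are mapped onto separate backends:

```python
# A toy sketch of "composition" in the sense used above: one conceptual
# model whose entity sets are mapped onto several independent backends.

class ConceptualModel:
    def __init__(self):
        self._sources = {}  # entity set name -> callable returning rows

    def map_entity_set(self, name, source):
        """Map a conceptual entity set onto a concrete backend query."""
        self._sources[name] = source

    def query(self, name, **filters):
        """Answer a query against the conceptual model, delegating to
        whichever backend the mapping points at."""
        rows = self._sources[name]()
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]

# Two stand-ins for the departmental silos of the example.
def hr_employees():
    return [{"id": 1, "name": "Ana", "dept": "HR"}]

def sales_orders():
    return [{"id": 10, "employee_id": 1, "total": 250.0}]

model = ConceptualModel()
model.map_entity_set("Employees", hr_employees)
model.map_entity_set("Orders", sales_orders)

# Consumers query the conceptual model; where the rows come from is a
# mapping detail that can change without touching the callers.
print(model.query("Orders", employee_id=1))
```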

I think that Alex and I agree that most pieces of this solution are already beginning to appear.

The last paragraphs sound a lot like marketing :D But seriously, if the Data Programmability Team were to build such a thing, it would be yet another reason for me to be excited.
