EQUIP2 large dataspace notes

Chris Greenhalgh, 2006-04-27

Introduction

The initial dataspace API and implementations tend to presume that dataspaces will not be particularly large, i.e. will not contain very many items, and queries will not return very large numbers of items. In particular:

Many of the dataspace implementations cache the entire dataspace contents in memory at all times (the persistent versions just serialise this to and from external storage).
The match API returns a single array comprising all results (which are therefore all in memory).
The default session implementation caches all objects in use (including read/matched) for the duration of each session.
Related to this, the process of caching and copying forces the data store to read the object(s) fully, e.g. thwarting Hibernate's lazy fetching of collections (although this cannot be avoided in the case of a remote dataspace session).

This document starts out as my design notes on making EQUIP2 work with larger dataspaces, i.e. potentially thousands or millions of objects.

Approach

Each of the problem areas noted above needs to be addressed. The emphasis will be on (a) large numbers of objects in the dataspace and (b) API support to allow appropriately written applications to perform queries with large result sets.

Plan A is as follows:

For working with large numbers of objects, suitably tailored dataspace implementations are required, which only read objects into memory as required.

Currently there is:

equip2.persist.hibernate.j2se.PersistentDataspace, which is backed by the Hibernate O/R system on top of a relational database.

Additional options include:

J2ME JMS implementation which does not cache objects in memory.

To avoid caching all objects in the session, two optimisations are possible:

In a "read-only" session, objects do not need to be cached to check for application changes/modifications/removals.
In a normal session, a "matchUnmanaged" operation could return unmanaged objects, which similarly would not be monitored for changes, etc., and need not be cached. For best performance this request would generally map straight through to the dataspace, and would therefore be defined not to take account of changes made so far within the current session.
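
The two optimisations above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the EQUIP2 API: SketchDataspace and SketchSession are hypothetical names, matching is by class rather than by template object, and the real session also copies objects, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a dataspace: just a list of stored objects. */
class SketchDataspace {
    final List<Object> stored = new ArrayList<Object>();

    /** Simplified match: by class, not by template object (an assumption). */
    Object[] match(Class<?> type) {
        List<Object> out = new ArrayList<Object>();
        for (Object o : stored) {
            if (type.isInstance(o)) out.add(o);
        }
        return out.toArray();
    }
}

/** Hypothetical session showing where caching can be skipped. */
class SketchSession {
    private final SketchDataspace dataspace;
    private final boolean readOnly;
    // Objects read from the dataspace that the session must watch for changes.
    private final Set<Object> managed =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    // Objects added in this session but not yet committed.
    private final List<Object> pendingAdds = new ArrayList<Object>();

    SketchSession(SketchDataspace dataspace, boolean readOnly) {
        this.dataspace = dataspace;
        this.readOnly = readOnly;
    }

    void add(Object o) {
        if (readOnly) throw new IllegalStateException("read-only session");
        pendingAdds.add(o);
    }

    /** Normal match: sees pending adds; caches results unless read-only. */
    Object[] match(Class<?> type) {
        List<Object> out = new ArrayList<Object>();
        for (Object o : dataspace.match(type)) {
            if (!readOnly) managed.add(o); // a read-only session has nothing to check later
            out.add(o);
        }
        for (Object o : pendingAdds) {
            if (type.isInstance(o)) out.add(o);
        }
        return out.toArray();
    }

    /** Straight through to the dataspace: no caching, ignores session changes. */
    Object[] matchUnmanaged(Class<?> type) {
        return dataspace.match(type);
    }

    int managedCount() {
        return managed.size();
    }
}
```

In this sketch a matchUnmanaged call made after an add in the same session does not return the added object, which is the trade-off described above.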

To avoid returning match results as a monolithic data structure:

a version of match can be created which returns an Enumeration (rather than Iterator, for J2ME compatibility), which can then potentially fetch the results from the dataspace incrementally. The default implementation will just step internally through the results of the array version, but dataspaces that support it can do something cleverer. This needs to be implemented in ISession (DefaultSession) and also IDataspaceSession (DefaultDataspace and DataspaceConnection).
for remote use, this would eventually require more operations in the remote protocol to get results in chunks. leave that for later...
this only really saves memory if results are also unmanaged - only for matchUnmanaged, then?!
the number of results to be returned, and the starting index in the list, can be specified (in QueryTemplate) as per Hibernate Criteria (mainly for e.g. web situations, where views of the objects are paged and navigated through relatively slowly); this is typically combined with ordering constraint(s).
only the count of matches can be returned (if this is all that is required)
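
As a rough sketch of the default (non-clever) Enumeration version described above, the adapter below just steps through the array returned by the existing match, and also applies a starting index and maximum result count. The class name ArrayMatchEnumeration and its constructor parameters are assumptions for illustration, not existing EQUIP2 names; a J2ME build would also need to drop the generics.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

/** Hypothetical adapter: enumerates a window of an already-fetched result array. */
class ArrayMatchEnumeration implements Enumeration<Object> {
    private final Object[] results;
    private int next;       // index of the next element to hand out
    private final int end;  // one past the last index to hand out

    /**
     * @param results     full result array from the array-based match
     * @param firstResult 0-based index of the first result to return
     * @param maxResults  maximum number of results, or negative for no limit
     */
    ArrayMatchEnumeration(Object[] results, int firstResult, int maxResults) {
        this.results = results;
        // Clamp the start into [0, results.length] so an out-of-range page is simply empty.
        int start = Math.max(0, Math.min(firstResult, results.length));
        int remaining = results.length - start;
        this.next = start;
        this.end = (maxResults < 0 || maxResults > remaining)
                ? results.length
                : start + maxResults;
    }

    public boolean hasMoreElements() {
        return next < end;
    }

    public Object nextElement() {
        if (next >= end) throw new NoSuchElementException();
        return results[next++];
    }
}
```

For example, wrapping the five results a, b, c, d, e with a starting index of 1 and a maximum of 3 enumerates b, c, d. This default still holds the whole array in memory; only a dataspace that fetches incrementally behind the same Enumeration would actually reduce memory use.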

Other notes:

to support a session which allows very large numbers of updates/adds would require major changes to the way changes are committed, since they would need to be committed in chunks before the end of the session. for most cases, using a number of sessions and appealing to some other mechanism for consistency between sessions (e.g. taking the site off-line!) gives a work-around for bulk data upload.
however, there is still some utility in allowing 'write only' sessions. in particular some dataspace implementations (e.g. append to file) might ONLY be able to support such sessions, and other variants of the simple memory cache dataspace could do lazy loading into memory (not on write only/append sessions).
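
The multiple-session work-around for bulk upload might look something like the sketch below. ChunkSession, ChunkSessionFactory and BulkUploader are hypothetical names standing in for whatever session API is actually used; consistency between the sessions is not handled here and has to come from elsewhere, as noted above.

```java
import java.util.List;

/** Hypothetical minimal session: add objects, then commit them together. */
interface ChunkSession {
    void add(Object o);
    void commit();
}

/** Hypothetical factory that begins a new session on some dataspace. */
interface ChunkSessionFactory {
    ChunkSession begin();
}

class BulkUploader {
    /**
     * Uploads items using one session per chunk, so that no single session
     * has to hold more than chunkSize pending adds.
     *
     * @return the number of sessions used
     */
    static int upload(List<?> items, int chunkSize, ChunkSessionFactory factory) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int sessions = 0;
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = from + Math.min(chunkSize, items.size() - from);
            ChunkSession session = factory.begin();
            for (Object o : items.subList(from, to)) {
                session.add(o);
            }
            session.commit(); // each chunk becomes visible on its own
            sessions++;
        }
        return sessions;
    }
}
```

If a later chunk fails, earlier chunks stay committed, which is why some other mechanism is needed for consistency across the whole upload.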

Change Log

2006-07-17

updated with new stuff, e.g. max results

2006-04-27

Created document