EQUIP2 large dataspace notes
Chris Greenhalgh, 2006-04-27
Introduction
The initial dataspace API and implementations tend to presume that
dataspaces will not be particularly large, i.e. will not contain very
many items, and queries will not return very large numbers of items. In
particular:
- Many of the dataspace implementations cache the entire dataspace
contents in memory at all times (the persistent versions just serialise
this to and from external storage).
- The match API returns a single array comprising all results
(which are therefore all in memory).
- The default session implementation caches all objects in use
(including read/matched) for the duration of each session.
- Related to this, the process of caching and copying forces the
data store to fully read the object(s), e.g. thwarting Hibernate's lazy
fetching of collections (although this cannot be avoided in the case of
a remote dataspace session).
This document starts out as my design notes on making EQUIP2 work
with larger dataspaces, i.e. potentially thousands or millions of
objects.
Approach
Each of the problem areas noted above needs to be addressed. The
emphasis will be on (a) large numbers of objects in the dataspace and
(b) API support to allow appropriately written applications to perform
queries with large result sets.
Plan A is as follows:
- For working with large numbers of objects, suitably tailored
dataspace implementations are required, which only read objects into
memory as required.
- equip2.persist.hibernate.j2se.PersistentDataspace, which is
backed by the Hibernate O/R system onto a relational database.
- Additional options include:
- J2ME JMS implementation which does not cache objects in
memory.
- To avoid caching all objects in the session, two optimisations
are possible:
- In a "read-only" session, objects do not need to be cached to
check for application changes/modifications/removals.
- In a normal session, a "matchUnmanaged" operation could return
unmanaged objects, which similarly would not be monitored for changes,
etc, and need not be cached. For best performance this request would
generally map straight through to the dataspace, and the operation would
therefore be defined as not taking account of changes made so far within
the current session.
- To avoid returning match results as a monolithic data structure:
- a version of match can be created which returns an Enumeration
(rather than an Iterator, for J2ME compatibility), which can then
potentially fetch the results from the dataspace incrementally. The
default implementation will just step through the results of the array
version internally, but dataspaces that support it can do cleverer
things. This needs to be implemented in ISession (DefaultSession) and
also IDataspaceSession (DefaultDataspace and DataspaceConnection).
- for remote use, this would eventually require more operations in
the remote protocol to get results in chunks; leave that for later...
- this only really saves memory if the results are also unmanaged, so
perhaps only for matchUnmanaged, then?!
- the number of results to be returned, and the starting index in the
list, can be specified (in QueryTemplate) as per Hibernate Criteria
(mainly for e.g. web situations, where views of the objects are paged
and navigated through relatively slowly); this is typically combined
with ordering constraint(s).
- only the count of matches can be returned (if this is all that
is required)
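
As a rough illustration of the default behaviour described above (an
Enumeration that just steps through the array version of match, with an
optional starting index and maximum number of results), a minimal sketch
might look like the following. The class name ArrayMatchEnumeration, its
constructor parameters and the remaining() method are all hypothetical
and are not part of the EQUIP2 API; the sketch also uses J2SE generics
for brevity, which a J2ME version would have to drop.

```java
import java.util.Enumeration;
import java.util.NoSuchElementException;

/**
 * Hypothetical adapter (not part of EQUIP2): exposes the array returned by
 * an array-based match as an Enumeration, restricted to a window of results.
 */
final class ArrayMatchEnumeration implements Enumeration<Object> {
    private final Object[] results;
    private final int end;  // index one past the last result to return
    private int next;       // index of the next result to return

    /**
     * @param results     full result array from the array-based match
     * @param firstResult 0-based index of the first result to return
     * @param maxResults  maximum number of results, or negative for no limit
     */
    ArrayMatchEnumeration(Object[] results, int firstResult, int maxResults) {
        this.results = results;
        // Clamp the starting index into the valid range [0, length].
        this.next = Math.min(Math.max(firstResult, 0), results.length);
        int available = results.length - this.next;
        int count = (maxResults < 0) ? available : Math.min(maxResults, available);
        this.end = this.next + count;
    }

    public boolean hasMoreElements() {
        return next < end;
    }

    public Object nextElement() {
        if (next >= end) {
            throw new NoSuchElementException();
        }
        return results[next++];
    }

    /** Number of results in the window that have not yet been returned. */
    int remaining() {
        return end - next;
    }
}
```

Note that this default gives no memory saving at all, since the whole
array is already in memory; a dataspace that supports incremental
fetching would supply its own Enumeration instead.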
Other notes:
- to support a session which allows very large numbers of
updates/adds would require major changes to the way changes are
committed, since they would need to be committed in chunks before the
end of the session. For most cases, using a number of sessions and
appealing to some other mechanism for consistency (e.g. taking the site
off-line!) between sessions gives a work-around for bulk data upload.
- however, there is still some utility in allowing 'write only'
sessions. In particular some dataspace implementations (e.g. append to
file) might ONLY be able to support such sessions, and other variants
of the simple memory cache dataspace could do lazy loading into memory
(not on write only/append sessions).
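
The bulk upload work-around above (several sessions rather than one very
large one) could be sketched roughly as follows. The interfaces
WriteOnlySession and WriteOnlySessionFactory and the class ChunkedUpload
are hypothetical stand-ins invented for this illustration, not EQUIP2
types; the real session API will differ. The sketch does nothing about
consistency between sessions, which as noted has to come from some other
mechanism.

```java
import java.util.List;

/** Hypothetical write-only session: accepts additions, commits them on end(). */
interface WriteOnlySession {
    void add(Object item);
    void end();
}

/** Hypothetical source of sessions; stands in for whatever the dataspace offers. */
interface WriteOnlySessionFactory {
    WriteOnlySession begin();
}

final class ChunkedUpload {
    /**
     * Adds all items using one session per chunk of at most chunkSize items.
     * Returns the number of sessions used.
     */
    static int upload(WriteOnlySessionFactory factory, List<?> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int sessions = 0;
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            WriteOnlySession session = factory.begin();
            for (Object item : items.subList(start, end)) {
                session.add(item);
            }
            // Each chunk is committed on its own; a failure part-way through
            // leaves the earlier chunks committed.
            session.end();
            sessions++;
        }
        return sessions;
    }
}
```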
Change Log
2006-07-17
- updated with new stuff, e.g. max results
2006-04-27