Yoshinori Okuji
2005-09-21 20:17:45 UTC
At Sébastien's request, I will briefly describe how indexing works in the
current implementation.
In the past, ERP5 cataloged objects one by one. For each object,
portal_catalog called Z SQL Methods to insert rows into tables. This was
slow for two reasons: MySQL invoked its SQL query interpreter and rebuilt
indices for every single object, and the ZODB cache was used inefficiently.
Now ERP5 groups multiple objects for indexing, using new functionality in
CMFActivity. The SQLDict activity implements support for group methods and
expand methods. First, I will explain group methods.
Making an active object normally looks like this:
obj.activate().immediateReindexObject()
CMFActivity can be extended arbitrarily by passing optional parameters to
activate:
obj.activate(group_method_id='portal_catalog/catalogObjectList').immediateReindexObject()
The parameter "group_method_id" is simply ignored by an activity that does
not recognize it. SQLDict does recognize it, and handles such an active
object specially. In this example, SQLDict tries to gather active objects
that have the same group method id. With the current setting, SQLDict
collects up to 100 objects at a time and validates each active object (e.g.
by checking an after method id). Then SQLDict obtains the objects from ZODB
and calls the group method with the list of those objects. So, in SQLDict,
immediateReindexObject is no longer called at all, although it stays in the
call to keep compatibility with activities that ignore the parameter.
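The grouping step described above can be sketched roughly as follows. This is only an illustration, not the SQLDict code: the function name batch_by_group_method and the tuple layout of a message are hypothetical, and the validation step (such as checking an after method id) is left out. Only the limit of 100 and the idea of batching by group method id come from the description.

```python
from collections import OrderedDict

# Batch limit taken from the description ("up to 100 objects at a time").
MAX_GROUP_SIZE = 100


def batch_by_group_method(messages, max_group_size=MAX_GROUP_SIZE):
    """Yield (method_to_call, [object_path, ...]) batches.

    `messages` is an iterable of hypothetical
    (object_path, method_id, group_method_id) tuples.
    A message without a group method id is yielded alone with its own
    method id, which mirrors the old one-call-per-object behaviour.
    """
    pending = OrderedDict()  # group_method_id -> list of object paths
    for object_path, method_id, group_method_id in messages:
        if group_method_id is None:
            yield method_id, [object_path]
            continue
        batch = pending.setdefault(group_method_id, [])
        batch.append(object_path)
        if len(batch) >= max_group_size:
            # The batch is full: hand it over and start a new one.
            yield group_method_id, pending.pop(group_method_id)
    # Flush whatever is left once all messages have been read.
    for group_method_id, batch in pending.items():
        yield group_method_id, batch
```

Each yielded batch stands for one call of the group method with a list of objects, instead of one call of immediateReindexObject per object.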
The method "catalogObjectList" in portal_catalog calls Z SQL Methods with the
list of objects (after filtering). This significantly reduces the number of
SQL queries sent to MySQL, and so performs better. Also, if the objects are
related (which is true in most cases), the ZODB cache is more likely to hit
the same objects, which also reduces the load on Zope.
Now, about expand methods. There are several ways to implement
recursiveReindexObject. In the old implementation, recursiveReindexObject
called immediateReindexObject on each recursively traversed object. So one
option was to call catalogObjectList with a list of the traversed objects.
However, this would not allow a recursiveReindexObject call to be grouped
with another recursiveReindexObject call or with reindexObject. So I decided
to add a new parameter to SQLDict: expand_method_id.
As you can see in ERP5Type/Document/Folder.py, recursiveReindexObject looks
like this:
obj.activate(group_method_id='portal_catalog/catalogObjectList',
expand_method_id='getIndexableChildValueList').recursiveImmediateReindexObject()
As explained above, when an activity does not recognize group_method_id or
expand_method_id, it just calls recursiveImmediateReindexObject as before.
SQLDict deals with this differently. Because the call uses the same group
method as reindexObject, it is grouped with reindexObject calls. Then SQLDict
finds the expand method "getIndexableChildValueList" and calls it on the
object. The result is a list of all indexable child objects, including the
object itself. This list is passed on to the group method, and the rest is
the same as for reindexObject.
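To make the expansion step concrete, here is a small sketch under stated assumptions. The Node class and the function build_group_argument are hypothetical stand-ins, not ERP5 code, and this toy getIndexableChildValueList only imitates the behaviour described above (the object itself plus its indexable descendants). How the real method treats children of a non-indexable object is not covered here; the sketch simply keeps traversing them.

```python
class Node:
    """Hypothetical stand-in for an ERP5 document or folder."""

    def __init__(self, name, children=(), indexable=True):
        self.name = name
        self.children = list(children)
        self.indexable = indexable

    def getIndexableChildValueList(self):
        """Return this object and all descendants that are indexable."""
        result = [self] if self.indexable else []
        for child in self.children:
            result.extend(child.getIndexableChildValueList())
        return result


def build_group_argument(entries):
    """Build the object list handed to the group method.

    `entries` is a list of (object, expand_method_id) pairs. An entry
    with no expand method contributes the object itself (the
    reindexObject case); an entry with an expand method contributes
    whatever that method returns (the recursiveReindexObject case).
    """
    objects = []
    for obj, expand_method_id in entries:
        if expand_method_id is None:
            objects.append(obj)
        else:
            objects.extend(getattr(obj, expand_method_id)())
    return objects
```

Because both kinds of entry end up in one flat list, a single call of the group method can serve plain and recursive reindexing requests together.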
Due to this change, portal_catalog no longer uses Z SQL Methods that handle a
single object, such as z_catalog_category. Instead, it uses methods that
handle multiple objects, such as z_catalog_object_list. These methods make
use of MySQL's extended inserts, which can insert multiple rows with a single
query. Although this is specific to MySQL, we can do a similar optimization
for PostgreSQL as well (e.g. dropping indices, inserting rows, and rebuilding
indices).
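For readers unfamiliar with extended inserts, the sketch below shows the general idea of sending many rows in one INSERT statement. It is not the content of z_catalog_object_list: the helper names, the table and the columns are hypothetical, and Python's sqlite3 module (which also accepts multi-row VALUES) stands in for MySQL so that the example runs on its own. A MySQL driver would typically use a different placeholder style.

```python
import sqlite3


def build_multi_row_insert(table, columns, row_count):
    """Return an INSERT statement with one placeholder group per row.

    `table` and `columns` are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    if row_count < 1:
        raise ValueError("row_count must be at least 1")
    row = "(" + ", ".join("?" for _ in columns) + ")"
    return "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row] * row_count))


def insert_rows(connection, table, columns, rows):
    """Insert all rows with a single statement instead of one per row."""
    rows = list(rows)
    if not rows:
        return
    statement = build_multi_row_insert(table, columns, len(rows))
    # Flatten the rows so the values line up with the placeholders.
    parameters = [value for row in rows for value in row]
    connection.execute(statement, parameters)
```

With three rows and two columns, the database receives one statement carrying six values rather than three separate statements.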
Is this enough?
YO
--
Yoshinori Okuji, Nexedi Research Director
Nexedi: Consulting and Development of Free / Open Source Software
http://www.nexedi.com
ERP5: Free / Open Source ERP Software for small and medium companies
http://www.erp5.org
Storever: OpenBrick, WiFi infrastructure, notebooks and servers
http://www.storever.com