Searching

Requirements

We need to support searching in the form of Restrictions. See [MS-OXCDATA] Section 2.12.

Restrictions come in various RestrictionTypes. From [MS-OXCDATA]:

Restrictions are sent to the server with the RopFindRow ([MS-OXCROPS] section 2.2.5.13), RopRestrict ([MS-OXCROPS] section 2.2.5.3), RopSetSearchCriteria ([MS-OXCROPS] section 2.2.4.4), and RopSynchronizationConfigure ([MS-OXCROPS] section 2.2.13.1) requests, and are returned from the RopGetSearchCriteria ([MS-OXCROPS] section 2.2.4.5) request.

There are 12 different restriction packet formats: Six of them (AndRestriction, OrRestriction, NotRestriction, SubObjectRestriction, CommentRestriction, and CountRestriction) are used to construct more complicated restrictions from one or more simpler ones. The other six types (ContentRestriction, PropertyRestriction, ComparePropertiesRestriction, BitMaskRestriction, SizeRestriction, and ExistRestriction) specify specific tests based on the properties of an item.

ContentRestriction is a test that some property value is roughly (as specified by ulFuzzyLevel) equal to the requirement
PropertyRestriction is a test between a property value and the requirement, using one of the following operators: <, >, <=, =>, ==, !=, and "is member of distribution list"
ComparePropertiesRestriction is a test between two property values, using one of the following operators: <, >, <=, =>, ==, !=, and "is member of distribution list"
BitMaskRestriction is a test between a property value and the requirement, using one of the following operators: (TBD - BitmapRelOp)
ExistRestriction is a test that a property value exists (i.e. the object has a property with a specified PropTag)
CommentRestriction: TBD
CountRestriction: TBD
AnnotationRestriction: TBD
AndRestriction, OrRestriction, NotRestriction are used to modify other restrictions. AndRestriction and OrRestriction are logical operators on two or more sub-restrictions. NotRestriction inverts the output of a single sub-restriction.

There are complications over doing property comparison operations over properties that are multi-valued (e.g. a comparison against a PtyMultString).

There is a fairly complex example in [MS-OXCDATA] section 3.1.

[MS-OXOSRCH] Section 2.2.3 has a set of templates used by Outlook (and presumably Windows Search), which could give us some hints on what we could optimise for.

Implementation options

Xapian looks like a good option. Needs further investigation.

Discussion with Olly Betts on IRC:

[21:35] <bradh_laptop> a newbie question: does xapian treat everything as a string? For example, if I call Xapian::Document::add_value() with a number in hex, then do the query in decimal, will that match?
[21:36] <ojwb> it's a string
[21:36] <ojwb> if you're putting numbers in a value slot, then you probably want to use Xapian::sortable_serialise()
[21:37] <ojwb> then you can sort, do ranges, etc
[21:37] <bradh_laptop> I'm looking at Xapian as part of the openchange (Microsoft Exchange protocol) server (where the protocol treats messages as a set of key/value pairs).
[21:39] <ojwb> so are the keys and values there typed or just strings?
[21:39] <bradh_laptop> ojwb: typed.
[21:39] <bradh_laptop> the keys are 16 bit property numbers, plus 16 bit types.
[21:40] <bradh_laptop> (i.e. the key & 0xffff tells you the type).
[21:41] <bradh_laptop> bool, uint16, uint32, uint64, 8 bit strings, ucs2 strings (which we convert to utf8 for sanity), guids, binary arrays
[21:42] <bradh_laptop> plus array versions of most of those.
[21:43] <bradh_laptop> so the PidTagBody property is the plain text of the body of the message. PidTagRecipientTo is the list of recipients.
[21:44] <bradh_laptop> the send time comes as a pair of 32 bit values, and recieve time as another pair.
[21:44] <bradh_laptop> we do have a serialisation convention already though.
[21:45] <bradh_laptop> just interested in the impedance mismatch for case like "greater than or equal to" 
[21:45] <ojwb> ok, well how best to index a key/value pair depends how you want to use it for searching
[21:45] <ojwb> hang on, need to afk for a couple of minutes
[21:45] <bradh_laptop> np
[21:46] <bradh_laptop> appreciate its already pretty late on Friday night :-)
[21:50] <bradh_laptop> http://msdn.microsoft.com/en-us/library/ee201126%28v=EXCHG.80%29.aspx is the gory details of the different searching operations I might need to do. Some of those (like ComparePropertiesRestriction, which compares two properties) don't look like the thing that a search engine would do. Perhaps some kind of filtering? 
[21:50] <ojwb> ok, well if you want to do things like greater than, you're going to want to use a value, and serialise such that the order is right
[21:50] <bradh_laptop> http://msdn.microsoft.com/en-us/library/ee178574%28v=EXCHG.80%29.aspx is an example for a realistic search.
[21:52] <bradh_laptop> bitmask doesn't look too friendly in a string either...
[21:52] <ojwb> if you're using xapian 1.2, you can write a PostingSource subclass to do arbitrary filtering based on a value slot in each document
[21:52] <ojwb> which could do that
[21:53] <ojwb> it's better to filter during the search than have to generate a huge number of results and cull them afterwards
[21:54] * bradh_laptop looks at the API docs.
[21:55] <ojwb> http://xapian.org/docs/postingsource gives an overview
[21:56] <bradh_laptop> ok, thats not going to be trivial to understand :-)
[21:58] <bradh_laptop> so does it seem more natural to treat the message body as a Xapian::Document, and add the extra properties with add_value()?
[21:58] <bradh_laptop> or would it be better to index each of the values as a separate "document" then try to combine later?
[21:59] <ojwb> i think the "document" has to be the message here
[21:59] <ojwb> that's what you want to retrieve, ultimately
[21:59] <bradh_laptop> actually, I just need the id of it.
[22:00] <ojwb> sure, but I mean in logical terms
[22:00] <bradh_laptop> the protocol allows the client (outlook, usually) to only pull down some parts.
[22:00] <bradh_laptop> but yes.
[22:00] <bradh_laptop> conceptually I want to do stuff to a mail message.
[22:02] <bradh_laptop> is there much overhead difference in using extra databases? I was thinking of one-per-user, but one-per-folder is also a possibility.
[22:03] <ojwb> not a huge one, but I don't think anyone's done very extensive profiling of it
[22:04] <bradh_laptop> also, can add_value cope with say a few thousand different keys? Perhaps a hundred or so on each message?
[22:04] <ojwb> per folder might make sense, as you can often just use a single one, or a subset of them
[22:04] <ojwb> the number of values shouldn't be a big issue
[22:04] <-- lidaibin has left this server (Quit: Leaving.).
[22:05] <ojwb> so long as you're using 1.2 at least
[22:05] <bradh_laptop> I'm good with requiring a recent release.
[22:05] <ojwb> the default backend in 1.0 stored them all together per document, so loading a single value was O(# of values in that document)
[22:05] <bradh_laptop> The server is built on samba4, and its only alpha...
[22:06] <bradh_laptop> ouch.
[22:07] <ojwb> though 100 values wouldn't be 100 times slower than 1
[22:08] <bradh_laptop> O(1) was what I had in mind :-)
[22:08] <ojwb> the default backend in 1.2 stores them a chunked stream for each slot
[22:12] <bradh_laptop> ok. sounds like I need to sort out my serialisation format, and try some tests.
[22:14] <bradh_laptop> in the glossary, it talks about "incremental modifications" for the chert database format . I take it that (conceptually) is removal of the original document and re-adding the modified version.
[22:20] <ojwb> i think it just means "you don't have to rebuild the DB from scratch to change it" 
[22:21] <ojwb> which is pretty rare in modern systems, but used to be fairly common
[22:23] <ojwb> but it's true that you can also update documents, which is conceptually as you say
[22:24] <bradh_laptop> ojwb: thanks very much for your time and expertise. much appreciated.
[22:24] <bradh_laptop> are you OK with me attaching the IRC log to the search design page on our wiki?
[22:25] <ojwb> sure
[22:25] <ojwb> feel free to ask if you've any questions
[22:26] <ojwb> in particular I suspect the postingsource docs could probably do with improving as it's pretty new
[22:26] <ojwb> i noticed just now that it just says "FIXME" where the examples should be

Also available in: HTML TXT