Opened 12 years ago

Closed 12 years ago

Last modified 8 years ago

#1090 closed enhancement (fixed)

refactor indexing/query parsing, add support for prefixes

Reported by: sascha_silbe Owned by: tomeu
Priority: Unspecified by Maintainer Milestone: Unspecified
Component: Sugar Version: Git as of bugdate
Severity: Unspecified Keywords: r+
Cc: Distribution/OS: Unspecified
Bug Status: New

Description

The attached patch refactors the indexing (term generation) and query parsing code in indexstore.py.
Now adding a new attribute requires only two lines to be added. This will come in quite handy when adding all the attributes proposed by alsroot. :)
In addition, prefix search (e.g. "activity:calculate") and phrase search (e.g. "org.laptop.Calculate") now work (in addition to non-prefix search - e.g. "activity:org.laptop.calculate" and "org.laptop.calculate" usually return the same set of entries).
Date range search is implemented, but doesn't work with current Journal because it appends an asterisk to each term. This also seems to confuse Xapian for certain phrase queries (e.g. "activity:org.lap" doesn't work, but "activity:or" and "activity:org.laptop" do work). Also the date parser isn't very tolerant.

Attachments (2)

sugar-datastore-prefixes.patch (12.4 KB) - added by sascha_silbe 12 years ago.
round 3
0001-IndexStore-refactoring-and-prefix-term-support.patch (13.4 KB) - added by sascha_silbe 12 years ago.
code for round 4

Download all attachments as: .zip

Change History (11)

comment:1 Changed 12 years ago by sascha_silbe

  • Keywords r? added

comment:2 Changed 12 years ago by tomeu

  • Keywords r- added; r? removed

Do you think would be possible to split this patch in one with the refactoring and another(s) with the new feature(s)?

Please see the PEP 8 about style, more specifically:

  • don't use different coding styles in the same module, much less the same file.
  • favour readability over compactness.
  • if you feel the need to add inline comments to make the code more understandable, consider changing the code structure so it's clearer what the code does. Splitting a line in two and a function in two is very cheap in python.
  • consider subclassing xapian.TermGenerator and having _index_prop() there.
  • try to avoid using abbreviations in identifiers.

Thanks for the patch, is it related in some way to http://wiki.sugarlabs.org/go/Features/Plain_Query_Format ?

comment:3 Changed 12 years ago by sascha_silbe

  • Keywords r? added; r- removed

OK, revised my patch. Now subclasses TermGenerator and QueryParser. Checked with pylint and pep8.py.

comment:4 follow-up: Changed 12 years ago by tomeu

  • Keywords r- added; r? removed

I'm unsure about date ranges, do you really expect people will use them without any hint about the expected syntax? I had to look at the code, then the strptime() man page, then run in python time.strftime() in order to know the exact date syntax.

I'm also not happy about the datastore using locale-dependent functions because it means that the DS needs to run in the same locale as the user expects, which puts additional burden on the distributor. Hard to track bugs will appear because of this in different platforms (we have seen this with the older DS implementation).

While I see some value for prefixes in the free text query, I don't see date ranges being really used in the field.

There's also the issue with the Journal appending a * to the query, how do you plan to solve this?

Are we changing anything in what is stored in the index? If so, the index needs to be rebuilt? Any change in the size of the index?

Barring that, the patch looks very good and will be happy to push it soon. Some small issues, only noting the first occurrence:

_query_term_map = { 

See PEP8 about the naming of constants.

        # add full value for dictionary-based searches 
...
        # index with and without prefix so specifying a prefix is optional 
        # in query strings 

If you need comments to explain what your code does, consider restructuring it so it's clearer, or use API comments instead (likely the best option in this case).

        # TODO: change query parser to generate phrase query instead 

Can you extend on this?

        doc.add_term(prefix+value)

What's the value of having a term without any prefix if we are already calling index_text() for that value? See PEP8 about spaces around operators.

        for (name, prefix) in _query_term_map.items(): 

Unneeded parens.

    def _parse_query_term(self, m_name, prefix, m_value): 

No idea what m_ means here, please don't use abbreviations if at all possible.

+            return Query(Query.OP_OR, [
+                self._parse_query_term(m_name, prefix, word)
+                for word in m_value])

Can you split this in several statements? That will also give you the chance to give names to expressions, thus making clearer what is going on there.

+                "Only tuples of size 2 have a defined meaning. " \
+                "Did you mean to pass a list instead?")

Strings don't need \. If we can use ' everywhere instead of ", would be great.

+        return Query(Query.OP_VALUE_RANGE,
+            value_no, str(start), str(end))

Fits in one line.

+            # compatibility option for timestamp: {'start': 0, 'end': 1}

Does this mean that ranges in the query dict would get deprecated by the ranges in the query string?

+        except xapian.QueryParserError, exception:
+            logging.warning("Invalid query string: "+exception.get_msg())
+            return Query()

Hmm, if the exception was propagated, the Journal could say that an error has happened, similar to how we say today that the query returned no results. Wouldn't that be more useful feedback than returning the results of a blank query?

+        queries += [
+            self._parse_query_term(name, prefix, query_dict.pop(name))
+            for (name, prefix) in _query_term_map.items()
+            if name in query_dict]
+
+        queries += [
+            self._parse_query_value(name, value_no, query_dict.pop(name))
+            for (name, value_no) in _query_value_map.items()
+            if name in query_dict]

Please split this in several statements. Not only helps with readability but also keeps future patches more clear.

+        self._query_parser = QueryParser()
+        self._query_parser.set_database(self._database)
+        self._term_generator = TermGenerator()

Are these operations expensive enough to cache those objects? Don't remember seeing them when I profiled the DS.

I'm trying to improve the review process docs, comments welcome:

http://wiki.sugarlabs.org/go/Development_Team/CodeReview

comment:5 in reply to: ↑ 4 ; follow-up: Changed 12 years ago by sascha_silbe

  • Keywords r? added; r- removed

Replying to tomeu:

I'm unsure about date ranges, do you really expect people will use them without any hint about the expected syntax?

I don't expect them to use any non-basic query without proper documentation. The DateRangeProcessor was for mostly for completeness and can be considered an independent, new feature so I've removed it for now. If you think it might still be useful, we can work on a better version (e.g. with the Journal translating the date format) in a separate ticket.

There's also the issue with the Journal appending a * to the query, how do you plan to solve this?

It's been a PITA during testing; would like to switch to FLAG_PARTIAL instead. The semantics are slightly different, though: FLAG_PARTIAL only does an implicit wildcard match on the last word, not on all.

Are we changing anything in what is stored in the index? If so, the index needs to be rebuilt?

Yes and yes. I've not added code for doing the rebuild as I hope to get the rest of my changes into mainline for 0.88 and these already do an index rebuild because of the datastore layout changes.

Any change in the size of the index?

It'll be significantly bigger. We currently store the "known" metadata twice (see the comment quoted below for the reason - basically it's required by Xapian and probably needs upstream changes to fix) and also include position information. The latter is needed for "phrase" searching to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered separate words). Also we store each value both word-split and as a literal term (see below - to be changed before 0.88).

_query_term_map = {

See PEP8 about the naming of constants.

Forgot to rename after moving from class to global. Fixed.

# add full value for dictionary-based searches

...

# index with and without prefix so specifying a prefix is optional
# in query strings

If you need comments to explain what your code does, consider restructuring it so it's clearer, or use API comments instead (likely the best option in this case).

I have reworded those comments to make it clear that they are explanations of _why_ we are doing things this way, not of _what_ we are doing. I don't think splitting the function up (which would be required to turn the comments into docstrings) makes sense here.

# TODO: change query parser to generate phrase query instead

Can you extend on this?

We want the user to be able to search for single words (e.g. for "tags"), so we split words during indexing. But dictionary-based queries currently use the passed value literally, i.e. without word splitting. So we need to store it as a single, unsplit term as well. By changing our dictionary query "parser" (rather: builder) to do word-splitting and using phrase searching, we can drop that duplication, thereby reducing the storage size significantly.

doc.add_term(prefix+value)

What's the value of having a term without any prefix if we are already calling index_text() for that value?

index_text does word splitting, add_term does not.

See PEP8 about spaces around operators.

Looks like neither pylint nor pep8.py check for this. I've changed this occurence, but may have missed others.

for (name, prefix) in _query_term_map.items():

Unneeded parens.

Obviously a matter of taste. Changed.

def _parse_query_term(self, m_name, prefix, m_value):

No idea what m_ means here, please don't use abbreviations if at all possible.

Missed one occurence during refactoring, fixed now.

+ return Query(Query.OP_OR, [
+ self._parse_query_term(m_name, prefix, word)
+ for word in m_value])

Can you split this in several statements? That will also give you the chance to give names to expressions, thus making clearer what is going on there.

Done.

+ "Only tuples of size 2 have a defined meaning. " \
+ "Did you mean to pass a list instead?")

Strings don't need \. If we can use ' everywhere instead of ", would be great.

Changed.

+ # compatibility option for timestamp: {'start': 0, 'end': 1}

Does this mean that ranges in the query dict would get deprecated by the ranges in the query string?

No, deprecated by ranges expressed using tuples. Unlike alsroot, I want to keep the dictionary-based interface and even see it as the primary interface for by most activities.

+ except xapian.QueryParserError, exception:
+ logging.warning("Invalid query string: "+exception.get_msg())
+ return Query()

Hmm, if the exception was propagated, the Journal could say that an error has happened, similar to how we say today that the query returned no results. Wouldn't that be more useful feedback than returning the results of a blank query?

I'm not sure, it should probably be discussed with the Design Team. BTW, Query() matches nothing at all, so isn't the same as passing an blank query into the datastore. So the user is already getting some kind of feedback.

+ queries += [
+ self._parse_query_term(name, prefix, query_dict.pop(name))
+ for (name, prefix) in _query_term_map.items()
+ if name in query_dict]

Please split this in several statements. Not only helps with readability but also keeps future patches more clear.

The only way I could split this into several statements is by doing a regular for loop and adding each item individually using append(). This is probably slower, though I'm not sure if it'll make a real difference.

+ self._query_parser = QueryParser()
+ self._query_parser.set_database(self._database)
+ self._term_generator = TermGenerator()

Are these operations expensive enough to cache those objects? Don't remember seeing them when I profiled the DS.

It seemed wasteful to regenerate them every time and very easy to change. Haven't profiled the DS yet.

I'm trying to improve the review process docs, comments welcome:

An automatic style checker that catches more of your stylistic complaints (string separator, spaces between operators, ...) would be great as it reduces the number of review rounds.

Changed 12 years ago by sascha_silbe

round 3

comment:6 in reply to: ↑ 5 Changed 12 years ago by tomeu

  • Keywords r- added; r? removed

Replying to sascha_silbe:

Replying to tomeu:

I'm unsure about date ranges, do you really expect people will use them without any hint about the expected syntax?

I don't expect them to use any non-basic query without proper documentation. The DateRangeProcessor was for mostly for completeness and can be considered an independent, new feature so I've removed it for now. If you think it might still be useful, we can work on a better version (e.g. with the Journal translating the date format) in a separate ticket.

Well, I cannot really say myself because are far from sugar-using children. We should ask in the mailing list for opinions.

There's also the issue with the Journal appending a * to the query, how do you plan to solve this?

It's been a PITA during testing; would like to switch to FLAG_PARTIAL instead. The semantics are slightly different, though: FLAG_PARTIAL only does an implicit wildcard match on the last word, not on all.

That will change the UI, I remember Marco working on this closely with Eben. Can you get feedback from people on this issue? Otherwise they are going to say we broke the search field ;)

Are we changing anything in what is stored in the index? If so, the index needs to be rebuilt?

Yes and yes. I've not added code for doing the rebuild as I hope to get the rest of my changes into mainline for 0.88 and these already do an index rebuild because of the datastore layout changes.

Sounds good.

Any change in the size of the index?

It'll be significantly bigger. We currently store the "known" metadata twice (see the comment quoted below for the reason - basically it's required by Xapian and probably needs upstream changes to fix) and also include position information. The latter is needed for "phrase" searching to work, e.g. "mime_type:text/plain" ("text" and "plain" are considered separate words). Also we store each value both word-split and as a literal term (see below - to be changed before 0.88).

I guess deployers would be able to help evaluate this. Can you ask in the mailing list addressing specifically dsd and martin langhoff? I guess they will ask you for some numbers. Would be good to know how it impacts performance on the XO.

# TODO: change query parser to generate phrase query instead

Can you extend on this?

We want the user to be able to search for single words (e.g. for "tags"), so we split words during indexing. But dictionary-based queries currently use the passed value literally, i.e. without word splitting. So we need to store it as a single, unsplit term as well. By changing our dictionary query "parser" (rather: builder) to do word-splitting and using phrase searching, we can drop that duplication, thereby reducing the storage size significantly.

Any downsides? How hard would it be to implement that?

+ # compatibility option for timestamp: {'start': 0, 'end': 1}

Does this mean that ranges in the query dict would get deprecated by the ranges in the query string?

No, deprecated by ranges expressed using tuples. Unlike alsroot, I want to keep the dictionary-based interface and even see it as the primary interface for by most activities.

Fine with me, though I don't see much value in changing that.

+ except xapian.QueryParserError, exception:
+ logging.warning("Invalid query string: "+exception.get_msg())
+ return Query()

Hmm, if the exception was propagated, the Journal could say that an error has happened, similar to how we say today that the query returned no results. Wouldn't that be more useful feedback than returning the results of a blank query?

I'm not sure, it should probably be discussed with the Design Team. BTW, Query() matches nothing at all, so isn't the same as passing an blank query into the datastore. So the user is already getting some kind of feedback.

You are right. If you make it raise an exception and enter a ticket about it, I will update the journal.

+ queries += [
+ self._parse_query_term(name, prefix, query_dict.pop(name))
+ for (name, prefix) in _query_term_map.items()
+ if name in query_dict]

Please split this in several statements. Not only helps with readability but also keeps future patches more clear.

The only way I could split this into several statements is by doing a regular for loop and adding each item individually using append(). This is probably slower, though I'm not sure if it'll make a real difference.

I think that's fine, readability comes before non-measured performance.

+ self._query_parser = QueryParser()
+ self._query_parser.set_database(self._database)
+ self._term_generator = TermGenerator()

Are these operations expensive enough to cache those objects? Don't remember seeing them when I profiled the DS.

It seemed wasteful to regenerate them every time and very easy to change. Haven't profiled the DS yet.

Not a big deal, but trading memory usage for reduced cpu usage before measuring doesn't seem like a good idea. Also all variables should have the smallest scope they have, for higher cohesion.

I'm trying to improve the review process docs, comments welcome:

An automatic style checker that catches more of your stylistic complaints (string separator, spaces between operators, ...) would be great as it reduces the number of review rounds.

True, I agree it sucks right now, but if you go browse the gnome and mozilla bugzillas for code reviews, you will see that people are expected to be quite careful about style on their first patches. And I have never seen a C/C++ project that had tools to check style, they just have a page explaining their code style guidelines:

http://developer.gnome.org/doc/guides/programming-guidelines/book1.html
https://developer.mozilla.org/En/Mozilla_Coding_Style_Guide

I'm personally ok with the patch, if you get favorable feedback from UI and OLPC people, we can push it as-is.

comment:7 Changed 12 years ago by sascha_silbe

  • Keywords r? added; r- removed

Bing, entering round 4. ;)

My shiny new test suite revealed some bugs that I fixed. One of the side effects of the fixes is that disk space overhead has been reduced (I had previously overlooked a way to do that) - now only the full value of known properties is redundant. We could potentially restrict that to special properties ('activity' and 'uid'), but in general I don't think there's any way around that duplication as we need to efficiently collect the "unique values".

Your dislike for large list comprehensions has also been addressed - for large numbers of "known properties" it might even be faster now.

Changed 12 years ago by sascha_silbe

code for round 4

comment:8 Changed 12 years ago by tomeu

  • Keywords r+ added; r? removed
  • Resolution set to fixed
  • Status changed from new to closed

Pushed, thanks!

comment:9 Changed 8 years ago by dnarvaez

  • Component changed from sugar-datastore to Sugar
Note: See TracTickets for help on using tickets.