Apache Solr -- Optimize the search results | TYPO3worx

Search experience is a key factor for the acceptance of an on-site search. Apache Solr provides several options to influence and optimize the sorting of the search results. This post shows seven ways to achieve this with the TYPO3 Solr extension.

The default query settings are defined in solrconfig.xml. They can easily be modified using TypoScript. Since version 8 of the Apache Solr extension, all settings of the edismax query parser are supported. In the following sections, I will briefly describe the possibilities.

Query fields and boosts per field

The query fields define which fields of the index are searched for the search term. The default query fields for TYPO3 are contained in the following string:

content^40.0 title^5.0 keywords^2.0 tagsH1^5.0 tagsH2H3^3.0 tagsH4H5H6^2.0 tagsInline^1.0

Any field existing in the index can be used for the search. On the other hand, not every field in the index has to be searched: some fields only hold content that is needed to render the search result, like pictures or links to them.

In TypoScript the configuration for query fields looks like this:

plugin.tx_solr.search.query.queryFields = title^50.0, tagsH1^30.0, tagsH2H3^5.0, tagsH4H5H6^3.0, content^20.0, tagsInline^1.0, keywords^15.0, lastname_textEdgeNgramS^100.0, navTitle^5.0, description^50.0

Compared to the default, some more fields were added for searching and the boost factors were adjusted. A peculiarity of the TypoScript syntax is that the query fields are separated by commas.

The search term is looked up in each field. If the term is found, a score is calculated from several measures, such as the overall number of documents, the number of documents containing the term, the overall number of occurrences of the term, and the frequency of the term within one document. This score is multiplied by the boost factor, which is appended to the field name with a ^.
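To make the multiplication concrete, here is a minimal Python sketch. It parses a comma-separated query field string like the one above and applies the boosts to per-field scores. The helper name parse_query_fields and the raw scores are made up for illustration; Solr computes the real scores internally.

```python
def parse_query_fields(value):
    """Parse 'title^50.0, content^20.0' into {'title': 50.0, 'content': 20.0}."""
    fields = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, boost = item.partition("^")
        # A field without an explicit boost is treated as boost 1.0 here
        fields[name] = float(boost) if boost else 1.0
    return fields


boosts = parse_query_fields("title^50.0, content^20.0, keywords^15.0")

# Hypothetical raw relevance scores for one document (not real Solr output)
raw_scores = {"title": 0.2, "content": 0.9}

# Each field score is multiplied by the boost factor of its field
boosted = {field: score * boosts[field] for field, score in raw_scores.items()}
best_field = max(boosted, key=boosted.get)

print(boosted)
print(best_field)
```

With these made-up numbers the content field ends up with the highest boosted score, although title has the larger boost factor.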

Tie parameter

By default, only the highest score of all subqueries (fields) defines the position of the record in the search result. This somewhat contradicts what is usually said about search engines: that they take all fields into account when calculating the position within the search result.

The tie parameter lets us mimic that behavior to some extent. As the default value of tie is 0.0, only the highest score of all subqueries is used for the final score. If the tie parameter is set to a value higher than zero, the scores of all other subqueries are multiplied by this factor and added to the highest score.

The range of the tie parameter is between 0.0 and 1.0. The Apache Solr documentation states that “Typically a low value, like 0.1, is useful”.
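As a rough illustration of how tie changes the final score, the following Python sketch combines per-field scores as described above: the highest score plus tie times the sum of all other scores. The function name and the scores are hypothetical and not taken from Solr.

```python
def combined_score(field_scores, tie=0.0):
    """Highest field score plus tie times the sum of the remaining scores."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Hypothetical boosted scores of one document in three fields
scores = [8.0, 3.0, 1.0]

for tie in (0.0, 0.1, 1.0):
    print(tie, combined_score(scores, tie))
```

With tie = 0.0 only the best field counts, with tie = 1.0 all field scores are simply summed up, and a low value like 0.1 lets the other fields act as a tie breaker.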

Boost queries

Boost queries can be used, for example, to push certain record types up in the result set. Imagine searching for a person whose name appears quite often in older news entries. There is also a person record containing contact data and consultation hours. With the standard configuration the old news will win, because the keyword density is higher than in the person record. A boost query can push the person record above the news.

The TypoScript parameter is boostQuery. It is used as in this code snippet:

plugin.tx_solr.search.query.boostQuery = (type:person)^10

Boost functions

A boost function applies, as the name suggests, a function to the score depending on a parameter. Typically this is used to give more recent news a higher rank than older ones.

plugin.tx_solr.search.query.boostFunction = {!func}recip(ms(NOW,myproject_DateTime_tDateS),3.16e-11,1,1)

This code applies the recip function to all matching documents. If there is a value in the field myproject_DateTime_tDateS, the difference between NOW and this timestamp is calculated in milliseconds. One year is about 3.16e10 milliseconds, so the factor 3.16e-11 (its inverse) scales this difference to roughly one unit per year. The last two values define the size of the boost and how fast it drops as the distance between NOW and the field value grows. With both set to 1, the boost is 1 for a brand-new document and about 0.5 after one year; the higher both values are, the more slowly the boost drops. The mathematical equation behind that notation is:

Solr notation: recip(x,m,a,b)
Calculation:   a/(m*x+b)

recip is just one of many, many possible functions. A complete list is available in the Apache Solr documentation: https://lucene.apache.org/solr/guide/6_6/function-queries.html
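To get a feeling for the numbers in the recip example, the formula a/(m*x+b) can be evaluated by hand. The following Python sketch does this for documents of different ages. The constant MS_PER_YEAR and the chosen ages are my own assumptions for illustration; Solr evaluates the function itself at query time.

```python
# One year in milliseconds, about 3.16e10 (assuming 365.25 days per year)
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same calculation as Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


for years in (0, 1, 2, 10):
    age_ms = years * MS_PER_YEAR
    boost = recip(age_ms, 3.16e-11, 1, 1)
    print(f"{years:>2} years old: boost {boost:.3f}")

# Larger, equal values for a and b flatten the curve
print(f"a=b=2, 1 year old: boost {recip(MS_PER_YEAR, 3.16e-11, 2, 2):.3f}")
```

A brand-new document gets a boost of 1, a one-year-old document roughly 0.5, and a ten-year-old one less than 0.1. With a and b both set to 2, the one-year-old document still keeps about two thirds of the boost.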

Phrases

Phrase settings boost documents in which the search terms occur close to each other in the specified fields. The maximum distance between the search terms, measured in words, is defined by the phrase slop parameters. If the terms are within that distance, the result document receives a special boost.

There are three different phrase and slop fields available. The “normal” phrase field takes all search terms into account. The bigram and trigram phrases build pairs or triplets from the search terms and check whether the condition of the phrase slop is met. To make this clearer, here is a commented code example:

plugin.tx_solr.search.query {

	// Enables the phrase search
	phrase = 1

	// sets the fields for the phrases
	phrase.fields = content^10.0, title^10.0

	// defines by how many positions the terms may be moved
	// and still receive the boost
	phrase.slop = 5

	// Enables phrases for pairs of search terms
	bigramPhrase = 1

	// sets the fields for the phrases
	bigramPhrase.fields = description^10.0, abstract^10.0

	// defines by how many positions the terms may be moved
	// and still receive the boost
	bigramPhrase.slop = 

	// same as with the bigram settings, but Apache Solr will
	// build triplets of the search terms
	trigramPhrase = 1
	trigramPhrase.fields = 
	trigramPhrase.slop = 
}
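To illustrate which word groups the bigram and trigram phrases are built from, here is a small Python sketch. The function name word_groups and the sample query are made up; Solr builds these groups internally, so this only shows the idea.

```python
def word_groups(query, size):
    """Return all groups of `size` neighbouring search terms."""
    terms = query.split()
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]


query = "apache solr typo3 extension"

print(word_groups(query, 2))  # pairs, as used by the bigram phrase
print(word_groups(query, 3))  # triplets, as used by the trigram phrase
```

For the four search terms above this yields three pairs and two triplets. Each of them is checked against the configured fields with the respective slop.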

MinimumMatch

The minimum match parameter defines how many of the search terms must occur in a document for it to count as a search result. There are several ways to set this value. It is possible to define a concrete number of matching terms or a percentage of them. If a negative number is specified, it states how many of the search terms given by the user may be missing. And last but not least, these can be combined for further flexibility.

The Apache Solr documentation has a very good discussion and explanation of this topic: https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter
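The simple variants can be sketched in a few lines of Python. The function below is my own simplified reading of the rules in the Solr documentation (percentages are rounded down, negative values state how many terms may be missing). It is not Solr's implementation, and combined specifications such as 3<90% are deliberately not handled.

```python
def required_matches(spec, term_count):
    """Number of search terms that must match for a simple mm value."""
    spec = spec.strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # The share computed from the percentage is rounded down
        part = term_count * abs(percent) // 100
        needed = part if percent >= 0 else term_count - part
    else:
        number = int(spec)
        needed = number if number >= 0 else term_count + number
    # Simplification: never require more terms than the user entered
    return max(0, min(term_count, needed))


for spec in ("2", "-1", "75%", "-25%"):
    print(spec, "->", required_matches(spec, 5))
```

For a query with five search terms, the value 2 requires two matching terms, -1 requires four, 75% requires three and -25% requires four.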

Elevation

The last possibility to boost a document is the so-called elevation. Elevation means that certain documents are put at the top of the search result, regardless of the score they have for the given query.

First, the elevation must be activated via TypoScript:

plugin.tx_solr.search.elevation = 1

Furthermore, the elevation requires an XML document in the same directory where solrconfig.xml resides. The XML document has the following format:

<elevate>
  <query text="foo bar">
	<doc id="1" />
	<doc id="2" />
	<doc id="3" />
  </query>

  <query text="ipod">
	<doc id="MA147LL/A" />  <!-- put the actual ipod at the top -->
	<doc id="IW-02" exclude="true" /> <!-- exclude this cable -->
  </query>
</elevate>

For the query “foo bar”, this example XML always displays the documents 1, 2 and 3 at the first three positions. After these, the organic search results are displayed. For the query “ipod”, the first document is put at the top and the second one is excluded from the result set.
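The effect on the result list can be pictured with a short Python sketch. The function apply_elevation and the document id F8V7067 are hypothetical; the sketch only mimics the ordering described above and is not how Solr implements elevation.

```python
def apply_elevation(organic_ids, elevated_ids, excluded_ids=()):
    """Put elevated documents first and drop excluded ones."""
    excluded = set(excluded_ids)
    top = [doc for doc in elevated_ids if doc not in excluded]
    rest = [doc for doc in organic_ids if doc not in excluded and doc not in top]
    return top + rest


# Hypothetical organic result order for the query "ipod"
organic = ["IW-02", "F8V7067", "MA147LL/A"]

print(apply_elevation(organic, elevated_ids=["MA147LL/A"], excluded_ids=["IW-02"]))
```

The elevated document moves to the first position, the excluded cable disappears, and the remaining documents keep their organic order.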

Conclusion

As you can see, there is a wide variety of options to enhance the search result experience for your users and customers. This also means that getting a reasonable search experience is not a one-stop solution. It is rather a process of understanding the search users and discussing the results with the customer. If you have decent knowledge of how the score is calculated and how you can test and debug the query settings, the discussions are much easier. The next post of this series will cover this topic.

Credits

I found the blog post image on Pixabay. It was published by Tomasz Proszek under the CC0 public domain license. I modified it using Pablo by Buffer.