17. 3. 2011

MnogoSearch optimization example

To install mnoGoSearch, follow the instructions provided by the mnoGoSearch T3 plugin. Here are quick notes for Debian systems with a MySQL database.

  1. Put the source files in your favorite location. I usually use /usr/local/src.
  2. In configure.in find the following line and replace 1 with 0:
    AC_DEFINE([HAVE_PGSQL], [1], [Define if you want to use PostgreSQL])
  3. ./configure --prefix=/opt/mnogosearch --disable-mp3 --disable-news --without-debug --with-pgsql=no --with-freetds=no --with-oracle8=no --with-oracle8i=no --with-iodbc=no --with-unixODBC=no --with-db2=no --with-solid=no --with-openlink=no --with-easysoft=no --with-sapdb=no --with-ibase=no --with-ctlib=no --with-zlib --with-mysql --disable-syslog
    (Attention! You have to install all required development packages, e.g. php5-dev,
    libmysqlclient15-dev, etc. Inspect the results of configure.)
  4. Open the file include/udm_autoconf.h and change
    /* #undef HAVE_PGSQL */
    to
    #undef HAVE_PGSQL
  5. make
  6. I prefer to use checkinstall. You can choose the package name mnogosearch, since the native Debian packages have different names. You can also run plain "make install".
    checkinstall make install
  7. cd php
  8. phpize
  9. ./configure --with-mnogosearch=/opt/mnogosearch
  10. Open php_mnogo.c and put
    #undef HAVE_PGSQL
    after the line
    #include "php.h"
  11. checkinstall make install 
    For the name of the package you could use mnogosearch-php
  12. Install the package php5-cli
  13. In the php.ini that is in use, e.g. /etc/php5/apache2/php.ini, add
    extension=mnogosearch.so
  14. Restart apache2
  15. Check that the mnoGoSearch module is loaded:
    echo '<?php phpinfo(); ?>' | php | grep -i mnogosearch
    (Attention! CLI uses /etc/php5/cli/php.ini)
    Search for, "mnoGoSearch Support => enabled"
  16. Create /etc/cron.d/mnogosearch with the following content:
    # /etc/cron.d/mnogosearch: crontab fragment for mnogosearch T3 extension
    # m h dom mon dow       user  command
    0 3 * * * www-data /usr/bin/php5 -q /var/www/www.example.org/data/typo3/cli_dispatch.phpsh mnogosearch -w -n >/dev/null 2>&1
    (Attention! If you run PHP code under a different username, you should replace www-data.)
  17. Add BE user _cli_mnogosearch
  18. Create a MySQL database and a user that will be used to access it
  19. cd  /opt/mnogosearch/etc
  20. cp indexer.conf-dist indexer.conf
  21. Change the following line:
    DBAddr mysql://mnogosearch:YourPassword@localhost:3306/mnogosearch/?dbmode=blob
    (Attention! In the given example, the database and the user are both named mnogosearch.)
  22. Create tables:
    /opt/mnogosearch/sbin/indexer -Ecreate
  23. Install T3 plugin
  24. Create /root_of_your_webpage/robots.txt with the following content:
    User-agent: *
    Disallow: /fileadmin
    Disallow: /typo3
    Allow: /
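
Steps 18 and 21 can be sketched in shell as follows. This is only a minimal sketch: the function names print_create_sql and print_dbaddr are hypothetical helpers, not part of mnoGoSearch, and the values mirror the example above. It assumes a password without characters that would need escaping in SQL or in a URL.

```shell
#!/bin/sh
# Hypothetical helpers for steps 18 and 21; nothing here is shipped
# with mnoGoSearch or the T3 plugin.

# Print SQL that creates the database and a user with access to it.
# Arguments: database name, user name, password.
print_create_sql() {
  db=$1 user=$2 pass=$3
  cat <<SQL
CREATE DATABASE ${db};
CREATE USER '${user}'@'localhost' IDENTIFIED BY '${pass}';
GRANT ALL PRIVILEGES ON ${db}.* TO '${user}'@'localhost';
SQL
}

# Print the DBAddr line for indexer.conf.
# Arguments: user, password, host, port, database.
print_dbaddr() {
  printf 'DBAddr mysql://%s:%s@%s:%s/%s/?dbmode=blob\n' "$1" "$2" "$3" "$4" "$5"
}

# Example usage (not executed by this sketch):
#   print_create_sql mnogosearch mnogosearch 'YourPassword' | mysql -u root -p
#   print_dbaddr mnogosearch 'YourPassword' localhost 3306 mnogosearch
```

Printing the SQL instead of running it lets you review the statements before piping them to the mysql client.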

Setup

Please refer to the user manual of the T3 mnogoSearch plugin for detailed instructions. This is our basic setup:

Configuration type   Indexing path               Indexing method
Server               www.example.org/typo3/      Disallow
Server               www.example.org/projects/   Allow
Server               www.example.org/science/    Allow

First run

Run the indexing and time the execution:

time /usr/bin/php5 -q /var/www/www.example.org/typo3/cli_dispatch.phpsh mnogosearch -n

You can follow the indexing process in the access log of your web server, e.g.:

tail -f /var/log/apache2/www.example.org-access.log

Check the statistics:

/opt/mnogosearch/sbin/indexer -S

During the setup and multiple test runs you might want to clear the index tables. To do so, use:

/usr/bin/php5 -q /var/www/www.example.org/typo3/cli_dispatch.phpsh mnogosearch -x -Cw

If you want to reindex even documents that have not expired, use:

time /usr/bin/php5 -q /var/www/www.example.org/typo3/cli_dispatch.phpsh mnogosearch -n -x -a

The plugin documentation describes the following parameters:

Parameter        Description
-c               Only check and create database if necessary. Do not reindex.
-d               Display generated indexer configuration and exit.
-n               Force reindexing of new URLs (normally should be set).
-p pid           Process indexing configuration only from this pid.
-w               Create statistic for misspelled words. Useful only if Ispell dictionaries are included to mnoGoSearch configuration (see mnoGoSearch documentation).
--dry-run        Show what will be done (not applicable to -d and -E).
-h, --help, -?   Display this help message.
-x               Pass the argument to mnoGoSearch indexer.
-v level         Be verbose. Level is 0-5. Default is 0 (complete silence).

You can list all parameters of indexer with:

/opt/mnogosearch/sbin/indexer -h

A good explanation can be found on the product's site.

Problems

The mnoGoSearch extension creates configuration files in /tmp on your server. Since they include the password to your database, it could be wise to delete these files. However, they can be useful for studying the effects of your BE rules on the final configuration.
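
A minimal sketch for locating such files, assuming that the generated configuration contains a DBAddr line like the one in indexer.conf (the actual file names used by the extension are not known here, so the search goes by content). The function name list_dbaddr_files is a hypothetical helper.

```shell
#!/bin/sh
# List regular files directly inside a directory that contain a DBAddr line.
# Argument: directory to scan (defaults to /tmp).
# Assumption: the generated configuration files carry the database
# password in a line that starts with "DBAddr".
list_dbaddr_files() {
  dir=${1:-/tmp}
  find "$dir" -maxdepth 1 -type f \
    -exec grep -l '^[[:space:]]*DBAddr[[:space:]]' {} + 2>/dev/null
}

# Example usage (not executed by this sketch):
#   list_dbaddr_files /tmp
```

Review the list first, and delete the files only once you no longer need them for studying the generated rules.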

Resources

  1. Mnogosearch plugin documentation
  2. Mnogosearch documentation

tt_news

The mnoGoSearch plugin's manual suggests that tt_news articles should probably be indexed from the database. This holds for sites that have only one Single view page, or where you don't mind that all search results point to the same Single view page. If search results should point to the Single view page that is determined by the first news category of the article, or if you have any other "more complex" tt_news setup, you can still index tt_news articles as normal pages.

If we use Facebook Like or any other social networking tool to spread the word about our articles, each article should have only one Single view page (a unique URL). Keeping the URLs for the same content as unique as possible seems reasonable. This is also good practice from the search engines' point of view, e.g., you get more "points" from Google, although this can also be guided with the canonical tag. But that way we are dealing with the consequences and not the source of the "problem", so it should be avoided if possible (IMHO).

When indexing news, we wish to exclude content elements that would result in "false positives" during the search. Which elements these are will depend on your information infrastructure. In most cases they will include the titles and abstracts of articles in the List views that are used as teasers on various pages around the site. We want mnoGoSearch's results to point only to the Single view of articles that contain the search term.

The quick list:

  1. After the initial run, exclude Single view pages from search. Go to page properties and select "Disable" in the "Include in Search" option in the "Behaviour" tab. If search results return two hits for each article, one with the title of the article and one with the title of the Single view page, you should recheck if you have excluded this page from search.
  2. Use the <!--TYPO3SEARCH_begin--> and <!--TYPO3SEARCH_end--> to mark parts of the page that should be indexed.

If you exclude Single view pages before the initial run, the Single view pages of your articles will not get indexed. The first point is kind of a hack, I suppose. Keep it in mind, since it might cause problems if you clear the mnoGoSearch tables with URLs. For the second point we use the principle of indexing everything that is not excluded. Therefore, we begin the content on the page with <!--TYPO3SEARCH_begin--> and we put <!--TYPO3SEARCH_end--> at the end. Next, we exclude the parts that we do not want to index. Take care that the markers are not nested. Here is an example for the List view template:

<!-- ###TEMPLATE_LIST### begin
  This is the template for the list of news-->
    <!--TYPO3SEARCH_end-->
      <div class="news_list">
        <!-- ###CONTENT### begin
          This is the part with the list of news:-->
          <!-- ###NEWS### begin
            Template for a single item-->
            <div class="news_item">
              <div class="date">###NEWS_DATE###</div>
              <div class="title"><!--###LINK_ITEM###-->
                 ###NEWS_TITLE###<!--###LINK_ITEM###-->
              </div>
            </div>
            <!-- ###NEWS### end-->
        <!-- ###CONTENT###  end -->
      </div>
    <!--TYPO3SEARCH_begin-->
<!-- ###TEMPLATE_LIST### end -->
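
To check a rendered page, the toggling of the markers can be approximated with a small shell filter. This is a sketch only: it assumes the simple model described above (content counts as indexed between a begin marker and the next end marker) and is not the parser used by the plugin or by mnoGoSearch. The function names marker_lines, indexed_parts and markers_alternate are hypothetical.

```shell
#!/bin/sh
# Put every TYPO3SEARCH marker on a line of its own (reads stdin).
marker_lines() {
  sed -e 's/<!--TYPO3SEARCH_begin-->/\
&\
/g' -e 's/<!--TYPO3SEARCH_end-->/\
&\
/g'
}

# Print only the non-empty lines that lie between a begin marker
# and the following end marker.
indexed_parts() {
  marker_lines | awk '
    /<!--TYPO3SEARCH_begin-->/ { inside = 1; next }
    /<!--TYPO3SEARCH_end-->/   { inside = 0; next }
    inside && NF               { print }
  '
}

# Exit with status 0 if begin and end markers strictly alternate,
# and with status 1 if the same marker occurs twice in a row (nesting).
markers_alternate() {
  marker_lines | awk '
    /<!--TYPO3SEARCH_begin-->/ { if (last == "begin") bad = 1; last = "begin" }
    /<!--TYPO3SEARCH_end-->/   { if (last == "end")   bad = 1; last = "end" }
    END { exit bad + 0 }
  '
}

# Example usage (not executed by this sketch), with a saved copy of a page:
#   indexed_parts < page.html
#   markers_alternate < page.html || echo "markers are nested"
```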

More on the specification of the web space for indexing ... Perhaps I should mention that mnoGoSearch got caught in a loop during the setup when a dozen rules of type Realm with comparison type String were used. When I optimized the rules and used regular expressions, the loop was gone. I did try to run the indexer manually, but did not research the problem any deeper, since the final setup was working for me (at this time ;-).

t3blog

Mark unwanted content

This is done with the markers <!--TYPO3SEARCH_begin--> and <!--TYPO3SEARCH_end-->. We decided to exclude all lists with the following TypoScript:

plugin.tx_t3blog_pi1 {
  views {
    list.10.wrap = <!--TYPO3SEARCH_end--> <h2>|</h2> <!--TYPO3SEARCH_begin-->
    list.20.wrap = <!--TYPO3SEARCH_end--> <div class="news_list"> | </div> <!--TYPO3SEARCH_begin-->
  } 
  blogList {
    singleNavigation.wrap = <!--TYPO3SEARCH_end--> <div id="singleNavigation">|</div> <!--TYPO3SEARCH_begin-->
  }
  archive {  
    listWrap.10.dataWrap = <!--TYPO3SEARCH_end--> <ul id="archive_{field:id}" class="{field:class}"> |  </ul> <!--TYPO3SEARCH_begin-->
  }
  latestCommentsNav {
    list.10.wrap = <!--TYPO3SEARCH_end--><h2>|</h2><!--TYPO3SEARCH_begin-->
    list.20.wrap = <!--TYPO3SEARCH_end--><div class="news_list">|</div><!--TYPO3SEARCH_begin--> 
  }
} 

Remove date from the URLs

We noticed that links in the "singleNavigation" section use the date of the current post. Consequently, each post exists at every date on which we have published a post. This is probably just a glitch, since snowflake's blog renders these links correctly. We did not investigate the matter, since we wanted to remove dates from the URLs anyway. Hmm, perhaps this shouldn't be part of this post.

You can remove that part of the URL with the following TypoScript in your plugin.tx_t3blog_pi1 definition (see also "Customizing T3blog"):

 
plugin.tx_t3blog_pi1 {
  blogList {
    titleLink.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    single.moreLink.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    textRow.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    commentsLink.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    singleNavTitleLink.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    comment.30.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:blogUid}&tx_t3blog_pi1[blogList][editCommentUid]={field:uid}
  }
  views {
    list.30.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    link.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
  }
  archive {
    titleLink.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
  }
  latestCommentsNav {
    link.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
  }
  latestPostNav {
    list.30.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
    link.10.typolink.additionalParams.dataWrap = &tx_t3blog_pi1[blogList][showUid]={field:uid}
  }
}

Unfortunately, we also had to change
typo3conf/ext/t3blog/pi1/widgets/blogList/class.blogList.php.
Find the function getTrackbackLink and comment out the date parameters like this:

 
$trackBackParameters = t3lib_div::implodeArrayForUrl('tx_t3blog_pi1', array(
  'blogList' => array(
    /* 'day' => sprintf('%02d', $dateInfo['mday']),
        'month' => sprintf('%02d', $dateInfo['mon']),
        'year' => $dateInfo['year'],*/
    'showUid' => $uid,
    'trackback' => 1
   )
));

If you are using permalinks, you should comment out the same parameters in the function getPermalink. A request to move this to TypoScript has already been published on forge.typo3.org. You can find some additional discussion there. My opinion is that URLs should be unique as long as this does not limit some "important" functionality.

Calendar

An additional source of duplicated content is the calendar. When a user browses through the calendar, the URL changes, but the content stays the same. Since we are not using social bookmarking on the List view, the duplicated content is not so problematic if we take into account the additional functionality that the calendar provides.

For indexing purposes, this can be overcome with a slightly different RealURL configuration that enables us to limit indexing to Single view pages. We have moved the translation of the date outside the 'blog post'. I have left the definitions in the Slovenian language, so you can learn something new and at the same time check the URLs on our page.

'datum' => array(
    'leto' => array(
       'GETvar' => 'tx_t3blog_pi1[blogList][year]',
     ),
     'mesec' => array(
        'GETvar' => 'tx_t3blog_pi1[blogList][month]' ,
     ),
     'dan' => array(
        'GETvar' => 'tx_t3blog_pi1[blogList][day]',
     ),
),
'zapis' => array(
  'zapis' => array (
    'GETvar' => 'tx_t3blog_pi1[blogList][showUid]',
    'lookUpTable' => array(
      'table' => 'tx_t3blog_post',
      'id_field' => 'uid',
      'alias_field' => 'uid',
      'addWhereClause' => ' AND deleted !=1 AND hidden !=1',
      'useUniqueCache' => 1,
      'useUniqueCache_conf' => array(
        'strtolower' => 1,
        'spaceCharacter' => '-',
      )
    )
  )
),

Now, clear the configuration cache and add the following rules to the mnoGoSearch configuration:

Configuration type   Indexing path    Indexing method   Comparison type   Description
Realm                */blog/zapis/*   Allow             String            Allow indexing of Single view pages.
Realm                *blog*           Disallow          String            Disallow everything else.

The first record should be listed before the second record in the list view of your mnogoSearch indexing configuration.
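
Why the order matters can be illustrated with a shell case statement, which also takes the first matching pattern. This is only an analogy under the assumption that the rules are evaluated top to bottom with the first match winning. Shell globs merely approximate mnoGoSearch String comparison, and the function names and example URLs are hypothetical.

```shell
#!/bin/sh
# Rules in the recommended order: the specific Allow rule comes first.
indexing_method() {
  case $1 in
    */blog/zapis/*) echo Allow ;;            # rule 1: Single view pages
    *blog*)         echo Disallow ;;         # rule 2: everything else in the blog
    *)              echo "no rule matched" ;;
  esac
}

# The same rules in the wrong order: the broad Disallow rule
# shadows the Allow rule, so Single view pages are never indexed.
indexing_method_swapped() {
  case $1 in
    *blog*)         echo Disallow ;;
    */blog/zapis/*) echo Allow ;;
    *)              echo "no rule matched" ;;
  esac
}

# Example usage (not executed by this sketch):
#   indexing_method "http://www.example.org/blog/zapis/12/"
#   indexing_method_swapped "http://www.example.org/blog/zapis/12/"
```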

It would probably be wise to define a canonical URL for the List view, but I am not sure if this is really necessary, since the algorithms of search engines can handle such cases, see Google Webmaster: Specify your canonical.