Mamad Purbo

Japanese Full-Text Indexing On PostgreSQL 8.0.x

I couldn't find any English resource on Japanese full-text indexing for PostgreSQL, so I suppose I feel like writing something on the subject. There are only 2 resources (in Japanese) about this that I could find:

  1. Using PostgreSQL 8.2 with new GIN indexing + TSearch2 module. The problem with this article is, it does not have an elegant solution on how to integrate with Japanese lexer to separate words in notorious space-less Japanese sentences. With this solution, basically we have to call the lexer externally (command line), then manually insert the output of the lexer to the database.
  2. Using PostgreSQL 8.0.x with GiST indexing plus an additional module created by NTT for easy integration with Kakasi (a popular, but no longer maintained, Japanese lexer).

Basically I just combined instructions from these 2 sites to make it work on the following environment:


So here's how to do it:

1. Install PostgreSQL more or less normally:

$ cd /opt/source
$ tar xvfz postgresql-8.0.9.tar.gz
$ cd postgresql-8.0.9
$ ./configure --prefix=/opt/pgsql
$ make
$ make install

2. Add environment variables to the .bash_profile as necessary. Mine looks something like this:

PGHOME=/opt/pgsql
LD_LIBRARY_PATH=$PGHOME/lib
PGDATA=$PG_HOME/data
PGPORT=5432

PATH=$PATH:$HOME/bin:$PGHOME/bin
MANPATH=$PGHOME/man:$MANPATH

export PG_HOME
export PGDATA
export PGPORT
export PATH
export LD_LIBRARY_PATH
export MANPATH

3. Start server and create the database. Note that I didn't managed to make it work with UTF-8 or SJIS encoding, possibly due to Kakasi that is using EUC-JP, so I created the database with EUC-JP:

$ initdb --encoding=EUC_JP --no-locale
$ postmaster&
$ createdb test

4. Install TSearch2 module:

$ cd /opt/source/postgresql-8.0.9/contrib/tsearch2
$ make
$ make install
$ cd /opt/pgsql/share/contrib
$ psql -f tsearch2.sql test

5. Install Kakasi:

$ cd /opt/source
$ tar xvzf kakasi-2.3.4.tar.gz
$ cd kakasi-2.3.4
$ ./configure
$ ./make
$ ./make check
$ ./make install

6. Untar NTT's Kakasi interface:

$ cd /opt/source
$ tar xvzf pgkakasiw-1.1.tar.gz
$ cd pgkakasiw-1.1

7. Edit NTT's Kakasi interface Makefile appropriately. Mine looks something like this:

....
top_builddir = /opt/source/postgresql-8.0.9
....

8. Build and install NTT's Kakasi interface:

$ make
$ make install
$ cd /opt/pgsql/share/contrib
$ psql -e -f pgkakasiw.sql test

9. NTT's Kakasi interface test, should look something like:

$ psql test
test=# SELECT pgkakasiw('オープンソース技術交流会に参加する');
pgkakasiw
-----------------------------------------
オープンソース 技術交流 会 に 参加 する
(1 row)

10. Test Kakasi integration with full-text indexing provided by TSearch2 module:

test# create table test_text
test# (
test# jptext_org text,
test# jptext_lex tsvector
test# );
CREATE TABLE

test=# create index ft_test_text_ix
test-# on test_text
test-# using gist(jptext_lex);
CREATE INDEX

test=# insert into test_text(jptext_org) values('オープンソース技術交流会に参加する');
INSERT 17651 1
test=# update test_text set jptext_lex = to_tsvector(pgkakasiw(jptext_org));
UPDATE 1

test=# select
test-# headline(jptext_org,q),
test-# rank(jptext_lex,q)
test-# from
test-# test_text,
test-# to_tsquery('参加|会') as q
test-# where
test-# jptext_lex @@ q
test-# order by
test-# rank(jptext_lex,q) desc
test-# ;
headline | rank
------------------------------------+------
オープンソース技術交流会に参加する | 0.19
(1 row)

11. Think about something nice and be happy or something.
<< Older