Japanese Full-Text Indexing On PostgreSQL 8.0.x
04 Dec 2006- Using PostgreSQL 8.2 with new GIN indexing + TSearch2 module. The problem with this article is, it does not have an elegant solution on how to integrate with Japanese lexer to separate words in notorious space-less Japanese sentences. With this solution, basically we have to call the lexer externally (command line), then manually insert the output of the lexer to the database.
- Using PostgreSQL 8.0.x with GiST indexing plus an additional module created by NTT for easy integration with Kakasi (a popular, but no longer maintained, Japanese lexer).
Basically I just combined instructions from these 2 sites to make it work on the following environment:
- OS: Red Hat Linux release 9 (Shrike), Kernel 2.4.20-8
- Database: PostgreSQL 8.0.9 + TSearch2 module + NTT's Kakasi interface module that can be downloaded from here.
- Lexer: Kakasi 2.3.4
So here's how to do it:
1. Install PostgreSQL more or less normally:
$ cd /opt/source
$ tar xvfz postgresql-8.0.9.tar.gz
$ cd postgresql-8.0.9
$ ./configure --prefix=/opt/pgsql
$ make
$ make install
PGHOME=/opt/pgsql
LD_LIBRARY_PATH=$PGHOME/lib
PGDATA=$PG_HOME/data
PGPORT=5432
PATH=$PATH:$HOME/bin:$PGHOME/bin
MANPATH=$PGHOME/man:$MANPATH
export PG_HOME
export PGDATA
export PGPORT
export PATH
export LD_LIBRARY_PATH
export MANPATH
$ initdb --encoding=EUC_JP --no-locale
$ postmaster&
$ createdb test
$ cd /opt/source/postgresql-8.0.9/contrib/tsearch2
$ make
$ make install
$ cd /opt/pgsql/share/contrib
$ psql -f tsearch2.sql test
$ cd /opt/source
$ tar xvzf kakasi-2.3.4.tar.gz
$ cd kakasi-2.3.4
$ ./configure
$ ./make
$ ./make check
$ ./make install
$ cd /opt/source
$ tar xvzf pgkakasiw-1.1.tar.gz
$ cd pgkakasiw-1.1
....
top_builddir = /opt/source/postgresql-8.0.9
....
$ make
$ make install
$ cd /opt/pgsql/share/contrib
$ psql -e -f pgkakasiw.sql test
$ psql test
test=# SELECT pgkakasiw('オープンソース技術交流会に参加する');
pgkakasiw
-----------------------------------------
オープンソース 技術交流 会 に 参加 する
(1 row)
test# create table test_text
test# (
test# jptext_org text,
test# jptext_lex tsvector
test# );
CREATE TABLE
test=# create index ft_test_text_ix
test-# on test_text
test-# using gist(jptext_lex);
CREATE INDEX
test=# insert into test_text(jptext_org) values('オープンソース技術交流会に参加する');
INSERT 17651 1
test=# update test_text set jptext_lex = to_tsvector(pgkakasiw(jptext_org));
UPDATE 1
test=# select
test-# headline(jptext_org,q),
test-# rank(jptext_lex,q)
test-# from
test-# test_text,
test-# to_tsquery('参加|会') as q
test-# where
test-# jptext_lex @@ q
test-# order by
test-# rank(jptext_lex,q) desc
test-# ;
headline | rank
------------------------------------+------
オープンソース技術交流会に参加する | 0.19
(1 row)