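Full-text search with MongoDB + Elasticsearch (failed)
With MongoDB alone, full-text search gets no help from an index, so even our environment, which holds only about 2 million documents, falls over (it really does).
So I decided to install Elasticsearch to make full-text search usable.
Installing Elasticsearch
From Homebrew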
$ brew install elasticsearch
Start it
$ elasticsearch
[2014-06-11 18:05:35,710][INFO ][node ] [Osiris] version[1.2.1], pid[4562], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-11 18:05:35,711][INFO ][node ] [Osiris] initializing ...
[2014-06-11 18:05:35,716][INFO ][plugins ] [Osiris] loaded [], sites []
[2014-06-11 18:05:38,059][INFO ][node ] [Osiris] initialized
[2014-06-11 18:05:38,059][INFO ][node ] [Osiris] starting ...
[2014-06-11 18:05:38,253][INFO ][transport ] [Osiris] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2014-06-11 18:05:41,306][INFO ][cluster.service ] [Osiris] new_master [Osiris][otJX2JoVQuCPkOqY-jSTNQ][Air.local][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2014-06-11 18:05:41,331][INFO ][discovery ] [Osiris] elasticsearch_hentai-kun/otJX2JoVQuCPkOqY-jSTNQ
[2014-06-11 18:05:41,343][INFO ][http ] [Osiris] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2014-06-11 18:05:41,360][INFO ][gateway ] [Osiris] recovered [0] indices into cluster_state
[2014-06-11 18:05:41,360][INFO ][node ] [Osiris] started
Checking the connection
Work in a separate terminal tab
Create an index
$ curl -XPUT 'http://localhost:9200/sandbox'
{"acknowledged":true}%
It worked.
[2014-06-11 18:06:23,623][INFO ][cluster.metadata ] [Osiris] [sandbox] creating index, cause [api], shards [5]/[1], mappings []
It shows up in the server output too.
Analyzing Japanese with the built-in features
$ curl -XGET 'http://localhost:9200/sandbox/_analyze?pretty' -d '貧乳はステータスだ!希少価値だ!'
{
"tokens" : [ {
"token" : "貧",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}, {
"token" : "乳",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}, {
"token" : "は",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<HIRAGANA>",
"position" : 3
}, {
"token" : "ステータス",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<KATAKANA>",
"position" : 4
}, {
"token" : "だ",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<HIRAGANA>",
"position" : 5
}, {
"token" : "希",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}, {
"token" : "少",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 7
}, {
"token" : "価",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 8
}, {
"token" : "値",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<IDEOGRAPHIC>",
"position" : 9
}, {
"token" : "だ",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<HIRAGANA>",
"position" : 10
} ]
}
The katakana word comes out as a single token, but the kanji are split into single characters, which is no good.
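To make that observation easier to check, here is a minimal Python sketch that groups the tokens of an _analyze response by type. The function name tokens_by_type and the trimmed sample response are assumptions made for illustration; only the field names tokens, token, and type come from the output above.

```python
import json
from collections import defaultdict

# Trimmed copy of the _analyze response shown above (first four tokens only).
SAMPLE_RESPONSE = """
{"tokens": [
  {"token": "貧", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 1},
  {"token": "乳", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 2},
  {"token": "は", "start_offset": 2, "end_offset": 3, "type": "<HIRAGANA>", "position": 3},
  {"token": "ステータス", "start_offset": 3, "end_offset": 8, "type": "<KATAKANA>", "position": 4}
]}
"""


def tokens_by_type(response_text):
    """Group token strings by the type the analyzer assigned, keeping order."""
    grouped = defaultdict(list)
    for entry in json.loads(response_text)["tokens"]:
        grouped[entry["type"]].append(entry["token"])
    return dict(grouped)


if __name__ == "__main__":
    for token_type, tokens in tokens_by_type(SAMPLE_RESPONSE).items():
        print(token_type, tokens)
```

Every token under <IDEOGRAPHIC> is one character long, so a search for a multi-kanji word would have to match a sequence of single-character tokens rather than one term.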
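Making Japanese search work
Installing Elasticsearch gives you a command called plugin, which is what you use to add features.
A command name that makes me worry it will clash with something else...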
Japanese (kuromoji) Analysis for Elasticsearch
https://github.com/elasticsearch/elasticsearch-analysis-kuromoji
$ plugin --install elasticsearch/elasticsearch-analysis-kuromoji/2.1.0
-> Installing elasticsearch/elasticsearch-analysis-kuromoji/2.1.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-kuromoji/elasticsearch-analysis-kuromoji-2.1.0.zip...
Downloading ...(omitted)...DONE
Installed elasticsearch/elasticsearch-analysis-kuromoji/2.1.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/analysis-kuromoji
Mapper Attachments Type for Elasticsearch
https://github.com/elasticsearch/elasticsearch-mapper-attachments
$ plugin --install elasticsearch/elasticsearch-mapper-attachments/2.0.0
-> Installing elasticsearch/elasticsearch-mapper-attachments/2.0.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-mapper-attachments/elasticsearch-mapper-attachments-2.0.0.zip...
Downloading ...(omitted)...DONE
Installed elasticsearch/elasticsearch-mapper-attachments/2.0.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/mapper-attachments
MongoDB River Plugin for ElasticSearch
https://github.com/richardwilly98/elasticsearch-river-mongodb
$ plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0
-> Installing com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0...
Trying http://download.elasticsearch.org/com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/elasticsearch-river-mongodb-2.0.0.zip...
Trying http://search.maven.org/remotecontent?filepath=com/github/richardwilly98/elasticsearch/elasticsearch-river-mongodb/2.0.0/elasticsearch-river-mongodb-2.0.0.zip...
Downloading ...(omitted)...DONE
Installed com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/river-mongodb
Restart Elasticsearch
$ elasticsearch
[2014-06-11 18:22:47,324][INFO ][node ] [Brother Nature] version[1.2.1], pid[6069], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-11 18:22:47,325][INFO ][node ] [Brother Nature] initializing ...
[2014-06-11 18:22:47,362][INFO ][plugins ] [Brother Nature] loaded [mapper-attachments, mongodb-river, analysis-kuromoji], sites [river-mongodb]
{1.2.1}: Initialization Failed ...
- ExecutionError[java.lang.NoClassDefFoundError: org/elasticsearch/rest/XContentRestResponse]
NoClassDefFoundError[org/elasticsearch/rest/XContentRestResponse]
ClassNotFoundException[org.elasticsearch.rest.XContentRestResponse]
It fails to start with an error.
After removing the plugins one at a time, it looks like the MongoDB River Plugin is the culprit.
$ plugin --remove com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0
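Removed it for now.
Searching didn't tell me why, so I'll go ahead with the MongoDB setup first.
Configuring MongoDB
The MongoDB River Plugin apparently needs a feature that MongoDB uses for replica sets, so I'll enable it.
Apparently there is no need to actually set up a separate replica server.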
$ mongo
> rs.initiate()
{ "ok" : 0, "errmsg" : "server is not running with --replSet" }
Looks like a startup option is missing.
> exit
$ vi /usr/local/etc/mongod.conf
...
replSet=crawler
Add it to the config file and restart MongoDB.
$ mongo
> rs.initiate()
{
"info2" : "no configuration explicitly specified -- making one",
"me" : "Air.local:27017",
"ok" : 0,
"errmsg" : "couldn't initiate : can't find self in the replset config"
}
Apparently a config has to be passed to initiate() explicitly.
> config = {_id:'crawler', members:[{_id:0, host:'localhost:27017'}]}
> rs.initiate(config)
{
"info" : "Config now saved locally. Should come online in about a minute.",
"ok" : 1
}
Done.
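The config that finally worked has a simple shape: its _id matches the replSet value in mongod.conf, and it has a single member pointing at the local mongod. As a rough sketch, the same document can be built in Python and rendered as a mongo shell command. The helper names single_member_replset_config and initiate_command are made up for this example and are not part of any MongoDB tool.

```python
import json


def single_member_replset_config(set_name, host):
    """Build the document passed to rs.initiate() for a one-member replica set.

    set_name must match the replSet value in mongod.conf ("crawler" above).
    """
    return {"_id": set_name, "members": [{"_id": 0, "host": host}]}


def initiate_command(set_name, host):
    """Render the config as a command string to paste into the mongo shell."""
    config = single_member_replset_config(set_name, host)
    return "rs.initiate(%s)" % json.dumps(config)


if __name__ == "__main__":
    print(initiate_command("crawler", "localhost:27017"))
```

Naming the host explicitly as localhost:27017 is the part that differs from the failed attempt, where the generated config used Air.local:27017.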
Back to Elasticsearch
Configuring Elasticsearch
There seem to be a few settings that are better changed.
$ vi /usr/local/Cellar/elasticsearch/1.2.1/config/elasticsearch.yml
...
cluster.name: elasticsearch_sandbox
Apparently it is a good idea to use a name different from the default, so that nodes started on the same network don't form a cluster on their own.
MongoDB River Plugin for ElasticSearch
$ plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0
I removed it earlier, so install it again.
$ elasticsearch
[2014-06-12 05:27:51,797][INFO ][node ] [James Jaspers] version[1.2.1], pid[16599], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-12 05:27:51,797][INFO ][node ] [James Jaspers] initializing ...
[2014-06-12 05:27:51,811][INFO ][plugins ] [James Jaspers] loaded [analysis-kuromoji], sites []
[2014-06-12 05:27:54,647][INFO ][node ] [James Jaspers] initialized
[2014-06-12 05:27:54,647][INFO ][node ] [James Jaspers] starting ...
[2014-06-12 05:27:54,793][INFO ][transport ] [James Jaspers] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2014-06-12 05:27:57,846][INFO ][cluster.service ] [James Jaspers] new_master [James Jaspers][ApBCEQ8RRpm3PfYx7CPVxw][Air.local][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2014-06-12 05:27:57,870][INFO ][discovery ] [James Jaspers] elasticsearch_sandbox/ApBCEQ8RRpm3PfYx7CPVxw
[2014-06-12 05:27:57,888][INFO ][http ] [James Jaspers] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2014-06-12 05:27:57,912][INFO ][gateway ] [James Jaspers] recovered [0] indices into cluster_state
[2014-06-12 05:27:57,913][INFO ][node ] [James Jaspers] started
This time it started.
Maybe it can't start unless the replica set is enabled on the MongoDB side?
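One way to test that guess is to compare the plugins line of the two startup logs. The sketch below pulls the loaded plugin and site names out of such a line. The regular expression and the function name parse_loaded_plugins are assumptions based only on the two log lines shown in this post, not on any documented log format.

```python
import re

# Matches lines such as:
# [..][INFO ][plugins ] [Node Name] loaded [a, b], sites [c]
PLUGINS_LINE = re.compile(
    r"\[plugins\s*\]\s*\[[^\]]*\]\s*loaded \[([^\]]*)\], sites \[([^\]]*)\]"
)


def _split_names(text):
    """Split a comma-separated list, dropping empty entries."""
    return [name.strip() for name in text.split(",") if name.strip()]


def parse_loaded_plugins(log_line):
    """Return {'plugins': [...], 'sites': [...]} or None if the line does not match."""
    match = PLUGINS_LINE.search(log_line)
    if match is None:
        return None
    return {
        "plugins": _split_names(match.group(1)),
        "sites": _split_names(match.group(2)),
    }


if __name__ == "__main__":
    failed = "[2014-06-11 18:22:47,362][INFO ][plugins ] [Brother Nature] loaded [mapper-attachments, mongodb-river, analysis-kuromoji], sites [river-mongodb]"
    started = "[2014-06-12 05:27:51,811][INFO ][plugins ] [James Jaspers] loaded [analysis-kuromoji], sites []"
    print(parse_loaded_plugins(failed))
    print(parse_loaded_plugins(started))
```

On the failed start the list includes mongodb-river, while on the successful start it contains only analysis-kuromoji. That may mean the river plugin was simply not loaded this time, rather than that the replica set made the difference. This is only a reading of the two logs, not a confirmed cause.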
Analyzing Japanese with kuromoji
$ curl -XGET 'http://localhost:9200/sandbox/_analyze?analyzer=kuromoji&pretty' -d '貧乳はステータスだ!希少価値だ!'
{
"error" : "NoSuchMethodError[org.apache.lucene.util.IOUtils.closeWhileHandlingException(Ljava/lang/Exception;[Ljava/io/Closeable;)V]",
"status" : 500
}
Got an error.
[2014-06-12 05:41:00,023][DEBUG][action.admin.indices.analyze] [Asbestos Man] failed to execute [org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest@7a206749]
java.lang.NoSuchMethodError: org.apache.lucene.util.IOUtils.closeWhileHandlingException(Ljava/lang/Exception;[Ljava/io/Closeable;)V
at org.apache.lucene.analysis.ja.dict.BinaryDictionary.<init>(BinaryDictionary.java:123)
at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:40)
at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:33)
at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:69)
at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:214)
at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:199)
at org.apache.lucene.analysis.ja.JapaneseAnalyzer.createComponents(JapaneseAnalyzer.java:90)
at org.apache.lucene.analysis.CustomAnalyzerWrapper.createComponents(CustomAnalyzerWrapper.java:60)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:227)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:54)
at org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$AsyncSingleAction$2.run(TransportSingleCustomOperationAction.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
And searching the web doesn't turn up a fix either...
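The error message names the missing method by its JVM descriptor, which is hard to read at a glance. The following sketch decodes such a descriptor into ordinary type names. The function decode_method_descriptor is written for this post and handles only the descriptor grammar (primitive codes, L...; class names, and [ for arrays).

```python
# Single-letter type codes used in JVM descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float", "I": "int",
    "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(descriptor, pos):
    """Read one type starting at pos; return (readable_name, next_pos)."""
    dims = 0
    while descriptor[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    code = descriptor[pos]
    if code == "L":  # object type: L<class name with slashes>;
        end = descriptor.index(";", pos)
        name = descriptor[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[code]
        pos += 1
    return name + "[]" * dims, pos


def decode_method_descriptor(descriptor):
    """Return (parameter_types, return_type) for a descriptor like '(I[J)V'."""
    if not descriptor.startswith("("):
        raise ValueError("not a method descriptor: %r" % descriptor)
    pos = 1
    params = []
    while descriptor[pos] != ")":
        name, pos = _read_type(descriptor, pos)
        params.append(name)
    return_type, _ = _read_type(descriptor, pos + 1)
    return params, return_type


if __name__ == "__main__":
    params, ret = decode_method_descriptor("(Ljava/lang/Exception;[Ljava/io/Closeable;)V")
    print("%s closeWhileHandlingException(%s)" % (ret, ", ".join(params)))
```

So the kuromoji code is calling IOUtils.closeWhileHandlingException with an Exception and a Closeable array, returning void, and the IOUtils class loaded at runtime has no method with that signature. A NoSuchMethodError generally means code was compiled against one version of a class and is running against another, so a mismatch between the Lucene version the plugin was built for and the one bundled with this Elasticsearch is a plausible suspect. I have not verified that here.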
So,
this one gets carried over to next time.
I'll write the follow-up once it's solved.