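How to Build an Adult Site

I have been building an adult site since November 2013, learning as I go.

Full-Text Search with MongoDB + Elasticsearch (Failed)

MongoDB alone cannot use an index for full-text search, so even my setup, with only about 2 million documents, falls over (it really does die).

So, to make full-text search usable, I decided to install Elasticsearch.

Installing Elasticsearch

From Homebrew: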

$ brew install elasticsearch

Start it:

$ elasticsearch
[2014-06-11 18:05:35,710][INFO ][node                     ] [Osiris] version[1.2.1], pid[4562], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-11 18:05:35,711][INFO ][node                     ] [Osiris] initializing ...
[2014-06-11 18:05:35,716][INFO ][plugins                  ] [Osiris] loaded [], sites []
[2014-06-11 18:05:38,059][INFO ][node                     ] [Osiris] initialized
[2014-06-11 18:05:38,059][INFO ][node                     ] [Osiris] starting ...
[2014-06-11 18:05:38,253][INFO ][transport                ] [Osiris] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2014-06-11 18:05:41,306][INFO ][cluster.service          ] [Osiris] new_master [Osiris][otJX2JoVQuCPkOqY-jSTNQ][Air.local][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2014-06-11 18:05:41,331][INFO ][discovery                ] [Osiris] elasticsearch_hentai-kun/otJX2JoVQuCPkOqY-jSTNQ
[2014-06-11 18:05:41,343][INFO ][http                     ] [Osiris] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2014-06-11 18:05:41,360][INFO ][gateway                  ] [Osiris] recovered [0] indices into cluster_state
[2014-06-11 18:05:41,360][INFO ][node                     ] [Osiris] started

Checking the connection

Do this in a separate terminal tab.

Create an index:

$ curl -XPUT 'http://localhost:9200/sandbox'
{"acknowledged":true}

It worked.

[2014-06-11 18:06:23,623][INFO ][cluster.metadata         ] [Osiris] [sandbox] creating index, cause [api], shards [5]/[1], mappings []

The server log shows it too.
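The same index creation could also be scripted instead of typed into curl. A minimal Python sketch, assuming the node listens on http://localhost:9200 as in the startup log; the function names are made up for illustration and are not part of any Elasticsearch client library:

```python
import json
import urllib.request


def build_create_index_request(index, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    return urllib.request.Request(f"{base_url}/{index}", method="PUT")


def create_index(index, base_url="http://localhost:9200"):
    """Send the request; needs a running Elasticsearch node.

    The curl call in this post returned {"acknowledged":true}.
    urlopen raises HTTPError if the server answers with an error status.
    """
    req = build_create_index_request(index, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling create_index("sandbox") would be the scripted equivalent of the curl line, under those assumptions.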

Analyzing Japanese with the built-in analyzer

$ curl -XGET 'http://localhost:9200/sandbox/_analyze?pretty' -d '貧乳はステータスだ!希少価値だ!'
{
  "tokens" : [ {
    "token" : "貧",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "<IDEOGRAPHIC>",
    "position" : 1
  }, {
    "token" : "乳",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "<IDEOGRAPHIC>",
    "position" : 2
  }, {
    "token" : "は",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "<HIRAGANA>",
    "position" : 3
  }, {
    "token" : "ステータス",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<KATAKANA>",
    "position" : 4
  }, {
    "token" : "だ",
    "start_offset" : 8,
    "end_offset" : 9,
    "type" : "<HIRAGANA>",
    "position" : 5
  }, {
    "token" : "希",
    "start_offset" : 10,
    "end_offset" : 11,
    "type" : "<IDEOGRAPHIC>",
    "position" : 6
  }, {
    "token" : "少",
    "start_offset" : 11,
    "end_offset" : 12,
    "type" : "<IDEOGRAPHIC>",
    "position" : 7
  }, {
    "token" : "価",
    "start_offset" : 12,
    "end_offset" : 13,
    "type" : "<IDEOGRAPHIC>",
    "position" : 8
  }, {
    "token" : "値",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "<IDEOGRAPHIC>",
    "position" : 9
  }, {
    "token" : "だ",
    "start_offset" : 14,
    "end_offset" : 15,
    "type" : "<HIRAGANA>",
    "position" : 10
  } ]
}

Katakana comes out properly as one word, but kanji is no good: every character becomes its own token.
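To make the pattern in that output concrete, here is a rough Python imitation of what the default analyzer did to the sample sentence: katakana runs stay together, while hiragana and kanji are emitted one character at a time and punctuation is dropped. This is only a sketch that reproduces the token list above using Unicode block ranges; it is not how Elasticsearch or Lucene is actually implemented, and the function name is made up.

```python
def rough_standard_tokens(text):
    """Roughly imitate the token split shown in the _analyze output."""
    tokens = []
    katakana_run = ""
    for ch in text:
        if "\u30a0" <= ch <= "\u30ff":
            # Katakana block (includes the long-vowel mark): keep the run together.
            katakana_run += ch
            continue
        if katakana_run:
            tokens.append(katakana_run)
            katakana_run = ""
        if "\u3040" <= ch <= "\u309f" or "\u4e00" <= ch <= "\u9fff":
            # Hiragana and common kanji: one token per character.
            tokens.append(ch)
        # Anything else (punctuation such as "!") is dropped.
    if katakana_run:
        tokens.append(katakana_run)
    return tokens


print(rough_standard_tokens("貧乳はステータスだ!希少価値だ!"))
```

With per-character kanji tokens, a word like 希少価値 is indexed as four unrelated characters, which is why a morphological analyzer is needed for Japanese.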

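Enabling Japanese search

Installing Elasticsearch gives you a command called plugin, which you use to add features.

It is the kind of command name that makes me nervous about clashing with something else...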

Japanese (kuromoji) Analysis for Elasticsearch

https://github.com/elasticsearch/elasticsearch-analysis-kuromoji

$ plugin --install elasticsearch/elasticsearch-analysis-kuromoji/2.1.0
-> Installing elasticsearch/elasticsearch-analysis-kuromoji/2.1.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-kuromoji/elasticsearch-analysis-kuromoji-2.1.0.zip...
Downloading ...(略)...DONE
Installed elasticsearch/elasticsearch-analysis-kuromoji/2.1.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/analysis-kuromoji

Mapper Attachments Type for Elasticsearch

https://github.com/elasticsearch/elasticsearch-mapper-attachments

$ plugin --install elasticsearch/elasticsearch-mapper-attachments/2.0.0
-> Installing elasticsearch/elasticsearch-mapper-attachments/2.0.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-mapper-attachments/elasticsearch-mapper-attachments-2.0.0.zip...
Downloading ...(略)...DONE
Installed elasticsearch/elasticsearch-mapper-attachments/2.0.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/mapper-attachments

MongoDB River Plugin for ElasticSearch

https://github.com/richardwilly98/elasticsearch-river-mongodb

$ plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0
-> Installing com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0...
Trying http://download.elasticsearch.org/com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/elasticsearch-river-mongodb-2.0.0.zip...
Trying http://search.maven.org/remotecontent?filepath=com/github/richardwilly98/elasticsearch/elasticsearch-river-mongodb/2.0.0/elasticsearch-river-mongodb-2.0.0.zip...
Downloading ...(略)...DONE
Installed com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0 into /usr/local/Cellar/elasticsearch/1.2.1/plugins/river-mongodb

Restart Elasticsearch

$ elasticsearch
[2014-06-11 18:22:47,324][INFO ][node                     ] [Brother Nature] version[1.2.1], pid[6069], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-11 18:22:47,325][INFO ][node                     ] [Brother Nature] initializing ...
[2014-06-11 18:22:47,362][INFO ][plugins                  ] [Brother Nature] loaded [mapper-attachments, mongodb-river, analysis-kuromoji], sites [river-mongodb]
{1.2.1}: Initialization Failed ...
- ExecutionError[java.lang.NoClassDefFoundError: org/elasticsearch/rest/XContentRestResponse]
  NoClassDefFoundError[org/elasticsearch/rest/XContentRestResponse]
    ClassNotFoundException[org.elasticsearch.rest.XContentRestResponse]

It fails to start with an error.

After removing the plugins one at a time, the MongoDB River Plugin looks like the culprit.

$ plugin --remove com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0

Removed for now.
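The "remove them one at a time" procedure can be written down as a small loop. In this Python sketch, starts_ok is a hypothetical callback standing in for "install exactly these plugins, start elasticsearch, and see whether it comes up", which I did by hand; the plugin names are the directory names from the install output.

```python
def find_offending_plugins(plugins, starts_ok):
    """Return the plugins whose removal alone lets the node start.

    starts_ok(remaining) is a hypothetical check supplied by the caller:
    it should return True if the node starts with only `remaining` installed.
    """
    culprits = []
    for candidate in plugins:
        remaining = [p for p in plugins if p != candidate]
        if starts_ok(remaining):
            culprits.append(candidate)
    return culprits


if __name__ == "__main__":
    installed = ["analysis-kuromoji", "mapper-attachments", "river-mongodb"]
    # Stand-in for the manual test: pretend startup fails whenever
    # river-mongodb is present, which is what I observed.
    fake_check = lambda remaining: "river-mongodb" not in remaining
    print(find_offending_plugins(installed, fake_check))
```

This only isolates a single bad plugin; if two plugins were broken at once, no single removal would make startup succeed and the list would come back empty.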

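Searching did not turn up an answer, so I'll move on to the MongoDB configuration first.

Configuring MongoDB

The MongoDB River Plugin apparently needs a feature that MongoDB uses for replica sets (presumably the oplog), so I'll enable it.

Apparently there is no need to actually set up a separate replica server.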

$ mongo
> rs.initiate()
{ "ok" : 0, "errmsg" : "server is not running with --replSet" }

Apparently a startup option is missing.

> exit
$ vi /usr/local/etc/mongod.conf

...
replSet=crawler

Add it to the config file and restart MongoDB.

$ mongo
> rs.initiate()
{
    "info2" : "no configuration explicitly specified -- making one",
    "me" : "Air.local:27017",
    "ok" : 0,
    "errmsg" : "couldn't initiate : can't find self in the replset config"
}

Apparently initiate() needs to be given a parameter.

> config = {_id:'crawler', members:[{_id:0, host:'localhost:27017'}]}
> rs.initiate(config)
{
   "info" : "Config now saved locally.  Should come online in about a minute.",
     "ok" : 1
}

Done.
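For reference, the shape of that config document is simple enough to generate. In the config that worked, _id matches the replSet value from mongod.conf, and the member is listed as localhost:27017 rather than the Air.local:27017 that the auto-generated config tried; my guess from the "can't find self" message is that mongod must be able to recognize the listed host as itself. A small Python sketch that builds the same document and prints a line to paste into the mongo shell (the helper name is made up):

```python
import json


def replica_set_config(set_name, hosts):
    """Build a replica set config document like the one passed to rs.initiate()."""
    return {
        "_id": set_name,  # same name as replSet in mongod.conf
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


config = replica_set_config("crawler", ["localhost:27017"])
# Print a line that can be pasted into the mongo shell.
print("rs.initiate(%s)" % json.dumps(config))
```

A single-member list like this gives a one-node replica set, which matches the "no separate replica server needed" setup used here.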

Back to Elasticsearch

Configuring Elasticsearch

Apparently there are some settings that are worth changing.

$ vi /usr/local/Cellar/elasticsearch/1.2.1/config/elasticsearch.yml
...
cluster.name: elasticsearch_sandbox

It is apparently a good idea to use a name other than the default, so that nodes started on the same network do not cluster together on their own.
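One way to confirm which cluster a node actually joined is to look at the discovery line in the startup log, or to ask the node itself. A hedged Python sketch, assuming the node is on http://localhost:9200 and relying on the cluster_name field of the /_cluster/health response; the function names and the sample body are illustrative only:

```python
import json
import urllib.request


def cluster_name_from_health(body):
    """Extract cluster_name from a /_cluster/health JSON response body."""
    return json.loads(body)["cluster_name"]


def fetch_cluster_name(base_url="http://localhost:9200"):
    """Ask a running node for its cluster name (needs Elasticsearch up)."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health") as resp:
        return cluster_name_from_health(resp.read().decode("utf-8"))


# Hypothetical, trimmed response body used only to exercise the parser.
sample = '{"cluster_name": "elasticsearch_sandbox", "status": "green"}'
print(cluster_name_from_health(sample))
```

If fetch_cluster_name() returns something other than the name set in elasticsearch.yml, the node is probably reading a different config file than the one edited.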

MongoDB River Plugin for ElasticSearch

$ plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.0

I removed it earlier, so I install it again.

$ elasticsearch
[2014-06-12 05:27:51,797][INFO ][node                     ] [James Jaspers] version[1.2.1], pid[16599], build[6c95b75/2014-06-03T15:02:52Z]
[2014-06-12 05:27:51,797][INFO ][node                     ] [James Jaspers] initializing ...
[2014-06-12 05:27:51,811][INFO ][plugins                  ] [James Jaspers] loaded [analysis-kuromoji], sites []
[2014-06-12 05:27:54,647][INFO ][node                     ] [James Jaspers] initialized
[2014-06-12 05:27:54,647][INFO ][node                     ] [James Jaspers] starting ...
[2014-06-12 05:27:54,793][INFO ][transport                ] [James Jaspers] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2014-06-12 05:27:57,846][INFO ][cluster.service          ] [James Jaspers] new_master [James Jaspers][ApBCEQ8RRpm3PfYx7CPVxw][Air.local][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2014-06-12 05:27:57,870][INFO ][discovery                ] [James Jaspers] elasticsearch_sandbox/ApBCEQ8RRpm3PfYx7CPVxw
[2014-06-12 05:27:57,888][INFO ][http                     ] [James Jaspers] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2014-06-12 05:27:57,912][INFO ][gateway                  ] [James Jaspers] recovered [0] indices into cluster_state
[2014-06-12 05:27:57,913][INFO ][node                     ] [James Jaspers] started

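This time it started.

Maybe Elasticsearch cannot start unless the replica set is enabled on the MongoDB side? That said, the startup log above lists only analysis-kuromoji under loaded plugins, so the river plugin may not actually have been loaded this time.

Analyzing Japanese with kuromoji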

$ curl -XGET 'http://localhost:9200/sandbox/_analyze?analyzer=kuromoji&pretty' -d '貧乳はステータスだ!希少価値だ!'
{
  "error" : "NoSuchMethodError[org.apache.lucene.util.IOUtils.closeWhileHandlingException(Ljava/lang/Exception;[Ljava/io/Closeable;)V]",
  "status" : 500
}

It returned an error.

[2014-06-12 05:41:00,023][DEBUG][action.admin.indices.analyze] [Asbestos Man] failed to execute [org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest@7a206749]
java.lang.NoSuchMethodError: org.apache.lucene.util.IOUtils.closeWhileHandlingException(Ljava/lang/Exception;[Ljava/io/Closeable;)V
    at org.apache.lucene.analysis.ja.dict.BinaryDictionary.<init>(BinaryDictionary.java:123)
    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:40)
    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.<init>(TokenInfoDictionary.java:33)
    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary$SingletonHolder.<clinit>(TokenInfoDictionary.java:69)
    at org.apache.lucene.analysis.ja.dict.TokenInfoDictionary.getInstance(TokenInfoDictionary.java:62)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:214)
    at org.apache.lucene.analysis.ja.JapaneseTokenizer.<init>(JapaneseTokenizer.java:199)
    at org.apache.lucene.analysis.ja.JapaneseAnalyzer.createComponents(JapaneseAnalyzer.java:90)
    at org.apache.lucene.analysis.CustomAnalyzerWrapper.createComponents(CustomAnalyzerWrapper.java:60)
    at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:180)
    at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:227)
    at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:54)
    at org.elasticsearch.action.support.single.custom.TransportSingleCustomOperationAction$AsyncSingleAction$2.run(TransportSingleCustomOperationAction.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

A Java stack trace shows up on the server side too.

And searching does not turn up a solution either...
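In general, a Java NoSuchMethodError or NoClassDefFoundError means that some code was compiled against a different version of a library than the one loaded at runtime. I have not confirmed that this is the cause here, but one thing worth checking is whether each plugin version is meant for the Elasticsearch version in use. The Python sketch below only gathers the two version numbers from strings that already appear in this post, so they can be compared against whatever compatibility notes each plugin publishes; the helper names are made up.

```python
import re


def parse_plugin_spec(spec):
    """Split an 'org/name/version' string as passed to `plugin --install`."""
    org, name, version = spec.split("/")
    return {"org": org, "name": name, "version": version}


def elasticsearch_version(log_line):
    """Pull the version out of a startup log line containing 'version[...]'."""
    match = re.search(r"version\[([^\]]+)\]", log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = "[2014-06-12 05:27:51,797][INFO ][node] [James Jaspers] version[1.2.1], pid[16599]"
    spec = "elasticsearch/elasticsearch-analysis-kuromoji/2.1.0"
    plugin = parse_plugin_spec(spec)
    print("Elasticsearch", elasticsearch_version(line))
    print(plugin["name"], plugin["version"])
```

The script decides nothing by itself; it just puts the numbers side by side before looking up the plugin's documentation.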

So

I'm carrying this over to next time.

I'll write a follow-up once it is solved.

The follow-up is now written.