TurnKey Linux Virtual Appliance Library

Lucene search engine for MediaWiki

Hello,

I've been using MediaWiki for a while now and found the Lucene search engine very good for searching PDFs (and other documents) uploaded to the wiki.

http://sourceforge.net/projects/ezmwlucene/

http://www.mediawiki.org/wiki/Extension:EzMwLucene

I recently discovered TurnKey Linux and noticed how much of the configuration is already done for MediaWiki. So much so that I've transferred my company wiki over to it.

Unfortunately, ezmwlucene isn't that easy to get going. I never managed to get the script to stop and start in a controlled way, but I can get it going in an 'ad hoc' sort of way.

However, I think the search capability of ezmwlucene is better than the default search tool on TurnKey Linux. I'm not sure exactly what you're using, but it is itself better than the default MediaWiki search.

Anyway, given that your recent blog post called for improvements, I was wondering if you guys might consider putting your server config skills to work on the MediaWiki appliance and look at a way to integrate this function.

Keep up the great work.

Alon Swartz's picture

Extension currently experimental

According to the MW link provided, the extension is currently experimental, and on SF it's tagged as beta, so I am doubtful it will be included and enabled by default in the upcoming release.

But, it does seem useful so I've added it to my todo/explore list, and will take a closer look when working on the upcoming MW release.

Thanks for sharing.

Extension currently experimental

Thanks for taking the time to check out the links to ezmwlucene.  I appreciate your concerns about being beta and/or experimental.

From my perspective, all I can say is that it works and finds more 'hits' in my PDFs than the other search tools that I've used. It's just that my Linux skills aren't quite up to rewriting the stop/start scripts or creating a .deb package. It does work with the OpenJDK Java normally available via apt-get, so that is one less complication.

Thanks again. You guys are onto something with this virtual appliance thing.


Lucene-search is a search engine designed to index and search MediaWiki content on large websites. It is based on the Lucene search API and extends it to provide ranking based on number of backlinks, distributed searching and indexing, parsing of wikitext, incremental updates, etc. This is the search engine currently used on Wikimedia wikis.

MediaWiki can use Extension:LuceneSearch (pre 1.13) or Extension:MWSearch (1.13+) to fetch results from this search engine.

Note: This extension is designed for large wikis - smaller sites may want to consider Extension:SphinxSearch.

Lucene Search vs ezmwlucene

Hi,

Thanks for the comment. Lucene-Search is already in place in the MediaWiki virtual appliance and is what I am using now. It just doesn't search the attached PDFs as well as the ezmwlucene engine.

Cheers,


Liraz Siri's picture

Above comment posted by a new type of spam bot

It seems to use some kind of search engine to populate the thread with what looks like a relevant response but is really just output from a mindless automaton. The giveaway was a broken signature (since deleted) advertising SEO services. I've blocked the user. The only reason I'm keeping this comment around is because it's somewhat on topic, there is already a reply, and I want to use this as an example of comments to be deleted on sight. Just in case we see more of this.

Jeremy's picture

How crazy is that!

An on-topic spambot!?! With potentially useful legitimate links! The mind boggles!

Liraz Siri's picture

It's only going to get worse

You know, I wouldn't be surprised if eventually we get spam bots that can pass the Turing test. As soon as that happens I'll have to figure out whether I want to join the resistance or bow down to our new AI spambot overlords and accept them into the TurnKey community. Maybe we can teach them to TKLPatch...

Jeremy's picture

Hmm now that's a worry!

It sounds like eventually it will be the unpredictable stupid things people do and say (rather than the intelligent, thoughtful things) that will be the only way to differentiate between humans and AI!?

Although I must say I think it'd be great if we could get the spambots doing TKL dev work! Perhaps TKL needs to embrace the future and run with it?! Maybe you could add it to your extensive 'to do' list: TKLPatchBot :)

[Getting off topic] The evolution of advertising is interesting (and a little scary IMO). I was recently talking to a younger person (mid-20s, iProprietary gadget lover, etc) about it and they actually thought that many of my concerns about targeted marketing are what make it "great"! They were of the opinion that it "makes life better and easier"! I'm still not convinced but must admit that I was blown away by a whole different perspective on the issue that I had never considered (beyond hearing similar sentiments from marketing companies with an obvious vested interest). I'm still not convinced that this person hasn't been brainwashed but I don't really consider them a 'sucker' either...

I had wondered....

Thanks for the comment.  The reply struck me as somewhat mechanical but I didn't twig it could be a spam bot.

As a follow-up: I do have the Java-based ezmwlucene search engine working on my wiki and, with help from other sources, created my own stop and start scripts for the service. So now, if our server admins reboot the TurnKey MediaWiki, the search engine restarts itself and life continues :)
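
For anyone facing the same problem, a PID-file wrapper is one common way to get a controlled start and stop for a long-running Java process. The sketch below is illustrative only and is not the actual scripts mentioned above: the function names, the PID file location and the command passed to `start` are all assumptions, and it relies on POSIX signals. Hooking `start` into the boot sequence is left out.

```python
import os
import signal
import subprocess
import time
from pathlib import Path


def read_pid(pid_file: Path):
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        return int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def is_running(pid: int) -> bool:
    """Return True if a process with this PID is still alive (POSIX only)."""
    try:
        # If the process is our own finished child, reap it so that it
        # does not linger as a zombie and look alive to the check below.
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False
    except ChildProcessError:
        pass  # not our child; fall through to the signal check
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def start(command, pid_file: Path) -> int:
    """Start command in the background unless it is already running."""
    pid = read_pid(pid_file)
    if pid is not None and is_running(pid):
        return pid
    proc = subprocess.Popen(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the calling terminal
    )
    pid_file.write_text(str(proc.pid))
    return proc.pid


def stop(pid_file: Path, timeout: float = 10.0) -> bool:
    """Send SIGTERM and wait; return True once the service is not running."""
    pid = read_pid(pid_file)
    if pid is not None and is_running(pid):
        os.kill(pid, signal.SIGTERM)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline and is_running(pid):
            time.sleep(0.1)
        if is_running(pid):
            return False  # still alive; keep the PID file for a retry
    pid_file.unlink(missing_ok=True)
    return True
```

Usage would look something like `start(["java", "-jar", "/path/to/search-server.jar"], Path("/var/run/search.pid"))`, where both the jar path and the PID file path are placeholders to be replaced with whatever your install actually uses.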

Cheers,


Mission Accomplished

A useful spambot: reminds me of http://xkcd.com/810/

Jeremy's picture

Hehe!

Nice one! :D
