Tuesday, January 16, 2018

New Luke on JavaFX

Hello and Happy New Year to my readers!

I'm happy to announce the release of a completely reimplemented Luke -- now built on JavaFX. Luke is a toolbox for analyzing and maintaining your Lucene / Solr / Elasticsearch index at a low level.

The implementation was contributed by Tomoko Uchida, who also did the honors of releasing it.

What makes this release especially exciting is that Luke is now fully compliant with the ALv2 license, which brings it very close to being contributed to the Lucene project. At this point we need lots of testing to make sure the JavaFX version is on par with the original Thinlet-based one.

Here is what the load index screen looks like in the new JavaFX luke:

After navigating to the Solr 7.1 index and pressing OK, here is what luke shows:

I have loaded an index of Finnish wikipedia with 1,069,778 documents, and luke tells me that the index does not have deletions and was not optimized. Let's go ahead and optimize it:

Notice that in this dialog you can request only expunging deleted docs, without a full merge (the costly part for large indices). After the optimization completes, you'll have a full log of actions in front of you to confirm the operation succeeded:

You can also check the health of your index via the Tools -> Check index menu item:

Let's move to the Search tab. It has changed slightly: the search box has moved to the right, while search settings and other knobs have moved to the left.

Thinlet version:

JavaFX version:

The UI is more intuitive now in terms of access to various tools like Analyzer, Similarity (now with access to the parameters of the BM25 ranking model, which became the default in Lucene and in luke) and More Like This. There is a new Sort sub-tab that lets you choose primary and secondary fields to sort on. The Collectors tab, however, is gone: please let us know if you used it for some task -- we would love to learn.
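For reference, the two parameters the Similarity tab exposes, k1 and b, control term-frequency saturation and document-length normalization in BM25. Here is a toy re-implementation of the per-term formula (my own sketch to show what the knobs do, not Lucene's code):

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Toy BM25 score for a single term in one document.

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly long documents are penalized.
    """
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Saturated, length-normalized term frequency.
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A rarer term (lower df) scores higher than a common one, all else equal:
rare = bm25_term_score(tf=3, df=10, num_docs=1_000_000, doc_len=100, avg_doc_len=120)
common = bm25_term_score(tf=3, df=500_000, num_docs=1_000_000, doc_len=100, avg_doc_len=120)
```

Setting b=0 switches length normalization off entirely, which is the kind of experiment the Similarity tab makes easy to try.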

Moving on to the Analysis tab, I'd like to draw your attention to the really cool functionality of loading custom jars with your own implementation of a character filter, tokenizer or token filter to form a custom analyzer. Test these right in the luke UI, without the need to reload shards in your Solr / Elasticsearch installation:
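Conceptually, an analyzer is a chain: character filters normalize the raw text, a tokenizer splits it into tokens, and token filters transform the token stream. A language-agnostic toy sketch of that chain (illustrative only, not Lucene's actual API):

```python
def char_filter(text):
    # Character filter: normalize the raw text before tokenization.
    return text.lower().replace("-", " ")

def tokenizer(text):
    # Tokenizer: split the filtered text into tokens.
    return text.split()

def token_filter(tokens, stopwords=frozenset({"a", "of", "the"})):
    # Token filter: drop stopwords from the token stream.
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    # An analyzer is just the composition of the three stages.
    return token_filter(tokenizer(char_filter(text)))

print(analyze("The state-of-the-art Index"))  # → ['state', 'art', 'index']
```

In luke you swap real Lucene implementations into each of these three slots and immediately see the resulting token stream.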

Last but not least is the Logs tab. You have probably been missing it for as long as luke has existed: it gives you a handle on what's happening behind the scenes, during an error case or normal operation.

In addition, this version of Luke supports the recently released Lucene 7.2.0.

Wednesday, November 1, 2017

Will deep learning make other machine learning algorithms obsolete?

The fourth (fifth?) quoranswer is here! This time we'll talk a bit about deep learning and its role in making other state of the art machine learning methods obsolete.

Will deep learning make other machine learning algorithms obsolete?

I will try to take a look at the question from the natural language processing perspective.

There is a class of problems in NLProc that might not benefit from deep learning (DL), at least not directly. For the same reasons, machine learning (ML) cannot help easily either. I will give three examples, which share more or less the same property that makes them hard to model with ML or DL:

1. Identifying and analyzing sentiment polarity oriented towards a particular object: a person, brand etc. Example: "I like phoneX, but dislike phoneY." If you monitor sentiment for phoneX, you'll expect this message to come out positive, but negative for phoneY. One can argue it is easy / doable with ML / DL, but I doubt you can stay solely within that framework. Most probably you'll need a hybrid with a rule-based system, syntactic parsing etc., which somewhat defeats the purpose of DL: being able to train a neural network on a large amount of data without domain (linguist) knowledge.
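To illustrate the hybrid flavour of the problem, here is a crude rule-based sketch of target-dependent polarity; the tiny lexicon and the contrast rule are made up for the example, and real systems need parsing rather than string splitting:

```python
POS = {"like", "love"}
NEG = {"dislike", "hate"}

def target_polarity(sentence, target):
    """Assign polarity to the clause that mentions the target.

    Splits on the contrastive conjunction "but" -- exactly the kind of
    syntactic cue a sentence-level classifier throws away.
    """
    for clause in sentence.lower().rstrip(".").split("but"):
        words = clause.replace(",", " ").split()
        if target.lower() in words:
            if POS & set(words):
                return "positive"
            if NEG & set(words):
                return "negative"
    return "neutral"

sentence = "I like phoneX, but dislike phoneY."
print(target_polarity(sentence, "phoneX"))  # → positive
print(target_polarity(sentence, "phoneY"))  # → negative
```

A sentence-level classifier would have to pick one label for the whole message; the per-target answer only falls out once you reason about clause structure.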

2. Anaphora resolution. There are systems that use ML (and hence DL can be tried?), like the BART coreference system, but most of the research I have seen so far is based on some sort of rules / syntactic parsing (this presentation is quite useful: Anaphora resolution). There is a vast application area for AR, including sentiment analysis and machine translation (also fact extraction, question answering etc.).

3. Machine translation. Disambiguation, anaphora, object relations, syntax, semantics and more, all in a single soup. Surely you can try to model all of these with ML, but commercial MT systems are still more or less built with rules (+ML recently). I expect DL to produce advancements in MT. I'll cite one paper here that uses DL and improves on phrase-based SMT: [1409.3215] Sequence to Sequence Learning with Neural Networks. Update: a recent fun experiment with DL-based machine translation.

The list can be extended to knowledge bases etc., but I hope I have made my point.

Sunday, October 29, 2017

More fun with Google machine translation

Having posted in the quoranswer tag specifically about machine translation tricks and challenges, and having seen some fun with Mongolian->Russian translation in Google, I decided to experiment with the Mongolian->English pair. To make this work, you need a Cyrillic keyboard and should type only the Russian letter 'а' as input on the Mongolian side. Throughout the text I'll refer to Google Translate as the "neural network" or "network", as Google is known to have switched its translation system to a neural network implementation.

So let's get going. It all starts rather sane:

а   -> a
аа -> ah

And as we stack up more letters on the left, we start getting more interesting translations:

ааа -> Well
аааа -> ahaha
ааааа -> sya
аааааа -> Well
ааааааа -> uh

and skipping a bit:

ааааааааа -> that's all

(At this point you'd imagine that the deep neural network has had its fun teasing you and wants you to stop. But no.)

аааааааааа -> that's ok
аааааааааааааа -> that's fine

ааааааааааааааааа -> everything is fine

ааааааааааааааааааа -> it's a good thing

With a few more letters stacked up, the network begs us to stop again, threatening:

ааааааааааааааааааааааааааааааааааааа -> it's all over

Then, having had enough of statements, the network starts asking questions.

ааааааааааааааааааааааааааааааааааааааааа -> is it a good thing?

and answers its own question:

аааааааааааааааааааааааааааааааааааааааааа -> it's a good thing

A few comments here and there:

ааааааааааааааааааааааааааааааааааааааааааааааааааа -> a good time

аааааааааааааааааааааааааааааааааааааааааааааааааааа -> to have a good time

Eventually, more dictionary entries crop up:

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to a whirlwind

And, unexpectedly:

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a date
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a living

Then, the network starts to output:

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a dicision

And it begs me to put in some sane words instead of the letter nonsense:

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> put your own word

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a whistle-blower

The latter is probably meant as an offence, to add colour to the network's plea.

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a private time in the world

Notice how general the words are: "private", "time", "world". Still, the outputs are grammatical and make sense, even if unlikely as translations.

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a mortal year

And back to begging again:

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a kindness in the world

Again, all my commentary is meant as fun; I'm not trying to (mis)lead you to anything here.

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a dead dog

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> put ā € |

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a deadline

And more threats, again:

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a hash of you

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a mortal beefed up

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a heartbroker

A heartbroker? Really? Something new.

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a hash of a tree

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to put a lot of light on it

And finally, the network gets hungry:

ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to have a meal

And positively concludes:

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a date auspicious

аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a friend of a thousand years

Hope you had fun reading these, and please try some for yourselves.

Saturday, October 28, 2017

What are some funny Google Translate tricks?

This is the third quoranswer blog post, answering the question What are some funny Google Translate tricks? I have decided to update the Google translations based on the current situation. I think they are still a lot of fun. Let me know in the comments if you have come across some funny translations!

There used to be a funny politically coloured trick for Russian->English, where the sense was inverted in translation depending on which presidents' names were used in positive vs negative contexts. I can't reproduce it right now, but GT produces this at the moment:
Обама не при чём, виноват Путин.
human: Obama is innocent, Putin is to blame.
GT: Obama has nothing to do with Putin. (Previously in Aug 4, 2016: "Obama is not to blame, blame Putin.")
Путин не при чём, виноват Обама
human: Putin is innocent, Obama is to blame.
GT: Putin has nothing to do with Obama's fault. (Previously in Aug 4, 2016: "Putin is not being Obama's fault.")

Tuesday, October 24, 2017

What grammatical challenges prevent Google Translate from being more effective?

Here is one more Quora question on the exciting topic of machine translation and my answer to it.

The question had some sub-questions:

  • Is there a set of broad grammatical rules which decreases its efficacy?
  • How can these challenges be overcome? Is it possible to fully automate good quality translation?

Below is my answer; I hope it will be interesting as a look at machine translation across different language pairs. Note that the translations currently given by Google Translate might differ from those below, as they were obtained in 2013. UPD: and they do! See the comments to this post.

Google is pretty good at modeling close enough language pairs. By close enough I mean languages that share many vocabulary units and have similar word order, a similar level of morphological richness and other shared grammatical features.

Let's pick an example of a pair where Google Translate (GT) is good. The round-trip method is one way to verify whether two languages are close enough, at least statistically, for GT:

(these examples use GT only, no human interpretation involved)

English: I am in a shop.
Dutch: Ik ben in een winkel.
back to English: I'm in a store. (quite ok)

English: I danced into the room.
Dutch: Ik danste in de kamer.
back to English: I danced in the room. (preposition issues)
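The round-trip check above can be sketched with a stub translator. The tiny word-for-word dictionaries below are made up for illustration (a real check would call a translation API), but they show how vocabulary drift lowers the round-trip score:

```python
# Hypothetical word-for-word tables, for illustration only.
EN_NL = {"i": "ik", "am": "ben", "in": "in", "a": "een", "shop": "winkel"}
# The reverse table picks a near-synonym, so the round trip drifts slightly.
NL_EN = {"ik": "i", "ben": "am", "in": "in", "een": "a", "winkel": "store"}

def translate(words, table):
    # Word-for-word translation; unknown words pass through unchanged.
    return [table.get(w, w) for w in words]

def round_trip_overlap(sentence):
    """Fraction of original words recovered after EN -> NL -> EN."""
    src = sentence.lower().split()
    back = translate(translate(src, EN_NL), NL_EN)
    return sum(a == b for a, b in zip(src, back)) / len(src)

print(round_trip_overlap("i am in a shop"))  # → 0.8: "shop" came back as "store"
```

Anything well below 1.0 hints that vocabulary or structure drifted in the round trip, though word overlap is of course a very crude proxy for preserved meaning.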

Let's pick a pair of more unrelated languages. (By the way, when we claim that languages are grammatically unrelated, they may also be unrelated semantically or even pragmatically: different languages were created by people to suit their needs at particular moments of history.) One such pair is English and Finnish:

Finnish: Hän on kaupassa.
English: He is in the shop.
back to Finnish: Hän on myymälä. (roughly the original Finnish sentence)

This example has the pronoun hän, which in Finnish is not gender specific. It should be resolved based on a larger context than just one sentence: somewhere earlier in the text, there should have been a mention of who hän refers to.

To conclude this particular example: Google Translate translates at the sentence level, and that is a limitation in itself that makes correct pronoun resolution impossible. Pronouns matter if we want to understand the interaction between the objects in a text.

Let's pick another example of unrelated languages: English and Russian.

Russian: Маска бывает правдивее и выразительнее лица.
English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and expressive than the face)
back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but roughly meaning: The mask being a truthful and expressive face)

To conclude this example: languages with rich morphology, which in the case of Russian convey grammatical case through word inflection alone, require deeper grammatical analysis, which pure statistical machine translation methods lack no matter how much data has been acquired. There exist methods of combining rules and statistics.

Another pair and different example:
English: Reporters said that IBM has bought Lotus.
Japanese: 記者は、IBMがロータスを買っていると述べた。
back to English: The reporter said that IBM Lotus are buying.

Japanese has a "recursive syntax" that represents this English sentence like:

Reporters (IBM Lotus has bought) said that.

i.e. the verb is syntactically placed after the subject-object pair of a sentence or sub-sentence (direct / indirect object).

To conclude this example: there should exist a method of mapping syntax structures as larger units of the language, and that should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).
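As a toy illustration of what a controlled mapping of syntax structures could look like, here is a drastically simplified reordering rule over a (subject, verb, object) triple; the clause and the role labels are made up for the example, and real transfer rules operate on full parse trees:

```python
def reorder(clause, order):
    """Map a (subject, verb, object) triple into a target word order."""
    slots = dict(zip("SVO", clause))
    return [slots[role] for role in order]

clause = ("IBM", "has bought", "Lotus")
print(reorder(clause, "SVO"))  # → English order: ['IBM', 'has bought', 'Lotus']
print(reorder(clause, "SOV"))  # → Japanese order: ['IBM', 'Lotus', 'has bought']
```

The point is that the verb-final placement is a structural rule you can state explicitly, rather than something a purely statistical system has to rediscover from co-occurrence counts.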

Saturday, September 23, 2017

What's a good topic for a bachelor's thesis in Sentiment Analysis?


Over the past few months (soon close to a year) you, my readers, might have noticed a decline in the frequency of my blogging. There are a few reasons, including practical ones (lack of time), but the two most important are:

1. Blogger has not developed much as a tool over time. It probably remains relatively popular and brings in some ad money, so Google has not shut it down. Moving over to medium.com might be a better idea, to produce visually "shinier" posts and actually enjoy writing.

2. There are other interesting and more interactive ways to share one's knowledge. One that I personally like is quora.com. The site offers the reverse model compared to blogging: you answer questions. This way you ensure that at least the questioner will read your answer, and so might other respondents. The rating of your answers is another component, which contributes to your statistics and to an analogue of payment (credits) that you can later use, for instance, to boost your answers to a larger audience. But I would say the latter is of lesser importance to me.

Since I have never actually figured out whether Quora allows you to read posts without being registered, re-posting my answers here from time to time could be a good way to keep this blog alive too.

So here we go (slightly edited version):

What's a good topic for a bachelor's thesis in Sentiment Analysis?

Apart from applying deep neural networks to sentiment analysis, another topic that is exciting from both a research and a practice perspective is sarcasm detection. It goes somewhat beyond sentiment analysis per se, into opinion mining. Sentiment analysis precision and recall are affected by sarcastic posts, because sarcastic posts tend to look positive on the surface (at least to conventional algorithms, ML-based or rule-based), while implying a negative context.
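To make the surface-positive / context-negative mismatch concrete, here is a crude cue-based sketch; the word lists are made up for the example, and real systems (like the behavioral approach in [1]) are far more sophisticated:

```python
POSITIVE = {"great", "stellar", "love"}
# Contextual cues that contradict a positive surface reading.
NEGATIVE_CUES = {"#customersvcfail", "#sarcasm"}

def maybe_sarcastic(tweet):
    """Flag tweets whose surface sentiment is positive but that carry a
    contradicting contextual cue -- the mismatch sarcasm rides on."""
    words = set(tweet.lower().split())
    return bool(words & POSITIVE) and bool(words & NEGATIVE_CUES)

print(maybe_sarcastic("Just had a stellar experience #CustomerSvcFail"))  # → True
print(maybe_sarcastic("Just had a stellar experience, thank you!"))       # → False
```

Even this toy version shows why a plain sentiment classifier fails here: on surface words alone, both tweets look positive.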
There are interesting situations that arise as a result of failing to recognize sarcasm. Borrowing from [1]:

User 1 tweet:

You are doing great! Who could predict heavy travel between #Thanksgiving and #NewYearsEve. And bad cold weather in Dec! Crazy!

Response from a major U.S. Airline:

We #love the kind words! Thanks so much.

User 1:

wow, just wow, I guess I should have #sarcasm

User 2:

Ahhh..**** reps. Just had a stellar experience w them at Westchester, NY last week. #CustomerSvcFail

Response from a major U.S. Airline:

Thanks for the shout-out Bonnie. We’re happy to hear you had a #stellar experience flying with us. Have a great day.

User 2:

You misinterpreted my dripping sarcasm. My experience at Westchester was 1 of the worst I’ve had with ****. And there are many.
[1] A. et al. Sarcasm Detection on Twitter: A Behavioral Modeling Approach

Sunday, October 9, 2016

Luke 6.2.1 release and all things open source


Indeed, luke 6.2.1 for lucene 6.2.1 is out of the oven. This is a proud moment for Tomoko Uchida, my co-committer, who was release manager for the first time. Congrats, Tomoko!


As luke gets more and more stargazers on GitHub (520 at the time of this writing), I tend to glance over the list, which sometimes makes my day. But beyond that, and more importantly, it maps out the community of Lucene / Solr / Elasticsearch users and developers who hopefully enjoy using luke too.

Big names on user list

Having access to the stats of the luke repo gives insight into who might be talking about luke, and when. This time it is PayPal Engineering. Here is their nice technical write-up on indexing lots of data in Elasticsearch, and their use of luke in the field for optimizing lucene index data structures: https://www.paypal-engineering.com/2016/08/10/powering-transactions-search-with-elastic-learnings-from-the-field/

London Lucene/Solr hackday

A hackday is an amazing way to jump out of the routine and think big: what can be improved in the land of Lucene / Solr technology and tooling? It was great to see luke picked up as one topic at the Lucene / Solr hackday in London: https://github.com/flaxsearch/london-hackday-2016. And there it is: Marple, a browser-driven explorer for lucene indexes: https://github.com/flaxsearch/marple. Go check it out.

New contributors to luke

Tomoko and I have been actively promoting luke on various occasions: Lucene / Solr Revolution 2015 and ApacheCon 2015. And of course on Twitter. Recently Florian Hopf has become active in sending pull requests to improve luke and fix various nagging issues. Welcome!