Saturday, September 23, 2017

What's a good topic for a bachelor's thesis in Sentiment Analysis?


Over the past few months (soon close to a year) you, my readers, might have noticed a decline in the frequency of my blogging. There are a few reasons, including practical ones (lack of time), but the two most important are:

1. Blogger has not developed much as a tool over time. It probably remains relatively popular and brings in some ad money, so Google has not shut it down. Moving over to another platform might be a better idea in order to produce visually "shinier" posts and actually enjoy writing.

2. There are other interesting and more interactive ways to share one's knowledge. One of them, which I personally like, is Quora. The site offers a reverse model compared to blogging: you answer questions. This way you ensure that at least the questioner will read your answer, and other respondents might too. Ratings of your answers are another component: they contribute to your statistics and earn you credits, a kind of payment, which you can later use, for instance, to boost your answers to a larger audience. But I would say the latter is of lesser importance to me.

Since I have never actually figured out whether Quora allows you to read posts without being registered, re-posting my answers here from time to time could also be a good way to keep this blog alive.

So here we go (slightly edited version):

What's a good topic for a bachelor's thesis in Sentiment Analysis?

Apart from applying deep neural networks to sentiment analysis, which is exciting in itself, another topic that is exciting from both a research and a practical perspective is sarcasm detection. It goes somewhat beyond sentiment analysis per se and into opinion mining. The precision and recall of sentiment analysis suffer on sarcastic posts, because such posts tend to look positive on the surface (at least to conventional algorithms, whether ML-based or rule-based) while implying a negative meaning.
There are interesting situations that arise as a result of failing to recognize sarcasm. Borrowing from [1]:

User 1 tweet:

You are doing great! Who could predict heavy travel between #Thanksgiving and #NewYearsEve. And bad cold weather in Dec! Crazy!

Response from a major U.S. Airline:

We #love the kind words! Thanks so much.

User 1:

wow, just wow, I guess I should have #sarcasm

User 2:

Ahhh..**** reps. Just had a stellar experience w them at Westchester, NY last week. #CustomerSvcFail

Response from a major U.S. Airline:

Thanks for the shout-out Bonnie. We’re happy to hear you had a #stellar experience flying with us. Have a great day.

User 2:

You misinterpreted my dripping sarcasm. My experience at Westchester was 1 of the worst I’ve had with ****. And there are many.

[1] A. et al. Sarcasm Detection on Twitter: A Behavioral Modeling Approach.
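
To make the failure mode in the exchange above concrete, here is a minimal sketch of a surface-level, lexicon-based classifier. The word lists and the function name `surface_sentiment` are assumptions made up for illustration; this is not the method from [1] nor any particular library's API.

```python
import re

# Toy lexicons: the word lists are illustrative assumptions, not a real resource.
POSITIVE = {"great", "love", "stellar", "happy"}
NEGATIVE = {"worst", "terrible", "awful"}


def surface_sentiment(text):
    """Label text by counting lexicon hits, ignoring context entirely."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# The sarcastic tweet and its literal follow-up from the conversation above.
sarcastic = ("Just had a stellar experience w them at Westchester, NY "
             "last week. #CustomerSvcFail")
literal = "My experience at Westchester was 1 of the worst I've had."

print(surface_sentiment(sarcastic))  # fooled by the word "stellar"
print(surface_sentiment(literal))
```

With these toy lexicons the sarcastic tweet comes out as "positive" and only the literal follow-up as "negative", which is the same mistake the airline's reply makes. A sarcasm-aware approach would need signals beyond individual words, for instance the contrast between "stellar" and the #CustomerSvcFail hashtag.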

Sunday, October 9, 2016

Luke 6.2.1 release and all things open source


Indeed, luke 6.2.1 for lucene 6.2.1 is out of the oven. This is a proud moment for Tomoko Uchida, my co-committer, who has been the release manager for the first time. Congrats, Tomoko!


As luke gets more and more stargazers on github (520 at the time of this writing), I tend to glance over the list, which sometimes makes my day. But beyond that, and more importantly, the list outlines the community of Lucene / Solr / Elasticsearch users and developers who hopefully enjoy using luke too.

Big names on user list

Having access to the stats of the luke repo gives insight into who might be talking about luke, and when. This time, it is PayPal Engineering, with a nice technical writeup on indexing lots of data in Elasticsearch and on using luke in the field to optimize the lucene index data structures.

London Lucene/Solr hackday

A hackday is an amazing way to jump out of the routine and think big: what can be improved in the search land of Lucene / Solr technology and tooling? It was great to see that luke was picked up as one of the topics at the Lucene / Solr hackday in London. And there it is: Marple, a browser-driven explorer for lucene indexes. Go check it out.

New contributors to luke

Tomoko and I have been actively promoting luke on various occasions, including Lucene / Solr Revolution 2015 and ApacheCon 2015, and of course on Twitter. Recently Florian Hopf has become active in sending pull requests to improve luke and fix various nagging issues. Welcome!

Wednesday, April 13, 2016

Luke 6.0 has been released

#luke 6.0 has been released. Major upgrade to #lucene 6.0 api:

There are other interesting features cooking, like access to DocValues:

If you feel like contributing, either by code or documentation, feel free to join the project:

Wednesday, December 30, 2015

Apache Solr Enterprise Search Server -- Third edition

This year gave me a chance to be a technical reviewer of a book on search engines. The title is Apache Solr Enterprise Search Server, and it has now seen the light in its third edition. The first edition, back in 2010, helped me start thinking in a NoSQL way, even though SQL was literally everywhere (well, and still is). It does take a bit of mind warping to think beyond relational database lingo and data modelling, and in my opinion doing so is rather useful for your career as a software engineer.

Here goes my review on Amazon:

This book, in its first edition, was the first one around back in 2010 that covered Apache Solr in as much detail as I needed to get into the topic quickly. This third edition includes revisions for Apache Solr 5, notably covering things like the Solr admin page, SolrCloud, scaling the search engine for large numbers of documents, text analysis, indexing, search and even map-reducing your Solr index! In particular, throwing a MapReduce job at a large-scale indexing task has been hard or unclear in the past, and now it is available to any user of Apache Solr out of the box. This makes books like this immensely important, so that you do not waste time looking around for useful bits of information scattered here and there. More importantly, the authors of the book are directly involved in the project, either as Apache Solr / Lucene committers or as active practitioners and developers of the technology. So I recommend this book to entry-level and mid-level search engineers who want to get their hands dirty with search problems and / or improve on previously untapped areas of the search engine world.

Sunday, October 11, 2015

[ANNOUNCE] Luke 5.3.0 released: naturally runs on Java 8

This release runs on Java 8 and does not run on Java 7.

This release includes a number of pull requests and github issues. Worth mentioning:
#38 upgrade to 5.3.0 itself
#28 Added LUKE_PATH env variable to
#35 Added copy, cut, paste etc. shortcuts, using Mac command key
#34 Fixed lastAnalyzer retrieval (this feature remembers the last used analyzer on the Search tab)
#31 200 stargazers on github (by the time of this release the number crossed 260). Luke community is growing.

Everybody is welcome to contribute. If you care about search / indexing or would like to dig deeper into Apache Lucene, go ahead and pick a ticket:
And, don't be afraid, we do not have any complaint departments:

All you need is your favourite beverage and a good debugger.

Wednesday, July 8, 2015

[ANNOUNCE] Luke 5.2.0 released

This is a major release supporting lucene / solr 5.2.0. Download the zip here:

It supports elasticsearch 1.6.0 (lucene 4.10.4)
Issues fixed:
#20 Added support for reconstructing field values of indexed and not stored fields, that do not expose positions.
Pull requests:
#23 Elasticsearch support and Shade plugin for assembly
#26 added .gitignore to project
#27 Lucene 5x support
#28 Added LUKE_PATH env variable to
#30 Luke 5.2

I'd like to highlight the contribution of Tomoko Uchida, who has recently been very active in sending pull requests, including the upgrade to lucene 5.x and the first version of an Apache Pivot based luke UI.

Wednesday, April 15, 2015

Luke gets support for Elasticsearch indices

That is that, really: the long-awaited proper support for elasticsearch indices.

Luke supported Apache Solr indices already. Why not Elasticsearch? The reason was that ES uses its own SPI for the postings format. If you tried to open an Elasticsearch index with luke before, you'd get something like:

A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene41]

The biggest issue with supporting a custom SPI is that you'd need to hack the luke jar binary and add the ES SPI yourself. I bet that is not what you want to spend your time on.

Thanks to the excellent pull request by apakulov, luke now uses the Maven Shade plugin, which does all the magic: it updates the in-binary META-INF/services file with the following entry:


Currently this is available on luke master and as a pre-release.
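
To illustrate what updating the META-INF/services file amounts to, here is a rough sketch in Python of merging service-provider files from several jars. It only mimics the idea: the helper names (`make_jar`, `read_providers`, `merge_providers`) and the provider class names are hypothetical placeholders, and this is not the Shade plugin's actual implementation. The only things taken from the post are the SPI type `org.apache.lucene.codecs.PostingsFormat` and the `META-INF/services` location.

```python
import io
import zipfile

# Service file path for the SPI type named in the error message above.
SERVICE_FILE = "META-INF/services/org.apache.lucene.codecs.PostingsFormat"


def make_jar(entries):
    """Build an in-memory jar (a zip archive) holding one service file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        jar.writestr(SERVICE_FILE, "\n".join(entries) + "\n")
    return buf.getvalue()


def read_providers(jar_bytes):
    """Return provider class names listed in the jar's service file."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        if SERVICE_FILE not in jar.namelist():
            return []
        text = jar.read(SERVICE_FILE).decode("utf-8")
    providers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if line:
            providers.append(line)
    return providers


def merge_providers(jars):
    """Concatenate provider lists from all jars, skipping duplicates."""
    merged = []
    for jar_bytes in jars:
        for provider in read_providers(jar_bytes):
            if provider not in merged:
                merged.append(provider)
    return merged


# Hypothetical provider class names, for illustration only.
luke_jar = make_jar(["example.lucene.Lucene41PostingsFormat"])
es_jar = make_jar([
    "# provider contributed by the ES jar (placeholder name)",
    "example.es.Es090PostingsFormat",
    "example.lucene.Lucene41PostingsFormat",
])

merged = merge_providers([luke_jar, es_jar])
print("\n".join(merged))
```

The point of the sketch is that a single combined jar needs one service file listing the providers from every bundled jar. If the bundled file only carried luke's own entries, a lookup of the 'es090' name would fail with the error shown earlier.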