Introduction to sentiment analysis

Thomas Aglassinger

Agenda

  • Some background
  • What is sentiment analysis
  • Text parts: sentences and tokens
  • Topics and ratings
  • Special topics (slang terms, typos, emojis, ...)

Some background

About me

Why talk about sentiment analysis?

  • I needed a topic for a master's thesis
  • Former colleague who founded a startup: "We have these unstructured German text data that we'd like to analyze"
  • Quick glance
    • People write books about natural language processing (NLP)
    • We have Python (import re)
    • There are packages to deal with NLP (spaCy, nltk, ...)
  • "Good enough, let's get started!"

About TeLLers

  • Guests in inns / restaurants give feedback to the innkeeper
    • not public
    • login optional
  • Mobile web application for guest (Angular)
  • Web application for owner (Django)
  • Homepage: https://tellers.co.at/

TeLLers feedback

  • Structured feedback (yes/no, "on a scale from 1 to 10")
    • simple to analyze
  • Unstructured text feedback
    • difficult to summarize and compare over time
    • "What did you enjoy the must during your visit?"
    • "How can we improve our products and service?"
    • "Is there anything else you want to tell us?"

Example screenshot

TeLLers screenshot

Innkeeper wants rough answers for...

  • How do people feel about certain areas of my business?
  • What are the perceived strengths of my business ("unique selling propositions")?
  • What do I need to improve?

And he wants to...

  • Get this at a glance.
  • If necessary, drill down into specific feedbacks.

Other applications of sentiment analysis

  • Service ticket systems
  • Preprocessing of customer email
  • Product reviews
  • Vibe from social media and forums

What is sentiment analysis?

  • "systematically identify, extract, quantify, and study affective states and subjective information". Source: https://en.wikipedia.org/wiki/Sentiment_analysis
  • Collects opinions from text written in natural language and stores them in a structured way
  • Different levels:
    • Document
    • Sentence (possibly multiple per document)
    • Aspect (possibly multiple per sentence)

(Sometimes also called: sentiment detection, opinion mining etc.)

Document level

  • Example: product review sites
  • Several sentences describe the opinion
  • Summarize the document in one rating, e.g. 3 stars out of 5 or thumbs up/down
  • Not very useful if there is both good and bad
    • You want to preserve the good parts (or improve even further)
    • You want to fix the bad parts

Sentence level

  • Split the document into sentences.
  • Example: “The schnitzel is too small for a hungry student”.

What's a schnitzel?

Schnitzel Image by User: Benreis at wikivoyage shared, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22713889

Aspect level

  • Multiple aspects in the same sentence
  • Example: "The schnitzel tastes very well but it is too small."
    • Schnitzel/taste = good
    • Schnitzel/size = bad

Definition: Opinion (deluxe edition)

  • Example: “The schnitzel is too small for a hungry student” (Hans Meier, 2018-04-28, 13:12 UTC)
  • Consists of:
    • Target entity: schnitzel
    • Aspect: size
    • Sentiment: bad
    • Opinion holder: Hans Meier
    • Posting time: 2018-04-28, 13:12 UTC
    • Reason: “too small”
    • Qualifier: “for a hungry student” → might be fine for others
  • Reference: Bing Liu, “Sentiment Analysis”, Cambridge University Press, 2015, p. 22ff
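
For illustration, such an opinion could be captured in a small typed container. This is a hypothetical sketch; the field names are mine, not from Liu or the talk's code:

from datetime import datetime
from typing import NamedTuple

class Opinion(NamedTuple):
    # Hypothetical container mirroring the definition above.
    target_entity: str    # 'schnitzel'
    aspect: str           # 'size'
    sentiment: str        # 'bad'
    opinion_holder: str   # 'Hans Meier'
    posting_time: datetime
    reason: str           # 'too small'
    qualifier: str        # 'for a hungry student'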

Definition: Opinion (simplified)

  • Example: “The schnitzel is too small for a hungry student” (Hans Meier, 2018-04-28, 13:12 UTC)
  • Consists of:
    • Topic: food
    • Sentiment: bad
    • Opinion holder: Hans Meier
    • Posting time: 2018-04-28, 13:12 UTC
  • Enough to get a grip on
    • Pain points
    • Unique selling propositions (USPs)

Basic workflow

  1. Collect data
  2. Preprocess data so they can be analyzed
  3. Analyze data
  4. Interpret and act on results

Basic workflow with TeLLers feedback

  1. Collect data → already happened
  2. Preprocess data so they can be analyzed → a little towards the end of the presentation
  3. Analyze data → main focus of this presentation
  4. Interpret and act on results → done by innkeeper using analysis UI

Enough of the pleasantries...

Brace yourself: code incoming

Text parts: sentences and tokens

Splitting a document into sentences and words

This is easy, right?

So here's a feedback document:

In [1]:
feedback_simple = 'The schnitzel tastes good. The soup was too hot. The waiter was quick and polite.'

Splitting it into sentences is easy:

In [2]:
sentences = feedback_simple.split('.')
print('\n'.join(sentences))
The schnitzel tastes good
 The soup was too hot
 The waiter was quick and polite

Splitting words

Now the same for words:

In [3]:
words = sentences[0].split(' ')
print('\n'.join(words))
The
schnitzel
tastes
good

That was easy, right?

What about this?

In [4]:
feedback_rude = '''The waiter was very rude, 
e.g. when I accidentally opened the wrong door
he screamed "Private!".'''

Let's try again with our trusty algorithm:

In [5]:
sentences = feedback_rude.split('.')
print('\n'.join(sentences))
The waiter was very rude, 
e
g
 when I accidentally opened the wrong door
he screamed "Private!"

spaCy to the rescue

There are better ways to do this. spaCy (but also other packages like NLTK) provides standard functions for it.

First, load the English language model:

In [6]:
import spacy

nlp_en = spacy.load('en')

Next parse the document:

In [7]:
document = nlp_en(feedback_simple)
for sentence in document.sents:
    print(sentence)
The schnitzel tastes good.
The soup was too hot.
The waiter was quick and polite.

Abbreviations and indirect speech galore

In [8]:
document_rude = nlp_en(feedback_rude)
for sentence in document_rude.sents:
    print(sentence)
The waiter was very rude, 
e.g. when I accidentally opened the wrong door
he screamed "Private!".

Split into words

Even though sentences print as simple strings, they are actually lists of words:

In [9]:
first_sent = next(document.sents)
for word in first_sent:
    print(word)
The
schnitzel
tastes
good
.

Tokens

Even though words are printed as simple strings, they actually are "tokens" and include meta information:

In [10]:
tastes_token = first_sent[2]
print(tastes_token)
tastes
In [11]:
tastes_token.lemma_  # basic form of word
Out[11]:
'taste'
In [12]:
tastes_token.pos_  # "part of speech" = role of word in sentence
Out[12]:
'VERB'

Token attributes

Full list: https://spacy.io/api/token#attributes

Many attributes come in two variants, with and without an underscore at the end of the name, for example lemma and lemma_. The former are integer codes that are compact to store and quick to compare, while the latter are human-readable strings.

In [13]:
print(tastes_token.pos_)
VERB
In [14]:
tastes_token.pos
Out[14]:
99

Converting between spaCy names and IDs

Import what you need from spacy.symbols:

In [15]:
from spacy.symbols import ADJ, NOUN, VERB
print(VERB)
99
In [16]:
from spacy.symbols import NAMES

print(NAMES[99])  # 99 = VERB
VERB
In [17]:
from spacy.symbols import IDS
print(IDS['VERB'])
99

Limitations of spaCy

  • Tokenizers use probabilistic models.
  • lemma and pos can sometimes be wrong.
  • Typically good enough.
  • If not: build your own model

Topics and ratings

Topics

There are several ways to find appropriate topics, for example:

  • Look what others use in similar situations
  • Automatic detection using topic modelling, for example gensim: https://radimrehurek.com/gensim/ (see the sketch after this list)
  • Build a tag cloud and see if it's useful
  • Ask domain experts (here: innkeepers)
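
As a minimal sketch of the topic-modelling option, gensim's LDA can propose topic clusters from tokenized feedback. This assumes gensim is installed and uses toy data; real use needs far more text and tuning:

from gensim import corpora, models

# Toy corpus: each feedback reduced to a list of content words.
texts = [
    ['schnitzel', 'tasty'],
    ['waiter', 'polite', 'quick'],
    ['music', 'loud'],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())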

Actual topics for our example

  • ambience: decoration, space, light, music, temperature, ...
  • food and beverages: eating, drinking, taste, menu, selection
  • hygiene: toilet, smell, ...
  • service: waiters, reaction times, courtesy, competence, availability, ...
  • value: price, size of portions, ...

Topics as code

This can be represented as an Enum:

In [18]:
from enum import Enum

class Topic(Enum):
    AMBIENCE = 1
    FOOD = 2
    HYGIENE = 3
    SERVICE = 4
    VALUE = 5

Rating (sentiment)

There are several ways to represent a rating, for example:

  • Two distinct values "positive" and "negative"
  • Same as above but with more distinct values, e.g. 1 to 5 stars
  • A float, e.g. between 0.0 and 1.0

Rating as code

For our example, we are going to use 3 degrees in both directions and represent them as an Enum:

In [19]:
class Rating(Enum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3

The lexicon

Contents of the lexicon

  • words relevant for our domain in their basic form (lemma)
    • here: words can be regular expressions
    • for example: .*schnitzel
    • accepts various kinds, for instance "schnitzel" and "surschnitzel"
  • possible topic
  • possible rating
  • can easily be stored in a spreadsheet, database etc.

Examples:

Lemma        Topic      Rating
------------ ---------- ------
waiter       service
waitress     service
wait                    bad
quick                   good
.*schnitzel  food
music        ambience
loud                    bad

How to collect words for lexicon?

  • Add words that are obvious and easy to find, for example collect food terms from the menu
  • Find the most common words in the raw data and examine them (see the sketch after this list)
  • Analyze the data with an early version and check sentences with no topic or rating for interesting words → iterative improvement
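
For the second point, a hypothetical helper using the nlp_en model from earlier could look like this (is_stop and is_punct are standard Token attributes):

from collections import Counter

def most_common_lemmas(texts, count=20):
    # Count lemmas across raw feedback texts to find lexicon candidates.
    counter = Counter()
    for text in texts:
        for token in nlp_en(text):
            if not token.is_stop and not token.is_punct:
                counter[token.lemma_.lower()] += 1
    return counter.most_common(count)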

Lexicon entries in Python

  • mostly a data container
  • but we also want to be able to check whether it matches a spaCy Token → we need a matching() function
  • tokens can match exactly or only after transformations (for example upper/lower case) → score between 0.0 (no match) and 1.0 (perfect match)

As we want to use regular expressions and spaCy's Token, we need to import them now:

In [20]:
import re
from spacy.tokens import Token
In [21]:
class LexiconEntry:
    _IS_REGEX_REGEX = re.compile(r'.*[.+*\[$^\\]')

    def __init__(self, lemma: str, topic: Topic, rating: Rating):
        assert lemma is not None
        self.lemma = lemma
        self._lower_lemma = lemma.lower()
        self.topic = topic
        self.rating = rating
        self.is_regex = bool(LexiconEntry._IS_REGEX_REGEX.match(self.lemma))
        self._regex = re.compile(lemma, re.IGNORECASE) if self.is_regex else None

    def matching(self, token: Token) -> float:
        """
        A weight between 0.0 and 1.0 on how much ``token`` matches this entry.
        """
        assert token is not None
        result = 0.0
        if self.is_regex:
            if self._regex.match(token.text):
                result = 0.6
            elif self._regex.match(token.lemma_):
                result = 0.5
        else:
            if token.text == self.lemma:
                result = 1.0
            elif token.text.lower() == self.lemma:
                result = 0.9
            elif token.lemma_ == self.lemma:
                result = 0.8
            elif token.lemma_.lower() == self.lemma:
                result = 0.7
        return result

    def __str__(self) -> str:
        result = 'LexiconEntry(%s' % self.lemma
        if self.topic is not None:
            result += ', topic=%s' % self.topic.name
        if self.rating is not None:
            result += ', rating=%s' % self.rating.name
        if self.is_regex:
            result += ', is_regex=%s' % self.is_regex
        result += ')'
        return result

    def __repr__(self) -> str:
        return self.__str__()

The lexicon in Python

  • Contains a list of LexiconEntry
  • Can find the best matching entry for a Token (or None)
  • In the beginning, entries have to be added
    • manually for our example, in practice from e.g. a CSV file
In [22]:
from math import isclose
from typing import List

class Lexicon:
    def __init__(self):
        self.entries: List[LexiconEntry] = []

    def append(self, lemma: str, topic: Topic, rating: Rating):
        lexicon_entry = LexiconEntry(lemma, topic, rating)
        self.entries.append(lexicon_entry)

    def lexicon_entry_for(self, token: Token) -> LexiconEntry:
        """
        Entry in lexicon that best matches ``token``.
        """
        result = None
        lexicon_size = len(self.entries)
        lexicon_entry_index = 0
        best_matching = 0.0
        while lexicon_entry_index < lexicon_size and not isclose(best_matching, 1.0):
            lexicon_entry = self.entries[lexicon_entry_index]
            matching = lexicon_entry.matching(token)
            if matching > best_matching:
                result = lexicon_entry
                best_matching = matching
            lexicon_entry_index += 1
        return result

Let's build a small lexicon

In [23]:
lexicon = Lexicon()
lexicon.append('waiter'     , Topic.SERVICE , None)
lexicon.append('waitress'   , Topic.SERVICE , None)
lexicon.append('wait'       , None          , Rating.BAD)
lexicon.append('quick'      , None          , Rating.GOOD)
lexicon.append('.*schnitzel', Topic.FOOD    , None)
lexicon.append('music'      , Topic.AMBIENCE, None)
lexicon.append('loud'       , None          , Rating.BAD)
lexicon.append('tasty'      , Topic.FOOD    , Rating.GOOD)
lexicon.append('polite'     , Topic.SERVICE , Rating.GOOD)

Matching tokens in a sentence to a lexicon entry

In [24]:
feedback_text = 'The music was very loud.'
feedback = nlp_en(feedback_text)
for token in next(feedback.sents):
    lexicon_entry = lexicon.lexicon_entry_for(token)
    print(f'{token!s:10} {lexicon_entry}')
The        None
music      LexiconEntry(music, topic=AMBIENCE)
was        None
very       None
loud       LexiconEntry(loud, rating=BAD)
.          None

Yeah, our first simple sentiment analysis!

Just add some filters and format the output:

In [25]:
feedback_text = 'The music was very loud.'
feedback = nlp_en(feedback_text)
for sent in feedback.sents:
    print(sent)
    for token in sent:
        lexicon_entry = lexicon.lexicon_entry_for(token)
        if lexicon_entry is not None:
            if lexicon_entry.topic is not None:
                print('    ', lexicon_entry.topic)
            if lexicon_entry.rating is not None:
                print('    ', lexicon_entry.rating)
The music was very loud.
     Topic.AMBIENCE
     Rating.BAD

The end? Not quite.

Intensifiers, diminishers, negations

Intensifiers and diminishers

  • increase or decrease the rating of sentiment words
  • examples:
    • diminishers: barely, slightly, somewhat, ...
    • intensifiers: really, terribly, very, ...

Impact on "loud":

  • "loud": Rating.BAD
  • "very loud": Rating.VERY_BAD

Intensifiers and diminishers in Python

Use sets:

In [26]:
INTENSIFIERS = {
    'really',
    'terribly',
    'very',
}

def is_intensifier(token: Token) -> bool:
    return token.lemma_.lower() in INTENSIFIERS

DIMINISHERS = {
    'barely',
    'slightly',
    'somewhat',
}

def is_diminisher(token: Token) -> bool:
    return token.lemma_.lower() in DIMINISHERS

Find out if a token is an intensifier

For a little test, get the 4th token of the 1st sentence, which is "very".

In [27]:
very_token = next(nlp_en(feedback_text).sents)[3]
print(very_token)
very
In [28]:
is_intensifier(very_token)
Out[28]:
True

Intensify or diminish a Rating

In [29]:
def signum(value) -> int:
    if value > 0:
        return 1
    elif value < 0:
        return -1
    else:
        return 0

_MIN_RATING_VALUE = Rating.VERY_BAD.value
_MAX_RATING_VALUE = Rating.VERY_GOOD.value


def _ranged_rating(rating_value: int) -> Rating:
    return Rating(min(_MAX_RATING_VALUE, max(_MIN_RATING_VALUE, rating_value)))

def diminished(rating: Rating) -> Rating:
    # Move the rating one step towards neutral; ratings of ±1 stay
    # unchanged because there is no neutral 0 rating.
    if abs(rating.value) > 1:
        return _ranged_rating(rating.value - signum(rating.value))
    else:
        return rating

def intensified(rating: Rating) -> Rating:
    # Move the rating one step away from neutral;
    # _ranged_rating() clamps at VERY_BAD / VERY_GOOD.
    return _ranged_rating(rating.value + signum(rating.value))

print(diminished(Rating.BAD))
print(diminished(Rating.SOMEWHAT_BAD))
print(intensified(Rating.BAD))
Rating.SOMEWHAT_BAD
Rating.SOMEWHAT_BAD
Rating.VERY_BAD

Negations

  • turn a sentiment into its opposite
  • example: "not"
    • "tasty" = Rating.GOOD
    • "not tasty" = Rating.BAD
  • can be combined with intensifiers and diminishers
  • example:
    • "very good" = Rating.VERY_GOOD
    • "not very good" = Rating.SOMEWHAT_BAD
  • negation also swaps intensifiers and diminishers

Negations in Python

Detection is similar to intensifiers and diminishers:

In [30]:
NEGATIONS = {
    'no',
    'not',
    'none',
}

def is_negation(token: Token) -> bool:
    return token.lemma_.lower() in NEGATIONS

Negation of a Rating

Negating a Rating is a classic mapping problem:

In [31]:
_RATING_TO_NEGATED_RATING_MAP = {
    Rating.VERY_BAD     : Rating.SOMEWHAT_GOOD,
    Rating.BAD          : Rating.GOOD,
    Rating.SOMEWHAT_BAD : Rating.GOOD,  # hypothetical?
    Rating.SOMEWHAT_GOOD: Rating.BAD,  # hypothetical?
    Rating.GOOD         : Rating.BAD,
    Rating.VERY_GOOD    : Rating.SOMEWHAT_BAD,
}

def negated_rating(rating: Rating) -> Rating:
    assert rating is not None
    return _RATING_TO_NEGATED_RATING_MAP[rating]

print(Rating.GOOD, ' -> ', negated_rating(Rating.GOOD))
print(Rating.VERY_BAD, ' -> ', negated_rating(Rating.VERY_BAD))
Rating.GOOD  ->  Rating.BAD
Rating.VERY_BAD  ->  Rating.SOMEWHAT_GOOD

So far so good

Based on a simple lexicon and a few Python sets, we can now assign sentiment information to single tokens concerning:

  • Topic
  • Rating
  • intensifiers / diminishers
  • negations

However, we still need to combine multiple tokens in our analysis. We could of course start messing with lists of tokens, but spaCy offers nicer possibilities for such situations.

Extending spaCy's pipeline

What's this pipeline thingy?

  • When you pass a text to spaCy's nlp() it performs multiple separate steps until it ends up with tokens and all their attributes
  • Nice and clean "separation of concerns" (a basic software principle)
  • Tokens can get additional attributes (the same goes for documents (Doc) and sentences (Span), but we don't need this right now)
  • Steps can be added to or removed from the pipeline

Recommended reading: https://explosion.ai/blog/spacy-v2-pipelines-extensions
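
To see which steps the loaded pipeline currently contains, inspect nlp.pipe_names; the exact component names depend on the spaCy version and model:

# For a spaCy 2.x English model this prints something like
# ['tagger', 'parser', 'ner'].
print(nlp_en.pipe_names)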

Extending Token

We can add new attributes for sentiment relevant information to the extensible "underscore" attribute:

In [32]:
Token.set_extension('topic', default=None)
Token.set_extension('rating', default=None)
Token.set_extension('is_negation', default=False)
Token.set_extension('is_intensifier', default=False)
Token.set_extension('is_diminisher', default=False)

Now we can set and examine these attributes

In [33]:
token = next(nlp_en('schnitzel').sents)[0]
print(token.lemma_)
token._.topic = Topic.FOOD
print(token._.topic)
schnitzel
Topic.FOOD

Intermission: a small debugging function

In order to print tokens including the new attributes here's a little helper:

In [34]:
def debugged_token(token: Token) -> str:
    result = 'Token(%s, lemma=%s' % (token.text, token.lemma_)
    if token._.topic is not None:
        result += ', topic=' + token._.topic.name
    if token._.rating is not None:
        result += ', rating=' + token._.rating.name
    if token._.is_diminisher:
        result += ', diminisher'
    if token._.is_intensifier:
        result += ', intensifier'
    if token._.is_negation:
        result += ', negation'
    result += ')'
    return result

print(debugged_token(token))
Token(schnitzel, lemma=schnitzel, topic=FOOD)

Extending the pipeline

First we need a function to add to the pipeline that sets our new Token attributes:

In [35]:
def opinion_matcher(doc):
    for sentence in doc.sents:
        for token in sentence:
            if is_intensifier(token):
                token._.is_intensifier = True
            elif is_diminisher(token):
                token._.is_diminisher = True
            elif is_negation(token):
                token._.is_negation = True
            else:
                lexicon_entry = lexicon.lexicon_entry_for(token)
                if lexicon_entry is not None:
                    token._.rating = lexicon_entry.rating
                    token._.topic = lexicon_entry.topic
    return doc

Then we can actually add it to the pipeline (and remove it first if it already was part of the pipeline):

In [36]:
if nlp_en.has_pipe('opinion_matcher'):
    nlp_en.remove_pipe('opinion_matcher')
nlp_en.add_pipe(opinion_matcher)

Extracting tokens relevant to the opinion

With all the information attached to the tokens, it is simple to reduce a sentence to its essential information:

In [37]:
def is_essential(token: Token) -> bool:
    return token._.topic is not None \
        or token._.rating is not None \
        or token._.is_diminisher \
        or token._.is_intensifier \
        or token._.is_negation
        
def essential_tokens(tokens):
    return [token for token in tokens if is_essential(token)]

For example:

In [38]:
document = nlp_en('The schnitzel is not very tasty.')

opinion_essence = essential_tokens(document)
for token in opinion_essence:
    print(debugged_token(token))
Token(schnitzel, lemma=schnitzel, topic=FOOD)
Token(not, lemma=not, negation)
Token(very, lemma=very, intensifier)
Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)

Apply on Rating

Now we have all the building blocks to apply intensifiers, diminishers and negations to the rating. The basic idea: when we encounter a token with a rating, we combine it with the modifiers to its left until no more modifiers are found.

To keep things tidy, here's another little helper:

In [39]:
def is_rating_modifier(token: Token):
    return token._.is_diminisher \
        or token._.is_intensifier \
        or token._.is_negation

Example

The previous sentence yielded the tokens:

Token(schnitzel, lemma=schnitzel, topic=FOOD)
Token(not, lemma=not, negation)
Token(very, lemma=very, intensifier)
Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)

We want to perform the following steps:

  1. Find the first rating from the left -> tasty(GOOD)
  2. Check if the token to the left is a modifier -> yes: very(intensifier)
  3. Combine them -> (very) tasty(VERY_GOOD) and remove the left token
  4. Check if the token to the left is a modifier -> yes: not(negation)
  5. Combine them -> (not very) tasty(SOMEWHAT_BAD) and remove the left token
  6. End result: not very tasty -> tasty(SOMEWHAT_BAD)
In [40]:
def combine_ratings(tokens):
    # Find the first rating (if any).
    rating_token_index = next(
        (
            token_index for token_index in range(len(tokens))
            if tokens[token_index]._.rating is not None
        ),
        None  # Default if no rating token can be found
    )

    if rating_token_index is not None:
        # Apply modifiers to the left on the rating.
        original_rating_token = tokens[rating_token_index]
        combined_rating = original_rating_token._.rating
        modifier_token_index = rating_token_index - 1
        modified = True  # Did the last iteration modify anything?
        while modified and modifier_token_index >= 0:
            modifier_token = tokens[modifier_token_index]
            if is_intensifier(modifier_token):
                combined_rating = intensified(combined_rating)
            elif is_diminisher(modifier_token):
                combined_rating = diminished(combined_rating)
            elif is_negation(modifier_token):
                combined_rating = negated_rating(combined_rating)
            else:
                # We are done, no more modifiers 
                # to the left of this rating.
                modified = False
            if modified:
                # Discard the current modifier
                # and move on to the token on the left.
                del tokens[modifier_token_index]
                modifier_token_index -= 1
        original_rating_token._.rating = combined_rating

Example for a combined rating

In [41]:
document = nlp_en('The schnitzel is not very tasty.')

opinion_essence = essential_tokens(document)
print('essential tokens:')
for token in opinion_essence:
    print('  ', debugged_token(token))

combine_ratings(opinion_essence)
print('combined tokens:')
for token in opinion_essence:
    print('  ', debugged_token(token))
essential tokens:
   Token(schnitzel, lemma=schnitzel, topic=FOOD)
   Token(not, lemma=not, negation)
   Token(very, lemma=very, intensifier)
   Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)
combined tokens:
   Token(schnitzel, lemma=schnitzel, topic=FOOD)
   Token(tasty, lemma=tasty, topic=FOOD, rating=SOMEWHAT_BAD)

A function to extract topic and rating of a sentence

In [42]:
from typing import List, Tuple  # for fancy type hints

def topic_and_rating_of(tokens: List[Token]) -> Tuple[Topic, Rating]:
    result_topic = None
    result_rating = None
    opinion_essence = essential_tokens(tokens)
    # print('  1: ', opinion_essence)
    combine_ratings(opinion_essence)
    # print('  2: ', opinion_essence)
    for token in opinion_essence:
        # print(debugged_token(token))
        if (token._.topic is not None) and (result_topic is None):
            result_topic = token._.topic
        if (token._.rating is not None) and (result_rating is None):
            result_rating = token._.rating
        if (result_topic is not None) and (result_rating is not None):
            break
    return result_topic, result_rating

sentence = next(nlp_en('The schnitzel is not very tasty.').sents)

print(sentence)
print(topic_and_rating_of(sentence))
The schnitzel is not very tasty.
(<Topic.FOOD: 2>, <Rating.SOMEWHAT_BAD: -1>)

A function to extract opinions from feedback

In [43]:
def opinions(feedback_text: str):
    feedback = nlp_en(feedback_text)
    for tokens in feedback.sents:
        yield topic_and_rating_of(tokens)
In [44]:
feedback_text = """
The schnitzel was not very tasty. 
The waiter was polite.
The football game ended 2:1."""

for topic, rating in opinions(feedback_text):
    print(topic, rating)
Topic.FOOD Rating.SOMEWHAT_BAD
Topic.SERVICE Rating.GOOD
None None

In summary:

  • Food needs improvement, Service is fine.
  • Football results are of no interest to us.

Enough with the code!

mind blown

What's next?

Plenty!

  • modals: "could", "should" -> typically indicate a negative rating
  • idioms that indicate rating, for example "Leaves a lot to be desired"
  • back references: "he" (the waiter), "it" (the schnitzel) -> use topic from previous sentence if no other topic is given
  • add a topic hierarchy, for example tasty is about food.taste
  • Instead of careful handmade functions like combine_rating() use abstract rules that can be processed by a rule engine
  • ...and all the other linguist jazz!

Neverthess: the simple algorithm presented gets > 80% of feedback for inkeepers right. Which is good enough for its indended purpose of finding areas of interest that need further examination.
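
As an illustration of the back-reference idea above, a hypothetical post-processing step on top of opinions() could carry the topic of the previous sentence forward:

def opinions_with_back_references(feedback_text: str):
    # Hypothetical sketch: if a sentence yields a rating but no topic,
    # reuse the topic of the most recent sentence that had one.
    previous_topic = None
    for topic, rating in opinions(feedback_text):
        if topic is None and rating is not None:
            topic = previous_topic
        if topic is not None:
            previous_topic = topic
        yield topic, rating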

Special topics

Emojis
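
Emojis often carry a rating on their own. A hypothetical sketch, analogous to the lexicon: map sentiment-bearing emojis directly to ratings during preprocessing:

# Hypothetical mapping from sentiment-bearing emojis to ratings;
# could be checked per token before the lexicon lookup.
EMOJI_TO_RATING_MAP = {
    '🙂': Rating.GOOD,
    '😀': Rating.VERY_GOOD,
    '🙁': Rating.BAD,
    '😡': Rating.VERY_BAD,
}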

Slang terms

  • German vs. Austrian
  • English vs. Scottish
  • Preprocess and replace using synonyms (see the sketch after the map below)
  • Only for sentiment-relevant terms
In [45]:
AUSTRIAN_TO_GERMAN_SYNONYM_MAP = {
    'nix': 'nichts',   # nothing
    'ois': 'alles',    # everything
    'kana': 'keiner',  # no one (more eastern parts)
    'koana': 'keiner', # no one (more western parts)
    # ...
}
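
A hypothetical preprocessing helper that applies such a map before the text enters the NLP pipeline (naive whitespace tokenization, good enough as a sketch):

def replaced_synonyms(text: str, synonym_map) -> str:
    # Replace slang words by their standard form, word by word.
    return ' '.join(
        synonym_map.get(word.lower(), word)
        for word in text.split()
    )

# Example: replaced_synonyms('nix passt', AUSTRIAN_TO_GERMAN_SYNONYM_MAP)
# yields 'nichts passt'.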

Unknown abbreviations

  • example of such an abbreviation: "resp." (respectively)
  • breaks sentence detection
  • if common: contribute to spaCy (module tokenizer_exceptions)
  • if uncommon: add a synonym for the expanded form as a preprocessing step, or add a local tokenizer special case as sketched below
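
As a minimal sketch of the local alternative mentioned above, spaCy's tokenizer accepts special cases, so the trailing period of "resp." is no longer mistaken for a sentence end:

from spacy.attrs import ORTH

# Keep 'resp.' together as a single token instead of letting
# the period become a sentence boundary candidate.
nlp_en.tokenizer.add_special_case('resp.', [{ORTH: 'resp.'}])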

Typos

  • if a word is not relevant for topics and ratings: ignore it
  • common typos: use synonyms (similar to slang terms)
  • uncommon typos:
    • Is this really a thing?
    • Mobile input correction to the rescue
    • If you must:
    • Not needed for TeLLers


Summary

  • Sentiment analysis is challenging.
  • Python and spaCy help a lot with the development part.
  • Code complexity can remain manageable.