import re
And he wants to...:
(Sometimes also called: sentiment detection, opinion mining etc.)
Image by User: Benreis at wikivoyage shared, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22713889
feedback_simple = 'The schnitzel tastes good. The soup was too hot. The waiter was quick and polite.'
Splitting it into sentences is easy:
sentences = feedback_simple.split('.')
print('\n'.join(sentences))
The schnitzel tastes good
 The soup was too hot
 The waiter was quick and polite
Now the same for words:
words = sentences[0].split(' ')
print('\n'.join(words))
The
schnitzel
tastes
good
That was easy, right?
feedback_rude = '''The waiter was very rude,
e.g. when I accidentally opened the wrong door
he screamed "Private!".'''
Let's try again with our trusty algorithm:
sentences = feedback_rude.split('.')
print('\n'.join(sentences))
The waiter was very rude,
e
g
 when I accidentally opened the wrong door
he screamed "Private!"
There are better ways to do this. spaCy (but also other packages like NLTK) provides standard functions for this.
First, load the English language model:
import spacy
nlp_en = spacy.load('en')
Next parse the document:
document = nlp_en(feedback_simple)
for sentence in document.sents:
print(sentence)
The schnitzel tastes good.
The soup was too hot.
The waiter was quick and polite.
document_rude = nlp_en(feedback_rude)
for sentence in document_rude.sents:
print(sentence)
The waiter was very rude,
e.g. when I accidentally opened the wrong door
he screamed "Private!".
Even though sentences print as simple strings, they are actually lists of words:
first_sent = next(document.sents)
for word in first_sent:
print(word)
The
schnitzel
tastes
good
.
Even though words are printed as simple strings, they actually are "tokens" and include meta information:
tastes_token = first_sent[2]
print(tastes_token)
tastes
tastes_token.lemma_ # basic form of word
'taste'
tastes_token.pos_ # "part of speech" = role of word in sentence
'VERB'
Full list: https://spacy.io/api/token#attributes
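To give an idea of the range, here are a few more attributes (an illustrative pick, not from the original; exact values may vary slightly by spaCy version and model):
print(tastes_token.tag_)      # fine-grained part of speech tag, e.g. 'VBZ'
print(tastes_token.is_alpha)  # True: consists of alphabetic characters only
print(tastes_token.shape_)    # 'xxxx': lowercase word shape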
Many attributes have two variants, with or without an underscore at the end of the name, for example lemma and lemma_. The former are integer codes that are compact to store and quick to compare, while the latter are easier to read.
print(tastes_token.pos_)
VERB
tastes_token.pos
99
Import what you need from spacy.symbols:
from spacy.symbols import ADJ, NOUN, VERB
print(VERB)
99
from spacy.symbols import NAMES
print(NAMES[99]) # 99 = VERB
VERB
from spacy.symbols import IDS
print(IDS['VERB'])
99
lemma and pos can sometimes be wrong.
There are several ways to find appropriate topics, for example gensim: https://radimrehurek.com/gensim/ (see the brief sketch after the Enum below).
This can be represented as an Enum:
from enum import Enum
class Topic(Enum):
    AMBIENCE = 1
    FOOD = 2
    HYGIENE = 3
    SERVICE = 4
    VALUE = 5
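As an aside, here is a rough sketch (not part of the original lexicon approach, and it assumes gensim is installed) of how gensim's LDA could suggest candidate topics from already tokenized feedback:
from gensim import corpora, models

token_lists = [
    ['schnitzel', 'tasty'],
    ['waiter', 'rude'],
    ['music', 'loud'],
]
dictionary = corpora.Dictionary(token_lists)
corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
# Train a tiny LDA model; in practice you would use many more documents.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.print_topics())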
There are several ways to represent a rating, for example:
For our example, we are going to use 3 degrees in both directions and represent them as an Enum
:
class Rating(Enum):
    VERY_BAD = -3
    BAD = -2
    SOMEWHAT_BAD = -1
    SOMEWHAT_GOOD = 1
    GOOD = 2
    VERY_GOOD = 3
A lemma can also be a regular expression, for example .*schnitzel.
Examples:
Lemma        Topic      Rating
------------ ---------- ------
waiter       service
waitress     service
wait                    bad
quick                   good
.*schnitzel  food
music        ambience
loud                    bad
To find out how well a lexicon entry matches a spaCy Token we need a matching() function. And as we want to be able to use regular expressions and the spaCy Token, we need to import them now:
import re
from spacy.tokens import Token
class LexiconEntry:
    _IS_REGEX_REGEX = re.compile(r'.*[.+*\[$^\\]')

    def __init__(self, lemma: str, topic: Topic, rating: Rating):
        assert lemma is not None
        self.lemma = lemma
        self._lower_lemma = lemma.lower()
        self.topic = topic
        self.rating = rating
        self.is_regex = bool(LexiconEntry._IS_REGEX_REGEX.match(self.lemma))
        self._regex = re.compile(lemma, re.IGNORECASE) if self.is_regex else None

    def matching(self, token: Token) -> float:
        """
        A weight between 0.0 and 1.0 on how much ``token`` matches this entry.
        """
        assert token is not None
        result = 0.0
        if self.is_regex:
            if self._regex.match(token.text):
                result = 0.6
            elif self._regex.match(token.lemma_):
                result = 0.5
        else:
            if token.text == self.lemma:
                result = 1.0
            elif token.text.lower() == self.lemma:
                result = 0.9
            elif token.lemma_ == self.lemma:
                result = 0.8
            elif token.lemma_.lower() == self.lemma:
                result = 0.7
        return result

    def __str__(self) -> str:
        result = 'LexiconEntry(%s' % self.lemma
        if self.topic is not None:
            result += ', topic=%s' % self.topic.name
        if self.rating is not None:
            result += ', rating=%s' % self.rating.name
        if self.is_regex:
            result += ', is_regex=%s' % self.is_regex
        result += ')'
        return result

    def __repr__(self) -> str:
        return self.__str__()
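A quick check of matching() (a hypothetical example, not from the original feedback texts):
schnitzel_entry = LexiconEntry('.*schnitzel', Topic.FOOD, None)
some_token = next(nlp_en('The Wienerschnitzel was huge.').sents)[1]
print(some_token.text, schnitzel_entry.matching(some_token))  # Wienerschnitzel 0.6 (regex match on the text)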
A Lexicon collects multiple LexiconEntry items and can find the one that best matches a given Token (or None if nothing matches). To check for a perfect match we use math.isclose:
from math import isclose
from typing import List  # needed for the type hint on Lexicon.entries
class Lexicon:
    def __init__(self):
        self.entries: List[LexiconEntry] = []

    def append(self, lemma: str, topic: Topic, rating: Rating):
        lexicon_entry = LexiconEntry(lemma, topic, rating)
        self.entries.append(lexicon_entry)

    def lexicon_entry_for(self, token: Token) -> LexiconEntry:
        """
        Entry in lexicon that best matches ``token``.
        """
        result = None
        lexicon_size = len(self.entries)
        lexicon_entry_index = 0
        best_matching = 0.0
        while lexicon_entry_index < lexicon_size and not isclose(best_matching, 1.0):
            lexicon_entry = self.entries[lexicon_entry_index]
            matching = lexicon_entry.matching(token)
            if matching > best_matching:
                result = lexicon_entry
                best_matching = matching
            lexicon_entry_index += 1
        return result
lexicon = Lexicon()
lexicon.append('waiter' , Topic.SERVICE , None)
lexicon.append('waitress' , Topic.SERVICE , None)
lexicon.append('wait' , None , Rating.BAD)
lexicon.append('quick' , None , Rating.GOOD)
lexicon.append('.*schnitzel', Topic.FOOD , None)
lexicon.append('music' , Topic.AMBIENCE, None)
lexicon.append('loud' , None , Rating.BAD)
lexicon.append('tasty' , Topic.FOOD , Rating.GOOD)
lexicon.append('polite' , Topic.SERVICE , Rating.GOOD)
feedback_text = 'The music was very loud.'
feedback = nlp_en(feedback_text)
for token in next(feedback.sents):
    lexicon_entry = lexicon.lexicon_entry_for(token)
    print(f'{token!s:10} {lexicon_entry}')
The        None
music      LexiconEntry(music, topic=AMBIENCE)
was        None
very       None
loud       LexiconEntry(loud, rating=BAD)
.          None
Just add some filters and format the output:
feedback_text = 'The music was very loud.'
feedback = nlp_en(feedback_text)
for sent in feedback.sents:
    print(sent)
    for token in sent:
        lexicon_entry = lexicon.lexicon_entry_for(token)
        if lexicon_entry is not None:
            if lexicon_entry.topic is not None:
                print(' ', lexicon_entry.topic)
            if lexicon_entry.rating is not None:
                print(' ', lexicon_entry.rating)
The music was very loud.
  Topic.AMBIENCE
  Rating.BAD
The end? Not quite. So far the word "very" is simply skipped even though it changes the impact on "loud":
loud      -> Rating.BAD
very loud -> Rating.VERY_BAD
Use sets:
INTENSIFIERS = {
    'really',
    'terribly',
    'very',
}

def is_intensifier(token: Token) -> bool:
    return token.lemma_.lower() in INTENSIFIERS
DIMINISHERS = {
    'barely',
    'slightly',
    'somewhat',
}

def is_diminisher(token: Token) -> bool:
    return token.lemma_.lower() in DIMINISHERS
For a little test, get the 4th token of the 1st sentence, which is "very".
very_token = next(nlp_en(feedback_text).sents)[3]
print(very_token)
very
is_intensifier(very_token)
True
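The same kind of check works for a diminisher (again a hypothetical example sentence, not from the original):
slightly_token = next(nlp_en('The soup was slightly cold.').sents)[3]
print(slightly_token, is_diminisher(slightly_token))  # slightly True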
def signum(value) -> int:
    if value > 0:
        return 1
    elif value < 0:
        return -1
    else:
        return 0
_MIN_RATING_VALUE = Rating.VERY_BAD.value
_MAX_RATING_VALUE = Rating.VERY_GOOD.value
def _ranged_rating(rating_value: int) -> Rating:
    return Rating(min(_MAX_RATING_VALUE, max(_MIN_RATING_VALUE, rating_value)))

def diminished(rating: Rating) -> Rating:
    if abs(rating.value) > 1:
        return _ranged_rating(rating.value - signum(rating.value))
    else:
        return rating

def intensified(rating: Rating) -> Rating:
    if abs(rating.value) > 1:
        return _ranged_rating(rating.value + signum(rating.value))
    else:
        return rating
print(diminished(Rating.BAD))
print(diminished(Rating.SOMEWHAT_BAD))
print(intensified(Rating.BAD))
Rating.SOMEWHAT_BAD
Rating.SOMEWHAT_BAD
Rating.VERY_BAD
Detecting negations works similarly to intensifiers and diminishers:
NEGATIONS = {
    'no',
    'not',
    'none',
}

def is_negation(token: Token) -> bool:
    return token.lemma_.lower() in NEGATIONS
Negating a Rating is a classic mapping issue:
_RATING_TO_NEGATED_RATING_MAP = {
    Rating.VERY_BAD     : Rating.SOMEWHAT_GOOD,
    Rating.BAD          : Rating.GOOD,
    Rating.SOMEWHAT_BAD : Rating.GOOD,  # hypothetical?
    Rating.SOMEWHAT_GOOD: Rating.BAD,   # hypothetical?
    Rating.GOOD         : Rating.BAD,
    Rating.VERY_GOOD    : Rating.SOMEWHAT_BAD,
}

def negated_rating(rating: Rating) -> Rating:
    assert rating is not None
    return _RATING_TO_NEGATED_RATING_MAP[rating]
print(Rating.GOOD, ' -> ', negated_rating(Rating.GOOD))
print(Rating.VERY_BAD, ' -> ', negated_rating(Rating.VERY_BAD))
Rating.GOOD  ->  Rating.BAD
Rating.VERY_BAD  ->  Rating.SOMEWHAT_GOOD
Based on a simple lexicon and a few Python sets we can now assign sentiment information to single tokens concerning topic, rating, negation, intensifiers and diminishers.
However, we still need to combine multiple tokens in our analysis. We could of course start messing with lists of tokens. However, spaCy offers nice possibilities for such situations.
When you call nlp(), spaCy performs multiple separate pipeline steps until it ends up with tokens and all their attributes. A Token can get additional attributes (the same goes for documents (Doc) and sentences (Span), but we don't need this right now). Recommended reading: https://explosion.ai/blog/spacy-v2-pipelines-extensions
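To see which steps the loaded model currently performs, we can list its pipeline components (this snippet is not in the original; the exact names depend on the installed model):
print(nlp_en.pipe_names)  # e.g. ['tagger', 'parser', 'ner']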
We can add new attributes for sentiment relevant information to the extensible "underscore" attribute:
Token.set_extension('topic', default=None)
Token.set_extension('rating', default=None)
Token.set_extension('is_negation', default=False)
Token.set_extension('is_intensifier', default=False)
Token.set_extension('is_diminisher', default=False)
Now we can set and examine these attributes:
token = next(nlp_en('schnitzel').sents)[0]
print(token.lemma_)
token._.topic = Topic.FOOD
print(token._.topic)
schnitzel
Topic.FOOD
In order to print tokens including the new attributes, here's a little helper:
def debugged_token(token: Token) -> str:
    result = 'Token(%s, lemma=%s' % (token.text, token.lemma_)
    if token._.topic is not None:
        result += ', topic=' + token._.topic.name
    if token._.rating is not None:
        result += ', rating=' + token._.rating.name
    if token._.is_diminisher:
        result += ', diminisher'
    if token._.is_intensifier:
        result += ', intensifier'
    if token._.is_negation:
        result += ', negation'
    result += ')'
    return result
print(debugged_token(token))
Token(schnitzel, lemma=schnitzel, topic=FOOD)
First we need a function to add to the pipeline that sets our new Token attributes:
def opinion_matcher(doc):
    for sentence in doc.sents:
        for token in sentence:
            if is_intensifier(token):
                token._.is_intensifier = True
            elif is_diminisher(token):
                token._.is_diminisher = True
            elif is_negation(token):
                token._.is_negation = True
            else:
                lexicon_entry = lexicon.lexicon_entry_for(token)
                if lexicon_entry is not None:
                    token._.rating = lexicon_entry.rating
                    token._.topic = lexicon_entry.topic
    return doc
Then we can actually add it to the pipeline (and remove it first if it already was part of the pipeline):
if nlp_en.has_pipe('opinion_matcher'):
    nlp_en.remove_pipe('opinion_matcher')
nlp_en.add_pipe(opinion_matcher)
With all the information attached to the token it is simple to reduce a sentence to its essential information:
def is_essential(token: Token) -> bool:
    return token._.topic is not None \
        or token._.rating is not None \
        or token._.is_diminisher \
        or token._.is_intensifier \
        or token._.is_negation

def essential_tokens(tokens):
    return [token for token in tokens if is_essential(token)]
For example:
document = nlp_en('The schnitzel is not very tasty.')
opinion_essence = essential_tokens(document)
for token in opinion_essence:
    print(debugged_token(token))
Token(schnitzel, lemma=schnitzel, topic=FOOD)
Token(not, lemma=not, negation)
Token(very, lemma=very, intensifier)
Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)
Now we have all the building blocks to apply intensifiers, diminishers and negations to the rating. The basic idea is: when we encounter a token with a rating, we combine it with the modifiers to its left until we don't find any more.
To keep things tidy, here's another little helper:
def is_rating_modifier(token: Token):
    return token._.is_diminisher \
        or token._.is_intensifier \
        or token._.is_negation
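A quick check (not in the original) using the "not" token from the essential tokens above:
print(is_rating_modifier(opinion_essence[1]))  # True: "not" was flagged as a negation by the pipeline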
The previous sentence yielded the tokens:
Token(schnitzel, lemma=schnitzel, topic=FOOD)
Token(not, lemma=not, negation)
Token(very, lemma=very, intensifier)
Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)
We want to perform the following steps:
1. Find the token with a rating: tasty(GOOD).
2. Apply the modifier to its left, very(intensifier), giving (very) tasty(VERY_GOOD), and remove the left token.
3. Apply the next modifier to its left, not(negation), giving (not very) tasty(SOMEWHAT_BAD), and remove the left token.
So: not very tasty(GOOD) -> tasty(SOMEWHAT_BAD).
def combine_ratings(tokens):
    # Find the first rating (if any).
    rating_token_index = next(
        (
            token_index for token_index in range(len(tokens))
            if tokens[token_index]._.rating is not None
        ),
        None  # Default if no rating token can be found
    )
    if rating_token_index is not None:
        # Apply modifiers to the left on the rating.
        original_rating_token = tokens[rating_token_index]
        combined_rating = original_rating_token._.rating
        modifier_token_index = rating_token_index - 1
        modified = True  # Did the last iteration modify anything?
        while modified and modifier_token_index >= 0:
            modifier_token = tokens[modifier_token_index]
            if is_intensifier(modifier_token):
                combined_rating = intensified(combined_rating)
            elif is_diminisher(modifier_token):
                combined_rating = diminished(combined_rating)
            elif is_negation(modifier_token):
                combined_rating = negated_rating(combined_rating)
            else:
                # We are done, no more modifiers
                # to the left of this rating.
                modified = False
            if modified:
                # Discard the current modifier
                # and move on to the token on the left.
                del tokens[modifier_token_index]
                modifier_token_index -= 1
        original_rating_token._.rating = combined_rating
document = nlp_en('The schnitzel is not very tasty.')
opinion_essence = essential_tokens(document)
print('essential tokens:')
for token in opinion_essence:
    print(' ', debugged_token(token))
combine_ratings(opinion_essence)
print('combined tokens:')
for token in opinion_essence:
    print(' ', debugged_token(token))
essential tokens:
  Token(schnitzel, lemma=schnitzel, topic=FOOD)
  Token(not, lemma=not, negation)
  Token(very, lemma=very, intensifier)
  Token(tasty, lemma=tasty, topic=FOOD, rating=GOOD)
combined tokens:
  Token(schnitzel, lemma=schnitzel, topic=FOOD)
  Token(tasty, lemma=tasty, topic=FOOD, rating=SOMEWHAT_BAD)
from typing import List, Tuple # for fancy type hints
def topic_and_rating_of(tokens: List[Token]) -> Tuple[Topic, Rating]:
    result_topic = None
    result_rating = None
    opinion_essence = essential_tokens(tokens)
    # print(' 1: ', opinion_essence)
    combine_ratings(opinion_essence)
    # print(' 2: ', opinion_essence)
    for token in opinion_essence:
        # print(debugged_token(token))
        if (token._.topic is not None) and (result_topic is None):
            result_topic = token._.topic
        if (token._.rating is not None) and (result_rating is None):
            result_rating = token._.rating
        if (result_topic is not None) and (result_rating is not None):
            break
    return result_topic, result_rating
sentence = next(nlp_en('The schnitzel is not very tasty.').sents)
print(sentence)
print(topic_and_rating_of(sentence))
The schnitzel is not very tasty.
(<Topic.FOOD: 2>, <Rating.SOMEWHAT_BAD: -1>)
def opinions(feedback_text: str):
    feedback = nlp_en(feedback_text)
    for tokens in feedback.sents:
        yield topic_and_rating_of(tokens)
feedback_text = """
The schnitzel was not very tasty.
The waiter was polite.
The football game ended 2:1."""
for topic, rating in opinions(feedback_text):
    print(topic, rating)
Topic.FOOD Rating.SOMEWHAT_BAD
Topic.SERVICE Rating.GOOD
None None
In summary: with a simple lexicon, a few Python sets and a small spaCy pipeline extension we can extract a topic and a rating from each sentence of a feedback text.
What could be improved? Plenty! For example:
- the lexicon could know that "tasty" is about food because of its relation to "taste"
- combine_ratings() could use abstract rules that can be processed by a rule engine
Nevertheless: the simple algorithm presented gets more than 80% of the feedback for innkeepers right, which is good enough for its intended purpose of finding areas of interest that need further examination.
Austrian guests often write dialect words that a standard German model does not know, so a simple synonym map can translate them to standard German first:
AUSTRIAN_TO_GERMAN_SYNONYM_MAP = {
    'nix': 'nichts',    # nothing
    'ois': 'alles',     # everything
    'kana': 'keiner',   # no one (more eastern parts)
    'koana': 'keiner',  # no one (more western parts)
    # ...
}
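A minimal sketch (an assumption, not from the original) of one way to apply such a map before parsing, using naive whitespace splitting:
def standard_german(text: str) -> str:
    # Replace known Austrian words with their standard German counterpart.
    words = text.split(' ')
    return ' '.join(AUSTRIAN_TO_GERMAN_SYNONYM_MAP.get(word.lower(), word) for word in words)

print(standard_german('nix ois'))  # nichts alles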
Such replacements could, for example, also be hooked into spaCy's tokenizer_exceptions.