NLP - Primer
- Text, particularly unstructured text, is as abundant as it is important to understand!
- Introduction to NN Translation with GPUs
- Sources
- Topic spotting
- Text classification
- Application
- chatbot
- translation
- sentiment analysis
String Primer
Unicode
- In Python 2, prefix the literal with u to ensure a Unicode string (Python 3 strings are Unicode by default)
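A minimal illustration (the accented string is just an example):
# Python 2: bare string literals are bytes, so prefix with u for Unicode
s2 = u'café'
# Python 3: str is Unicode by default, so the u prefix is optional
s3 = 'café'
print(len(s3))  # 4 characters, independent of the byte encoding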
Regular Expression - strings with special syntax
- allow matching patterns in other strings
- the Python module is re:
import re
- Matching a pattern against a string
re.match('abc', 'abcdef')
# a word pattern
word_regex = '\w+'
re.match(word_regex, 'hi there!')
Common Regex Patterns
```\w+``` [word, 'Magic']
```\d``` [digit, 9]
```\s``` [space, '']
```.*``` [wildcard, 'usename74']
```+ or *``` [greedy, 'aaaaa']
capitalised = negation, ```\S``` [not space, 'no_spaces']
```[a-z]``` [lowercase group, 'abcedfg']
>>> the return type depends on the method: an iterator, a list of strings, or a match object
>>> useful for preprocessing before **TOKENISATION**
my_string = "Let's write RegEx! Won't that be fun? I sure think so. Can you find 4 sentences? Or perhaps, all 19 words?"
import re
sentence_endings = r"[.?!]"
print("\n",re.split(sentence_endings, my_string))
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))
spaces = r"\s+"
print("\n", re.split(spaces, my_string))
digits = r"\d+"
print("\n",re.findall(digits, my_string))
["Let's write RegEx", " Won't that be fun", ' I sure think so', ' Can you find 4 sentences', ' Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']
Tokenisation
- preparing text for NLP
- N-grams, punctuation, hashtags, etc.
- NLTK module:
from nltk.tokenize import word_tokenize
- WHY
- easier to map to PART OF SPEECH
- matching common words
- removing unwanted tokens
- e.g. ‘I don’t like Sam’s shoes’ -> “I”, “do”, “n’t”…
- Other NLTK class:
- ```sent_tokenize``` tokenises a document into sentences
- ```regexp_tokenize``` tokenises a string or document using a regex pattern
- ```TweetTokenizer``` special class for tweets: hashtags, mentions, lots of exclamation points !!!
Diff(search, match)
- re.search matches the pattern anywhere in the string, whereas re.match only matches at the start; match is useful when the pattern should cover the beginning (or the whole) of the string
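A minimal check of the difference (toy strings):
import re

print(re.match('cd', 'abcde'))   # None - 'cd' is not at the start of the string
print(re.search('cd', 'abcde'))  # <re.Match object; span=(2, 4), match='cd'>
print(re.match('ab', 'abcde'))   # matches, because the pattern sits at the start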
with open('Data_Folder/TxT/grail_abridged.txt', mode='r') as file:
scene_one = file.read()
from nltk.tokenize import sent_tokenize, word_tokenize
# Split as sentences
sentences = sent_tokenize(scene_one)
print('\n',sentences[:3])
# tokenise the 4th sentence
tokenize_sent = word_tokenize(sentences[3])
print('\n',tokenize_sent)
unique_tokens = set(word_tokenize(scene_one))
print('\n',unique_tokens)
['SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!', '[clop clop clop] \nSOLDIER #1: Halt!', 'Who goes there?']
['ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.']
{'they', 'Not', 'husk', 'using', 'fly', 'under', 'tropical', 'That', 'lord', 'point', 'use', 'sovereign', 'winter', 'wants', 'to', 'second', 'Uther', 'ounce', 'together', 'five', 'by', '!', '1', 'coconut', ']', 'and', 'Patsy', 'Saxons', '2', 'join', 'or', 'pound', 'Halt', 'there', 'goes', 'is', 'Yes', 'seek', 'go', 'master', 'do', 'that', 'martin', "'re", 'warmer', 'son', 'ridden', 'SOLDIER', 'with', 'could', ',', "'em", 'Mercea', 'agree', 'mean', 'A', 'KING', 'clop', 'one', 'court', 'You', 'It', 'two', '--', 'get', 'our', 'must', 'They', 'Pendragon', 'forty-three', 'south', 'back', 'bird', 'ratios', '?', 'migrate', 'here', 'breadth', 'will', 'he', 'empty', 'other', 'creeper', "'d", 'the', 'So', 'Will', 'Wait', 'Arthur', 'why', 'got', 'interested', 'have', 'house', 'defeator', 'where', 'Britons', 'dorsal', 'In', 'carried', 'maintain', 'your', 'halves', 'Listen', 'Supposing', 'my', 'length', 'We', 'found', 'temperate', 'Court', 'strangers', 'tell', 'anyway', '[', 'on', 'through', 'Whoa', 'you', 'be', 'England', 'horse', 'yeah', 'am', 'just', "'s", "n't", '#', 'all', 'may', 'African', 'Are', 'land', 'in', 'line', 'Well', 'needs', 'this', 'European', 'Where', 'zone', 'ARTHUR', 'minute', 'castle', 'from', 'Please', 'trusty', 'swallows', "'m", 'then', 'these', 'every', 'simple', 'grip', 'held', 'strand', 'guiding', 'maybe', '...', "'", 'swallow', 'speak', 'weight', 'Found', 'beat', 'me', 'bangin', 'but', 'King', 'What', 'SCENE', 'are', 'kingdom', 'grips', 'velocity', 'suggesting', 'times', 'Pull', 'of', 'Camelot', 'But', 'The', 'question', 'an', 'Ridden', 'snows', '.', "'ve", 'No', 'matter', 'at', 'covered', 'servant', 'who', 'order', 'non-migratory', 'them', 'climes', 'Oh', 'feathers', 'it', 'Am', 'air-speed', 'its', 'plover', 'carrying', 'wings', 'Who', 'wind', 'ask', 'knights', 'I', 'bring', 'if', 'not', 'a', 'does', 'yet', 'course', 'coconuts', 'carry', ':', 'right', 'search', 'sun', 'since'}
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)
# Print the start and end indexes of match
print(match.start(), match.end())
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"
# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))
580 588
<_sre.SRE_Match object; span=(9, 32), match='[wind] [clop clop clop]'>
<_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>
Advanced Tokenisation
- union |
- grouping ()
- explicit char range []
- ex:
- digits-or-words pattern: ('(\d+|\w+)')
- range and group
- special characters escaped with “\” (e.g. \. for a literal period)
- [A-Za-z]+ “upper/lowercase alphabet”
- [0-9] “numeric 0-9”
- [A-Za-z-.]+ “upper/lowercase, - and .”
- (a-z) “the literal a, - and z, as a GROUP”
- (\s+|,) “spaces or a comma”
- e.g. ‘[a-z0-9 ]+’ means lowercase letters, digits and SPACE, matched greedily -> .match() will stop at any punctuation, since it is not in the class
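A quick check of that point (the example sentence is made up):
import re

# lowercase letters, digits and the space are in the class, but the comma is not,
# so the greedy match stops right before it
print(re.match('[a-z0-9 ]+', 'lowercase words and digits like 12, but no commas'))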
my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
# to tokenise by words & punctuation & retain #1
from nltk.tokenize import regexp_tokenize
regexp_tokenize(my_string, '(\w+|#\d|\?|!)')
['SOLDIER',
'#1',
'Found',
'them',
'?',
'In',
'Mercea',
'?',
'The',
'coconut',
's',
'tropical',
'!']
# Tweet tokenisation
tweets = ['This is the best #nlp exercise ive found online! #python',
'#NLP is super fun! <3 #learning',
'Thanks @datacamp :) #nlp #python']
from nltk.tokenize import TweetTokenizer
# first use regexp via key symbols
print(regexp_tokenize(str(tweets), '[@#]\w+'))
# then specialised class
tweetTK = TweetTokenizer()
all_tokens = [tweetTK.tokenize(t) for t in tweets]
print('\n', all_tokens)
['#nlp', '#python', '#NLP', '#learning', '@datacamp', '#nlp', '#python']
[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]
# Non-ASCII tokenisation
german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'
from nltk.tokenize import word_tokenize
word_tokenize
<function nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)>
Unicode ranges for emoji are:
- ('\U0001F300'-'\U0001F5FF')
- ('\U0001F600'-'\U0001F64F')
- ('\U0001F680'-'\U0001F6FF')
- ('\u2600'-'\u26FF', '\u2700'-'\u27BF')
all_words = word_tokenize(german_text)
print(all_words)
capital_german = r"[A-ZÜ]\w+"
print('\n', regexp_tokenize(german_text, capital_german))
emoji = "['\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF']"
print('\n', regexp_tokenize(german_text, emoji))
['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
['Wann', 'Pizza', 'Und', 'Über']
['🍕', '🚕']
Charting Word Length
- charts, graphs and animations
- to visualise the results
import matplotlib.pyplot as plt

with open('Data_Folder/TxT/grail.txt', mode='r') as file:
    grail_text = file.read()
grail_lines = grail_text.split(sep='\n')
# remove speaker labels (e.g. ARTHUR:, SOLDIER #1:) from each script line
pattern_speaker = "[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
grail_lines = [re.sub(pattern_speaker, '', line) for line in grail_lines]
grail_tokenised = [regexp_tokenize(s, "\w+") for s in grail_lines]
line_num_words = [len(t_line) for t_line in grail_tokenised]
plt.hist(line_num_words)
plt.show()
(array([916., 177., 52., 22., 9., 4., 4., 5., 1., 2.]),
array([ 0. , 10.3, 20.6, 30.9, 41.2, 51.5, 61.8, 72.1, 82.4,
92.7, 103. ]),
<a list of 10 Patch objects>)
Bag-of-word Counting
- Tokenise-Count flow
- word frequency is a common statistic (most or least frequent)
- e.g. lowercase all text to avoid duplicate counts
- e.g. preprocess by removing articles and other uninformative words
with open('Data_Folder/TxT/wiki_txt2/wiki_text_debugging.txt', 'r') as file:
wiki_debugging = file.read()
from collections import Counter
wiki_debugging_TK = word_tokenize(wiki_debugging)
wiki_debugging_lower = [t.lower() for t in wiki_debugging_TK]
bow_wiki_debugging = Counter(wiki_debugging_lower)
print(bow_wiki_debugging.most_common(10))
type(bow_wiki_debugging)
[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 66), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]
collections.Counter
Text Preprocessing
- preparing for ML or analysis
- e.g. token, bow, lowercasing, etc.
- Lemmatisation/Stemming: shortening words to their root stems
- keep only alphabetic tokens with .isalpha()
- remove stopwords:
from nltk.corpus import stopwords
[... if t not in stopwords.words('english')]
- trial and error, depending on the context
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
alpha_only = [t for t in wiki_debugging_lower if t.isalpha()]
english_stops = stopwords.words('english')
no_stops = [t for t in alpha_only if t not in english_stops]
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
bow = Counter(lemmatized)
print(bow.most_common(10))
[('debugging', 40), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]
GENSIM
- open-source NLP library implementing leading academic models for complex tasks
- Vectorising doc or word
- Topic spotting and comparison
- vector space, distance, similarity analysis for semantics and relation
- typically a sparse matrix format
- Example
- length(Male-Female) ~= length(King-Queen)
- length-diff(verb tenses)
- Node and Edge of capital-city pair
- In a nutshell, gensim levels up the token-id-count processing of documents and words
- from gensim.corpora.dictionary import Dictionary
- doc = ['list of strings']
- tokenise and lowercase the doc
- dict = Dictionary(doc_tk)
- dict.token2id is a {} of all token-ID allocations
- build your own corpus: [dict.doc2bow(doc) for doc in doc_tk]
- the corpus is now [[(id, count)], [] ..]
- Key
- ease of storing, updating and reusing
- the dict is mutable (see the sketch below)
- a more advanced and feature-rich BOW to use
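A small sketch of that mutability (the toy documents are made up), using Dictionary.add_documents to fold new documents into an existing dictionary:
from gensim.corpora.dictionary import Dictionary

docs_tk = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]
dictionary = Dictionary(docs_tk)
print(dictionary.token2id)        # token -> id mapping so far
# the dictionary is mutable: update it in place instead of rebuilding it
dictionary.add_documents([['the', 'cat', 'ran', 'home']])
print(dictionary.token2id)        # 'home' now has its own id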
# loop open files in filepath + pattern
import os
import glob
article_sum = []
for filepath in glob.glob(os.path.join('Data_Folder/TxT/wiki_txt2/', '*.txt')):
with open(filepath,'r') as file:
article_raw = file.read()
article_lower = [t.lower() for t in word_tokenize(article_raw)]
article_alpha = [t for t in article_lower if t.isalpha()]
article_tk = [t for t in article_alpha if t not in english_stops]
article_sum.append(article_tk)
len(article_sum)
for i in range(len(article_sum)):
print(len(article_sum[i]))
12
4095
2894
1051
2963
6947
1984
463
3235
2109
1271
722
3045
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(article_sum)
computer_id = dictionary.token2id.get("computer")
print(dictionary.get(computer_id))
corpus = [dictionary.doc2bow(article) for article in article_sum]
print(corpus[4][:10]) # first 10 (id, count) pairs of the 5th doc
computer
[(4, 1), (6, 6), (7, 2), (9, 5), (18, 1), (19, 1), (20, 1), (22, 1), (24, 2), (28, 3)]
doc_fifth = corpus[4]
doc_fifth[:4] # List of tuples (id, count)
# sort doc by frequency, or count
bow_doc_fifth = sorted(doc_fifth,
key = lambda w: w[1],
reverse=True)
# print top-5 (id + count)
for id, count in bow_doc_fifth[:5]:
print(dictionary.get(id), count)
from collections import defaultdict
total_word_count = defaultdict(int)
import itertools
for id, count in itertools.chain.from_iterable(corpus):
total_word_count[id] += count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)
for id, count in sorted_word_count[:5]:
print(dictionary.get(id), count)
[(4, 1), (6, 6), (7, 2), (9, 5)]
computer 251
computers 100
first 61
cite 59
computing 59
computer 597
software 450
cite 322
ref 259
code 235
Tf-idf + Gensim
- Term Frequency - Inverse Document Frequency
- Common model used to id importance in each document from the corpus
- Logic: each corpus may have shared words beyond just stopwords
- down-weights the importance of corpus-wide 'context' words
- e.g. in an astronomy corpus: ‘Sky’
- such context-heavy words are discounted
- while document-specific frequency is up-weighted
- Function
$w_{i,j} = tf_{i,j} * \log(\frac{N}{df_i})$
- $w_{i,j}$ = tf-idf weight for token i in doc j
- $tf_{i,j}$ = # occurrences of token i in doc j
- $df_i$ = # doc containing token i
- N = total # of doc
- E.g. ‘computer’ appears 5 times in a doc of 100 words; what is its weight, given a corpus of 200 docs of which 20 mention the word?
- (5/100) * log(200/20), checked in the snippet below
- tf = the word's share of all tokens in the doc; idf = log(total docs / docs containing the word)
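A quick check of the worked example (base-10 log assumed, as in textbook presentations; gensim's TfidfModel uses a different log base and normalisation, so its absolute values differ):
import math

tf = 5 / 100                 # 'computer' appears 5 times in a 100-word doc
idf = math.log10(200 / 20)   # 200 docs in the corpus, 20 of them contain the word
print(tf * idf)              # 0.05, since log10(10) == 1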
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
# apply the fitted TfidfModel to a document by indexing the model with it
tfidf_weights = tfidf[doc_fifth]
print(tfidf_weights[:5],'\n')
# sort weights highest to lowest
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
for id, weight in sorted_tfidf_weights[:5]:
print(dictionary.get(id), weight)
[(4, 0.005117037137639146), (6, 0.005095225240405224), (7, 0.00815539037400173), (9, 0.02558518568819573), (18, 0.003228490980266662)]
mechanical 0.1836016185939278
circuit 0.15046224827624213
manchester 0.14187397800439872
alu 0.13888822917806967
thomson 0.12731421007989718
Named Entity Recognition
- Motivation: an NLP task to identify important NEs in the text
- people, places, organisations
- dates, states, nouns
- Used alongside topic identification
Stanford CoreNLP Library
- Integrated into Python via nltk
- Java based
- API able
- Support for NER + coreference and dependency trees
- Built-in pos_tag, etc in nltk
with open('Data_Folder/TxT/News articles/uber_apple.txt','r') as file:
article_uber = file.read()
import nltk
# first tk article into sentences
sentence_uber = sent_tokenize(article_uber)
# second tk sentences into words
sentence_uber_tk = [word_tokenize(sent) for sent in sentence_uber]
# tag words into speech-part
pos_sentences = [nltk.pos_tag(sent) for sent in sentence_uber_tk]
# Create Named Entity chunks
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)
# Test for stems of Tree with 'NE' tags
for sent in chunked_sentences:
for chunk in sent:
if hasattr(chunk, "label") and chunk.label() == "NE":
print(chunk)
(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)
# Charting NE
ner_categories = defaultdict(int)
# loading a new article in non-binary form
with open('Data_Folder/TxT/News articles/articles.txt', 'r') as file:
article_nonBinary = file.read()
chunked_nonBinary = nltk.ne_chunk_sents(
[nltk.pos_tag(sent) for sent in
[word_tokenize(sent) for sent in
sent_tokenize(article_nonBinary)]], binary=False)
for sent in chunked_nonBinary:
for chunk in sent:
if hasattr(chunk, "label"):
ner_categories[chunk.label()] += 1
labels = list(ner_categories.keys())
values = [ner_categories.get(l) for l in labels]
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)
plt.show()
SpaCy - NLP library similar to Gensim
- Focus on creating pipeline to generate models and corpora
- Focus on getting things done (GTD), not academia - ONLY ONE NER per language, no frills!
- So what about the academic stuff?
- The focus is not cutting-edge algorithms; NLTK focuses on giving scholars a toolkit to play around with
- FEATURES OF SPACY
- Non-destructive tokenisation
- 21+ languages
- 6 statistical models for 5 languages
- pre-trained word vectors
- easy deep learning integration
- POS tagging
- NER
- Labeled DEPENDENCY parsing
- Syntax-driven sentence segmentation
- Built-in visual for syntax and NER
- easy string-to-hash mapping
- export to NPArray
- Efficient binary SERIALISATION
- easy model pkg and deployment
- Speed
- Robust, rigorously evaluated accuracy
- e.g. the online visualiser displaCy
- NER models to load
- including advanced German and Chinese
Why SpaCy for NER
- easy pipelineing
- different entity types
- informal corpora - tweets chat
- quickly growing
Language Model - stats model for NLP tasks
- need download separately
- language specific
- installation
spacy download en / de / es / fr / xx # multi-langue
spacy download en_core_web_sm # best matching version of the specific model for your spaCy install
- Loading language model
import spacy
nlp = spacy.load('en')
# process text via pipeline
doc = nlp(u'This is a sentence.') # the u prefix can be omitted in Python 3
Basic Cleaning by SpaCy
calling the above on Unicode text, spaCy first TOKENISES the text to make a Doc object, which is then processed through the pipeline steps
- tokenizer
- tensoriser
- tagger
- parser
- NER
- output Doc
Tokenizing Text
- splitting text into meaningful tokens/parts - words, puncs, numbers, special char, building blocks
- More than .split() - a syntax-aware split that handles cases like don’t or U.K. (see the sketch below)
- “don’t” -> 2 tokens {ORTH: “do”} and {“ORTH”: “n’t”, LEMMA: “not”}
- ORTH refers to the textual content, LEMMA to the base form without inflectional suffixes
- Creating your own tokenisers is covered in Linguistic Features
- Once done, the Doc object comprising the tokens is then worked on by the other components of the PIPELINE
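A small sketch of that non-destructive, syntax-aware split (reusing the example sentence from the tokenisation notes; the output shown is approximate):
import spacy

nlp = spacy.load('en')
doc = nlp(u"I don't like Sam's shoes in the U.K.")
print([token.text for token in doc])
# roughly: ['I', 'do', "n't", 'like', 'Sam', "'s", 'shoes', 'in', 'the', 'U.K.']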
POS tagging - tensorizer
- the tensoriser encodes the internal representation of the doc as an ARRAY of floats, necessary because neural networks work on tensors
- Mark token of sentence with proper part of speech
- use Stats Models to perform POS tagging
for token in doc:
    print(token.text, token.pos_)
Dependency Parsing
- while parsing refers to any analysis of a string of symbols to understand relationships, dependency parsing emphasises the DEPENDENCIES between words
- Subject or Object Noun
NER
- real-world object with name
- spacy built-in training but may need tuning/training
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
- Built-in NE types: PERSON, NORP, FACILITY, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE
Rule-based matching
ORTH, LOWER, UPPER, IS_ALPHA, IS_ASCII, IS_DIGIT, IS_PUNCT, IS_SPACE, IS_STOP, LIKE_NUM, LIKE_URL, LIKE_EMAIL, POS, TAG, DEP, LEMMA, SHAPE
- the default pipeline further annotates tokens with this extra info
- self-defined rules are possible (see the Matcher sketch below)
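A minimal self-defined rule, as a sketch using spaCy's Matcher and the token attributes above (spaCy 2.x signature; the pattern itself is made up):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# lower-cased 'hello', optional punctuation, then 'world'
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'world'}]
matcher.add('HelloWorld', None, pattern)
doc = nlp(u'Hello, world! Hello world!')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)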
Preprocessing
- stopwords flagged by the boolean token.is_stop attribute
- self-defined stopwords:
my_stops = ['say', 'be', 'said', 'says', 'saying', 'field']
for stopword in my_stops:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
# alternatively
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)
STOP_WORDS.add("additional_stopword_here")
STEMMING & LEMMATISATION
- Stemming often just chops off the end of a word following basic rules / CONTEXTLESS, not POS-based
- Lemmatisation instead conducts MORPHOLOGICAL analysis to find the root word
- the Stanford NLP book explains the difference
- the lemma is accessed via the .lemma_ attribute (contrast with stemming sketched below)
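A small contrast between the two, reusing NLTK's PorterStemmer and WordNetLemmatizer (toy words; the lemmatiser here is told the tokens are nouns):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ['studies', 'mice']:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='n'))
# studies -> 'studi' vs 'study': the stemmer chops the suffix, the lemmatiser returns a real word
# mice    -> 'mice'  vs 'mouse': only the morphological analysis recovers the singular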
Basic cleaning
doc = nlp('some text')
sentence = []
for w in doc:
    # skip newlines, stopwords, punctuation and number-like tokens
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        sentence.append(w.lemma_)
print(sentence)
- removing trash tokens; NOTE we append the lemmatised form of each word !!
- further removal based on need, e.g. removing all VERBs by checking the POS tag of each token !
RECAP: the spaCy pipeline annotates text easily, retaining info for later analysis; it is always the first task in NLP
- possible to annotate text with LOTs of info (tokenisation, stopwords, POS, NER, etc.)
- possible to TRAIN annotation models of your own, adding power to language models and the processing pipeline!
GENSIM - Vectorising Text and N-grams
- VECTORISING TEXT - BOW, TF-IDF, LSI (latent semantic indexing), WORD2VEC
GENSIM includes novel kits like LDA (Latent Dirichlet Allocation), Latent Semantic Analysis, Random Projections, Hierarchical Dirichlet Process, word2vec deep learning, and cluster computing
- Memory-efficient, scalable (generators/iterators; most IR algorithms boil down to matrix decomposition)
- Documentation
Vectorised Word
BOW - most straightforward
- Word-Freq mapped against VOCAB
- OrderLESS NO Spatial INFO - or semantics
- BUT in many cases Information Retrieval algorithms have no need for spatial info
- EX: spam filtering via Naive Bayes Classifier
TF-IDF
- Largely used in search engines to find relevant docs based on query
- Weighted Frequency against Occurrences
Other Forms
- Topic Models
- The Amazing Power of Word Vectors
Vectorisation in GENSIM
from gensim import corpora
documents = [u'list of text']
import spacy
nlp = spacy.load('en')
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        # keep only informative tokens and store their lemmas
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
print(texts)
# whipping up BOW for mini-corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
- a list of lists, each representing a document's BOW as (word_id, word_count) tuples
- NOTE
- the above case is fully RAM-loaded; in production you need to store the corpus on disk:
corpora.MmCorpus.serialize('/tmp/example.mm', corpus)
- thus stored on disk away from RAM; at most one vector resides in RAM at a time (see the gensim tutorial)
- From BOW to TF-IDF: tfidf is a table TRAINED on your own corpus, which involves simply going through the supplied corpus once and computing the document frequency of all features (other models, like LSA or LDA, are much more involved)
from gensim import models

tfidf = models.TfidfModel(corpus)
# Check the result
for document in tfidf[corpus]:
    print(document)
- the scores are in (0, 1), measuring the importance of a word in the corpus - usable in ML models, or you can further chain/link the vectors by performing other transformations on them (see below)
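A sketch of chaining transformations (assuming the corpus and dictionary built above): feed the tf-idf weighted corpus into an LSI model, another gensim transformation:
from gensim import models

tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)
for doc in lsi[tfidf[corpus]]:
    print(doc)   # each document as coordinates in the 2-dimensional LSI space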
N-gram plus more preprocessing
- calculated from the conditional probabilities of neighbouring words (collocations), based on the corpus provided
- this extra preprocessing precedes the DOC2BOW dictionary!
bigram = gensim.models.Phrases(texts) # resulting a trained bi-gram model for corpus, then transformation on new text
texts = [bigram[line] for line in texts] # each line having all possible bi-grams created
RECAP
- Preprocess can be as complex as need dictates
- EX removing both high-frequency and low-frequency words (GENSIM dictionary module)
- get rid of tokens occurring in < 20 documents, or in > 50% of documents:
dictionary.filter_extremes(no_below=20, no_above=0.5)
- EX remove the most frequent tokens, or prune out certain token IDs (see the sketch below)
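A sketch of those two options (assuming the dictionary built above; 'ref' and 'cite' are just illustrative junk tokens):
# drop the 3 most frequent tokens outright
dictionary.filter_n_most_frequent(3)

# or prune explicit token ids
bad_ids = [dictionary.token2id[t] for t in ['ref', 'cite'] if t in dictionary.token2id]
dictionary.filter_tokens(bad_ids=bad_ids)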
POS-Tagging and Applications
- What
- Not possible to tag POS unless the word is in a sentence or phrase
- SpaCy has 19 POS tags, exposed via the .pos_ and .tag_ attributes
- the Brown Corpus taggers used HMMs to predict tags (HMM - a sequential model)
- Current
- statistical models and deep learning (see the ACL list of results)
- spaCy's early tagger was an averaged perceptron
- Why
- historically for speech-to-text conversion and translation, to disambiguate homonyms
- Dependency parsing
- Demo by SpaCy Display
- PYTHONIC
- NLTK the main rival POS-tagger
import nltk
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
bigram_tagger = nltk.BigramTagger(train_sents)  # one of many tagger options in NLTK
bigram_tagger.tag(text)
- Resources Official Doc of tag module, NLTK book, Training POS
- Other Modules AI in Practice: Identifying Parts of Speech in Python
- TEXTBLOB is likely the ONLY other POS tagger worth a look, performing similarly to the one in spaCy - its algorithm was written by the spaCy maintainer (see Detail)
POS-Tagging in SpaCy (97% accuracy battery-packed)
- Built-in tagging in the PIPELINE: after spacy.load(‘en’) and loading the text,
for token in sent:
    print(token.text, token.pos_, token.tag_)
- EX fishy is a tricky word with multiple possible POS, but it is correctly machine-tagged using multiple FEATURES (e.g. surrounding POS, suffix-prefix, etc.)
- spaCy kills many metaphorical birds with the same stone!
Adding Defined Training Models
- Probabilistically improvable to fit the relevant context and data - see the official spaCy training process
- Training Loop:
- Provide text + part to train (entities, heads, deps, tags, cats)
import random

TRAIN_DATA = [
    ("Facebook has been accused for leaking personal data of users.", {"entities": [(0, 8, 'ORG')]}),
    ...]  # Facebook is the entity marked as ORG (start_index, end_index)
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
- Training a POS-tagger: example code below
- init a dictionary, define a mapping from your data's POS tags to the Universal POS tag set
- More data, better accuracy…
# nltk for POS-tagging
import nltk
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(text)
import spacy
nlp = spacy.load('en')
sent_0 = nlp(u'Mathieu and I went to the park.')
sent_1 = nlp(u'If Clement was asked to take out the garbage, he would refuse.')
sent_2 = nlp(u'Baptiste was in charge of the refuse treatment center.')
sent_3 = nlp(u'Marie took out her rather suspicious and fishy cat to go fish for fish.')
for token in sent_0:
print(token.text, token.pos_, token.tag_)
for token in sent_1:
print(token.text, token.pos_, token.tag_)
for token in sent_2:
print(token.text, token.pos_, token.tag_)
for token in sent_3:
print(token.text, token.pos_, token.tag_)
# training NER
import random

TRAIN_DATA = [
    ("Facebook has been accused for leaking personal data of users.", {'entities': [(0, 8, 'ORG')]}),
    ("Tinder uses sophisticated algorithms to find the perfect match.", {'entities': [(0, 6, "ORG")]})]
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
## run this code as a separate file
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
# See here for the Universal Tag Set:
# http://universaldependencies.github.io/docs/u/pos/index.html
# You may also specify morphological features for your tags, from the universal
# scheme.
TAG_MAP = {
'N': {'pos': 'NOUN'},
'V': {'pos': 'VERB'},
'J': {'pos': 'ADJ'}
}
# Usually you'll read this in, of course. Data formats vary. Ensure your
# strings are unicode and that the number of tags assigned matches spaCy's
# tokenization. If not, you can always add a 'words' key to the annotations
# that specifies the gold-standard tokenization, e.g.:
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'] 'tags': ['V', 'J', 'N']})
TRAIN_DATA = [
("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
("Eat blue ham", {'tags': ['V', 'J', 'N']})
]
@plac.annotations(
lang=("ISO Code of language to use", "option", "l", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(lang='en', output_dir=None, n_iter=25):
"""Create a new model, set up the pipeline and train the tagger. In order to
train the tagger with a custom tag map, we're creating a new Language
instance with a custom vocab.
"""
nlp = spacy.blank(lang)
# add the tagger to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
tagger = nlp.create_pipe('tagger')
# Add the tags. This needs to be done before you start training.
for tag, values in TAG_MAP.items():
tagger.add_label(tag, values)
nlp.add_pipe(tagger)
optimizer = nlp.begin_training()
for i in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_text = "I like blue eggs"
doc = nlp(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# [
# ('I', 'N', 'NOUN'),
# ('like', 'V', 'VERB'),
# ('blue', 'J', 'ADJ'),
# ('eggs', 'N', 'NOUN')
# ]
- For real cases, the training data is massive and assembling it is a huge part of the training work
- Here the ML model is abstracted behind update(); all we need to know is that it works well and is a neural network - for advanced use, this blog offers an NLTK process using self-defined classifiers from scikit-learn
- The spaCy POS training model is A Good Part-Of-Speech Tagger in about 200 Lines of Python, the same as TextBlob's
Small Examples of POS usage
- Change all Verbs to uppercase
def make_verb_upper(text, pos):
    return text.upper() if pos == "VERB" else text

doc = nlp(u'Tom ran swiftly and walked slowly')
text = ''.join(make_verb_upper(w.text_with_ws, w.pos_) for w in doc)
print(text)
- Count of POS
import pandas as pd

harry_potter = open("HP1.txt").read()
hp = nlp(harry_potter)
hpSents = list(hp.sents)
hpSentenceLengths = [len(sent) for sent in hpSents]
[sent for sent in hpSents if len(sent) == max(hpSentenceLengths)]
# share of each POS tag across the whole book
hpPOS = pd.Series(hp.count_by(spacy.attrs.POS)) / len(hp)
tagDict = {w.pos: w.pos_ for w in hp}
df = pd.DataFrame([hpPOS], index=['Harry Potter'])
df.columns = [tagDict[column] for column in df.columns]
df.T.plot(kind='bar')
# most common pronoun
hpPronouns = [w for w in hp if w.pos_ == 'PRON']
Counter([w.string.strip() for w in hpPronouns]).most_common(10)
Knowledge of POS tags enables more in-depth text analysis, a pillar of NLP; after tokenising text it is often the first piece of analysis
NER-Tagging and Applications
- Real-world object GPE, PER, ORG etc
- Named Entity Disambiguation NED
- Unlike POS-tagging, NER-tagging is CONTEXT AND DOMAIN BASED
- CONDITIONAL RANDOM FIELDS OFTEN USED TO TRAIN NER-TAGGER CRF: Probabilistic Models for Segmenting and Labeling Sequence Data
NER-Tagging in SpaCy (skipped NLTK case)
the power of spaCy's battery-packed pipeline: when loading a pre-trained model, all of the above (POS, NER) plus dependency parsing are produced from that single spacy.load() call
for token in sent_0:  # after nlp()
    print(token.text, token.ent_type_)
# recall spaCy intends the user to access entities via the doc.ents iterable - a slice of the Doc class is called a Span
for ent in sent_0.ents:
    print(ent.text, ent.label_)
NER-Tagging Training in SpaCy
# nltk for NER-tagging
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.chunk import conlltags2tree, tree2conlltags

sentence = "Clement and Mathieu are working at Apple."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
ne_tree = conlltags2tree(iob_tagged)
print(ne_tree)
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar', encoding='utf-8')
st.tag('Baptiste Capdeville is studying at Columbia University in NY'.split())
import spacy
nlp = spacy.load('en')
sent_0 = nlp(u'Donald Trump visited at the government headquarters in France today.')
sent_1 = nlp(u'Emmanuel Jean-Michel Frédéric Macron is a French politician serving as President of France and ex officio Co-Prince of Andorra since 14 May 2017.')
sent_2 = nlp(u'He studied philosophy at Paris Nanterre University, completed a Master’s of Public Affairs at Sciences Po, and graduated from the École nationale d\'administration (ÉNA) in 2004. ')
sent_3 = nlp(u'He worked at the Inspectorate General of Finances, and later became an investment banker at Rothschild & Cie Banque.')
for token in sent_0:
print(token.text, token.ent_type_)
for ent in sent_0.ents:
print(ent.text, ent.label_)
for token in sent_1:
print(token.text, token.ent_type_)
for ent in sent_1.ents:
print(ent.text, ent.label_)
for token in sent_2:
print(token.text, token.ent_type_)
for ent in sent_2.ents:
print(ent.text, ent.label_)
for token in sent_3:
print(token.text, token.ent_type_)
for ent in sent_3.ents:
print(ent.text, ent.label_)
2 Training Examples: Blank Model and Adding a New Entity
The same principles as the POS-tagger:
- Start by adding the NER label to the pipeline
- Disable all other components of the PIPELINE so that only the NER-tagger is trained/updated
- Training happens in the backend, accessed via the nlp.update() API
# run this code separately
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
# training data
TRAIN_DATA = [
('Who is Shaka Khan?', {
'entities': [(7, 17, 'PERSON')]
}),
('I like London and Berlin.', {
'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
})
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
"""Load the model, set up the pipeline and train the entity recognizer."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# create the built-in pipeline components and add them to the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
ner = nlp.get_pipe('ner')
# add labels
for _, annotations in TRAIN_DATA:
for ent in annotations.get('entities'):
ner.add_label(ent[2])
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update(
[text], # batch of texts
[annotations], # batch of annotations
drop=0.5, # dropout - make it harder to memorise data
sgd=optimizer, # callable to update weights
losses=losses)
print(losses)
# test the trained model
for text, _ in TRAIN_DATA:
doc = nlp(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
doc = nlp2(text)
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# Entities [('Shaka Khan', 'PERSON')]
# Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
# ('Khan', 'PERSON', 1), ('?', '', 2)]
# Entities [('London', 'LOC'), ('Berlin', 'LOC')]
# Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
# ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
Adding New Class to Model
Same principle:
- Load the model, disable the pipes not being updated
- Add the new label, then loop over the examples and update them
- Actual training is performed by looping over the examples and calling nlp.entity.update()
- update() predicts each word, then consults the annotations provided on the GoldParse instance to check the prediction
- If wrong, it adjusts the weights so the correct action scores higher next time
# run this code separately:
"""Example of training an additional entity type
This script shows how to add a new entity type to an existing pre-trained NER
model. To keep the example short and simple, only four sentences are provided
as examples. In practice, you'll need many more — a few hundred would be a
good start. You will also likely need to mix in examples of other entity
types, which might be obtained by running the entity recognizer over unlabelled
sentences, and adding their annotations to the training set.
The actual training is performed by looping over the examples, and calling
`nlp.entity.update()`. The `update()` method steps through the words of the
input. At each word, it makes a prediction. It then consults the annotations
provided on the GoldParse instance, to see whether it was right. If it was
wrong, it adjusts its weights so that the correct action will score higher
next time.
After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
# new entity label
LABEL = 'ANIMAL'
# training data
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
TRAIN_DATA = [
("Horses are too tall and they pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("Do they bite?", {
'entities': []
}),
("horses are too tall and they pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("horses pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("they pretend to care about your feelings, those horses", {
'entities': [(48, 54, 'ANIMAL')]
}),
("horses?", {
'entities': [(0, 6, 'ANIMAL')]
})
]
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
new_model_name=("New model name for model meta.", "option", "nm", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
"""Set up the pipeline and entity recognizer, and train the new entity."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'ner' not in nlp.pipe_names:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
ner = nlp.get_pipe('ner')
ner.add_label(LABEL) # add new entity label to entity recognizer
if model is None:
optimizer = nlp.begin_training()
else:
# Note that 'begin_training' initializes the models, so it'll zero out
# existing entity types.
optimizer = nlp.entity.create_optimizer()
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes): # only train NER
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, drop=0.35,
losses=losses)
print(losses)
# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
print(ent.label_, ent.text)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.meta['name'] = new_model_name # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
for ent in doc2.ents:
print(ent.label_, ent.text)
if __name__ == '__main__':
plac.call(main)
The rest of the code follows the same logic; the CRUCIAL DIFFERENCES are the training data, ADDING THE NEW CLASS, and remembering to ADD OLDER EXAMPLES TOO
- spaCy’s NER linguistic features with useful advice on HOW TO SET ENTITY ANNOTATIONS
- spaCy main PIPELINE, offers customisation at each STEP
- Backend is statistical model accepting FEATURES making PRED
- There are tutorials on how to build a CLASSIFIER or update the NLTK classifier
- complete guide to building own NER, Intro to NER, Performing Sequence Labelling using CRF
VISUALISATION
DEPENDENCY PARSING
- Parsing is possible on any form/kind of input that has a formal GRAMMAR
- 2 senses: the TRADITIONAL UNDERSTANDING vs the COMPUTATIONAL LINGUISTICS one - formal analysis by an algorithm, producing a PARSE TREE
- 2 SCHOOLS IN TRADITIONAL
- Dependency Parsing VS Phrase Structure Parsing
DP is the newer approach, credited to the French linguist Lucien Tesnière
- Constituency Parsing, on the other hand, is older, dating back to Aristotle's ideas on term logic.
- Formally credited to Noam Chomsky, father of modern linguistics
- IDEA: words in sentences are connected to each other with directed links - info about relationship between words
- IDEA: phrase structure parsing breaks sentences up into phrases or constituents - grouping parts of sentences
- spaCy uses SYNTACTIC (dependency) PARSING
Dependency Parsing in Python (NLTK)
- NLTK provides the most options in parsing methods, BUT you are forced to pass your own GRAMMAR for effective results
- our purpose is not to learn grammars before running computational-linguistics algorithms
- Demo below is how Stanford Dependency Parser wrapped NLTK
- First step to download necessary JAR files Historical Stanford Statistical Parser
# NLTK example, be sure to download JAR
from nltk.parse.stanford import StanfordDependencyParser
path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = next(result)
list(dep.triples())
Dependency Parsing in SpaCy
- the parsing part of the PIPELINE does both PHRASAL and DEPENDENCY parsing - you can get info about which NOUN and VERB chunks are in a sentence, as well as the dependencies between words
- Phrasal parsing can also be referred to as chunking (parts of sentences or phrases), exposed via the noun_chunks attribute
# spaCy
import spacy
nlp = spacy.load('en')
sent_0 = nlp(u'Myriam saw Clement with a telescope.')
sent_1 = nlp(u'Self-driving cars shift insurance liability toward manufacturers.')
sent_2 = nlp(u'I shot the elephant in my pyjamas.')
for chunk in sent_0.noun_chunks:
print(chunk.text, chunk.root.text, chunk.root.dep_,
chunk.root.head.text)
for chunk in sent_1.noun_chunks:
print(chunk.text, chunk.root.text, chunk.root.dep_,
chunk.root.head.text)
for chunk in sent_2.noun_chunks:
print(chunk.text, chunk.root.text, chunk.root.dep_,
chunk.root.head.text)
for token in sent_0:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
for token in sent_1:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
for token in sent_2:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
from spacy.symbols import nsubj, VERB
# Other ways navigating tree - identify one head per sentence via iterating possible subjects instead of verbs
verbs = set()
for possible_subject in sent_1:
if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
verbs.add(possible_subject.head)
# Iterate through all words and collect cases where a nominal subject's head is a verb
# possible to search for verbs directly, but that requires a double iteration
verbs = []
for possible_verb in doc:
if possible_verb.pos == VERB:
for possible_subject in possible_verb.children:
if possible_subject.dep == nsubj:
verbs.append(possible_verb)
break
# Also useful attributes: lefts, rights, n_lefts, n_rights give info about a particular token in the tree
# example to finding phrases using syntactic head
root = [token for token in sent_1 if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
assert subject is descendant or subject.is_ancestor(descendant)
print(descendant.text, descendant.dep_, descendant.n_lefts, descendant.n_rights,
[ancestor.text for ancestor in descendant.ancestors])
# Find the root by seeing which token's head is the token itself. The subject sits to the left of the tree; iterate the subject's subtree, printing descendants and their number of left/right children
# below: a more realistic example, finding the ADJs most commonly used to describe a character in a book
adjectives = []
for sent in book.sents:
    for word in sent:
        if 'Character' in word.string:
            for child in word.children:
                if child.pos_ == 'ADJ':
                    adjectives.append(child.string.strip())
Counter(adjectives).most_common(10)
- The code remains simple but effective - iterate over the book's sentences, look for the character in each sentence, then look at the character's children
- Then check whether each child is an ADJ. Being a child means it is likely a dependent whose root word (i.e. the Character) it describes
- Checking the most common ADJs gives a mini-analysis of the book's characters
Training DEP Parsers
- Again, spaCy abstracts away the hardest part of ML - selecting features
- All left to do is inputting proper training data and set up API to update models
# run the next code as a separate file
"""Example of training spaCy dependency parser, starting off with an existing
model or a blank model. For more details, see the documentation:
* Training: https://spacy.io/usage/training
* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
# training data
TRAIN_DATA = [
("They trade mortgage-backed securities.", {
'heads': [1, 1, 4, 4, 5, 1, 1],
'deps': ['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj', 'punct']
}),
("I like London and Berlin.", {
'heads': [1, 1, 1, 2, 2, 1],
'deps': ['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
})
]
# give examples of heads and dep labels - i.e. the verb is the word at index 0, with the DEPs clearly defined
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=10):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the parser to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'parser' not in nlp.pipe_names:
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
# otherwise, get it, so we can add labels to it
else:
parser = nlp.get_pipe('parser')
# add labels to the parser
for _, annotations in TRAIN_DATA:
for dep in annotations.get('deps', []):
parser.add_label(dep)
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_text = "I like securities."
doc = nlp(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc = nlp2(test_text)
print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
if __name__ == '__main__':
plac.call(main)
# expected result:
# [
# ('I', 'nsubj', 'like'),
# ('like', 'ROOT', 'like'),
# ('securities', 'dobj', 'like'),
# ('.', 'punct', 'like')
# ]
"""Above rather vanilla, below follows same style as POS and NER, but adding custom semantics.
WHY? Train parsers to understand new semantic relationships or DEP among words.
Particularly interesting as able to model own DEP FOR USE-CASE; BUT caution it may not alreays output CORRECT DEP.
"""
# run the next code as a separate file
#!/usr/bin/env python
# coding: utf-8
"""Using the parser to recognise your own semantics
spaCy's parser component can be used to trained to predict any type of tree
structure over your input text. You can also predict trees over whole documents
or chat logs, with connections between the sentence-roots used to annotate
discourse structure. In this example, we'll build a message parser for a common
"chat intent": finding local businesses. Our message semantics will have the
following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION.
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin
Compatible with: spaCy v2.0.0+
"""
from __future__ import unicode_literals, print_function
import plac
import random
import spacy
from pathlib import Path
# training data: texts, heads and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
("find a cafe with great wifi", {
'heads': [0, 2, 0, 5, 5, 2], # index of token head
'deps': ['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
}),
("find a hotel near the beach", {
'heads': [0, 2, 0, 5, 5, 2],
'deps': ['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
}),
("find me the closest gym that's open late", {
'heads': [0, 0, 4, 4, 0, 6, 4, 6, 6],
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'ATTRIBUTE', 'TIME']
}),
("show me the cheapest store that sells flowers", {
'heads': [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store!
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
}),
("find a nice restaurant in london", {
'heads': [0, 3, 3, 0, 3, 3],
'deps': ['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
}),
("show me the coolest hostel in berlin", {
'heads': [0, 0, 4, 4, 0, 4, 4],
'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
}),
("find a good italian restaurant near work", {
'heads': [0, 4, 4, 4, 0, 4, 5],
'deps': ['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE', 'LOCATION']
})
]
# Worth checking the training data: the new DEP labels mark qualities, attributes, etc. in the examples
# This feedback is vital when building a custom semantic information graph
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=5):
"""Load the model, set up the pipeline and train the parser."""
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# We'll use the built-in dependency parser class, but we want to create a
# fresh instance – just in case.
if 'parser' in nlp.pipe_names:
nlp.remove_pipe('parser')
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser, first=True)
for text, annotations in TRAIN_DATA:
for dep in annotations.get('deps', []):
parser.add_label(dep)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
with nlp.disable_pipes(*other_pipes): # only train parser
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer, losses=losses)
print(losses)
# test the trained model
test_model(nlp)
# save model to output directory
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)
def test_model(nlp):
texts = ["find a hotel with good wifi",
"find me the cheapest gym near work",
"show me the best hotel in berlin"]
docs = nlp.pipe(texts)
for doc in docs:
print(doc.text)
print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])
if __name__ == '__main__':
plac.call(main)
# Expected output:
# find a hotel with good wifi
# [
# ('find', 'ROOT', 'find'),
# ('hotel', 'PLACE', 'find'),
# ('good', 'QUALITY', 'wifi'),
# ('wifi', 'ATTRIBUTE', 'hotel')
# ]
# find me the cheapest gym near work
# [
# ('find', 'ROOT', 'find'),
# ('cheapest', 'QUALITY', 'gym'),
# ('gym', 'PLACE', 'find')
# ('work', 'LOCATION', 'near')
# ]
# show me the best hotel in berlin
# [
# ('show', 'ROOT', 'show'),
# ('best', 'QUALITY', 'hotel'),
# ('hotel', 'PLACE', 'show'),
# ('berlin', 'LOCATION', 'hotel')
# ]
The examples illustrate the real power of spaCy in creating custom models: both retraining a model with domain knowledge and training completely new DEPs
Topic Models
- The previous sections dealt with computational-linguistics algorithms and spaCy: how to use them to ANNOTATE data and decipher sentence structure, the finer details of text
- BUT now for the BIG PICTURE and themes!
- Definition
Probabilistic model carrying info on the topics in the text
1. A topic can be a theme or underlying idea, e.g. a news corpus topically covers weather, politics, sports, etc.
2. Useful to represent documents as topic DISTRIBUTIONS! (instead of BOW or TF-IDF)
3. Cluster by topics, then zoom into one topic to uncover deeper topics/themes!!
4. Chronological / time-stamped topic variation reveals info (DYNAMIC TOPIC MODELLING)
5. NOTE a TOPIC ~ Distribution(Words), NOT labelled or titled (e.g. weather is a collection of sun, temperature, storm, forecast)
6. Humans assign the topic name from the distribution
7. Theoretical papers: Blei LDA, Edwin Chen, Blei Probabilistic Topic Models
Topic Models in GENSIM
- Arguably the most popular TM library due to its many algorithms
- Hierarchical Dirichlet Process: non-parametric, no hyperparameter for the number of topics
- DTM - time-stamped evolution of topics
- unlikely to see the underlying topics themselves change, but rather their prominence and replacement
- GENSIM notebook
Topic Model in SKL
- Fast LDA and NMF (NonNegative Matrix Factorization)
- Differences from GENSIM:
- perplexity bounds are not expected to agree exactly, since the bound is computed differently - it reflects how topics CONVERGE in the TM algorithms
- SKL uses CYTHON, producing numerical differences around the 6th decimal place
- NMF is linear algebra: reconstructing a single matrix V as W and H, used to identify topics because they best represent the original V - the document matrix holding info on the words in the docs
- non-negativity is an inherent property of audio/text data; the factorisation has no closed-form solution and is approximated numerically by minimising a DISTANCE NORM, often the Euclidean (L2) norm or the Kullback-Leibler divergence
- NMF is used for dimensionality reduction, source separation, topic extraction, etc. - this example uses the generalised KL divergence, equivalent to Probabilistic Latent Semantic Indexing (PLSI)
- SKL has a consistent pipeline of fit, transform, predict; for DECOMPOSITION we ONLY USE FIT, then extract the components
# we need to first set up the text and corpus as it was done in section 3.3
# this refers to the code set-up in the Chapter 3
from gensim.models import LdaModel, LsiModel, HdpModel
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
ldamodel.show_topics()
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsimodel.show_topics(num_topics=5) # Showing only the top 5 topics
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)
hdpmodel.show_topics()
from sklearn.decomposition import NMF, LatentDirichletAllocation
nmf = NMF(n_components=no_topics).fit(tfidf_corpus)
lda = LatentDirichletAllocation(n_components=no_topics).fit(tf_corpus)
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
display_topics(nmf, tfidf_feature_names, no_top_words)
Advanced Topic Model
- The previous preprocessing (language model and vectorising/transformation) was geared more towards generating TMs than other forms of text-analysis algorithms
- e.g. lemmatisation instead of stemming is especially useful in TM, since lemmatised words tend to be more legible than stems
- Similarly, adding bi-grams or tri-grams to the CORPUS before applying TM gives further legibility
- TM is ultimately for human understanding, unlike clustering, which only targets higher accuracy
- Any preprocessing customised in pipeline conducive to that goal is preferred
- Multiple Runs of TM may required before any meaningful result, e.g. adding new stop words after viewing first TM
- EX removing lemmatised SAY
my_stop = [u'say', u'\'s', u'Mr', u'be', u'said', u'says', u'saying']
for stopword in my_stop:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
For every word to add as a stop word, change the is_stop attribute of that lexeme (the Lexeme class is case-insensitive here, so case can be ignored)
- A more common way to remove stop words is to put them all in a list and remove them from the corpus, e.g. in NLTK:
from nltk.corpus import stopwords; stopword_list = stopwords.words("english")
- Another way is GENSIM's Dictionary class and its filter_n_most_frequent(remove_n) method:
from gensim.corpora import Dictionary
corpus = [['mama', 'mela', 'maso'], ['ema', 'ma', 'mama']]
dct = Dictionary(corpus)
dct.filter_n_most_frequent(2)
This process of running TM, manually inspecting and changing as needed, is common in almost all ML or DS projects; what is extra in text is the human-interpretable nature of the results
HyperParam in TM
- GENSIM
  - chunksize: controls the number of docs processed at once in the training algorithm - trades speed for RAM fit
  - passes: controls how often the model is trained on the entire corpus, i.e. epochs
  - iterations: controls how often the loop over each doc is repeated, often set higher
- LdaModel (GENSIM) and LDA in SKL
  - Alpha represents document-topic density; the higher, the more topics per document
  - Beta represents topic-word density
  - Number of topics
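A minimal sketch of where these knobs live (corpus, dictionary and tf_corpus are assumed from the earlier set-up; the values here are illustrative, not tuned):
from gensim.models import LdaModel
from sklearn.decomposition import LatentDirichletAllocation
# GENSIM: alpha (doc-topic prior) and eta (topic-word prior) can be set or learned ('auto')
lda_gensim = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                      alpha='auto', eta='auto',
                      chunksize=2000, passes=10, iterations=100)
# SKL: the equivalent priors are doc_topic_prior (alpha) and topic_word_prior (beta)
lda_skl = LatentDirichletAllocation(n_components=10, doc_topic_prior=0.1,
                                    topic_word_prior=0.01).fit(tf_corpus)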
Logging is useful to monitor during training (GENSIM)
import logging
logging.basicConfig(filename='logfile.log', format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# document - topic proportions
ldamodel[corpus[0]]
# printing first topic
ldamodel.show_topics()[1]
texts = [['bank','river','shore','water'],
         ['river','water','flow','fast','tree'],
         ['bank','water','fall','flow'],
         ['bank','bank','water','rain','river'],
         ['river','water','mud','tree'],
         ['money','transaction','bank','finance'],
         ['bank','borrow','money'],
         ['bank','finance'],
         ['finance','money','sell','bank'],
         ['borrow','sell'],
         ['bank','loan','sell']]
model.get_term_topics('water')
model.get_term_topics('finance')
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']
bow = model.id2word.doc2bow(bow_water)  # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
- GENSIM FAQ and Chris Tufts blog
Exploring Documents after Satisfactory TM runs
As shown, the most likely topics associated with a word can vary based on context, differing from get_term_topics, which gives a STATIC topic distribution
1. NOTE the GENSIM implementation of LDA uses VARIATIONAL BAYES SAMPLING: a word_type in a doc is only given one topic distribution. E.g. in 'the bank by the river bank', the word is likely to be assigned to topic_0 and each instance of 'bank' has the same distribution
2. These two methods can be combined to infer further info from a TM - having topic distributions means being able to visualise, e.g. colour all words in a doc based on which topic they belong to, or use distance metrics to infer how close or far pairs of topics are
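A minimal sketch of using those per-word assignments for colouring (assuming the model and texts defined above; here the 'colour' is just a printed topic id):
bow = model.id2word.doc2bow(texts[0])
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
for word_id, topics in word_topics:
    # topics is ordered by relevance; the first entry is the dominant topic for this word
    print(model.id2word[word_id], '-> topic', topics[0] if topics else None)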
Topic Coherence and Evaluation
- Previous measures were more qualitative
- Topic Coherence overview and Exploring the Space of Topic Coherence
- In essence, TC is a quantitative measure of a TM, making it possible to compare two models; finding the RIGHT, OPTIMAL NUMBER OF TOPICS is the goal
- Before topic coherence, perplexity was used to measure model fit
- Resources
# coherence models
from gensim.models import CoherenceModel
lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=texts, dictionary=dictionary, window_size=10)
hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=texts, dictionary=dictionary, window_size=10)
lda_coherence = CoherenceModel(topics=ldatopics, texts=texts, dictionary=dictionary, window_size=10)
# train two models, one poorly trained (1 pass), and one trained with more passes (50 passes)
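# (a sketch of those two models, assuming the same corpus/dictionary/texts as above;
#  the pass counts mirror the comment, other values are illustrative)
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=50)
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=1)
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')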
print(goodcm.get_coherence())
print(badcm.get_coherence())
c_v = []
for num_topics in range(1, limit):
lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary, coherence='c_v')
c_v.append(cm.get_coherence())
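A common follow-up (a sketch, assuming matplotlib is available and that limit was set earlier) is to plot the coherence values and pick the topic count where the curve levels off:
import matplotlib.pyplot as plt
plt.plot(range(1, limit), c_v)
plt.xlabel('num_topics')
plt.ylabel('c_v coherence')
plt.show()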
Visualising TM
- TMs are better understood qualitatively on textual data - visualisation is one of the best ways to check them
- pyLDAvis is agnostic to the model trained - beyond GENSIM or LDA - it only requires the topic-term distributions and document-topic distributions, plus basic info on the corpus trained on:
import pyLDAvis.gensim
pyLDAvis.gensim.prepare(model, corpus, dictionary)
- model is a placeholder for the trained LDA model, for example
- Full notebook
- Visualising a live training model (coherence, perplexity, etc.)
- visdom server
- GENSIM Setup
- Further viewed as CLUSTERS via t-SNE
- Also clustering via WORD2VEC
- Dendrograms
- Visual Resources
Clustering and Classifying
Clustering
- RECAP
So far we have processed text or corpora via POS and NER to see what kinds of words are present, and used TM to seek hidden themes
1. TM could be used to cluster articles, BUT that is NOT its purpose!
2. E.g. after performing TM, a doc can be made of 30% topic 1, 30% topic 2, etc., hence there is no clean way to assign it to a single cluster
- Data points are documents or words
- EXTRA CAUTION with text: the high number of dimensions in a text vector !! - the entire vocab of the corpus (best mitigated via DimReduction such as SVD, LDA, LSI, etc.)
- Pipeline: remove stop words, lemmatise, vectorise
- Example
- Doc2Vec Clustering
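A minimal sketch of the Doc2Vec clustering idea (assuming docs is a list of tokenised documents; model sizes and cluster count are arbitrary):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(docs)]
d2v = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)
doc_vectors = [d2v.docvecs[i] for i in range(len(tagged))]
km_d2v = KMeans(n_clusters=4).fit(doc_vectors)
print(km_d2v.labels_)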
# using scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans, MiniBatchKMeans
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
data = dataset.data
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english', use_idf=True)
X = vectorizer.fit_transform(data)
from sklearn.decomposition import PCA
newsgroups_train = fetch_20newsgroups(subset='train',
categories=['alt.atheism', 'sci.space'])
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
])
X_visualise = pipeline.fit_transform(newsgroups_train.data).todense()
pca = PCA(n_components=2).fit(X_visualise)
data2D = pca.transform(X_visualise)
plt.scatter(data2D[:,0], data2D[:,1], c=newsgroups_train.target)
n_components = 5
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)
minibatch = True
if minibatch:
km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1, init_size=1000, batch_size=1000)
else:
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
# [The above bit of code is necessary because of our LSI transformation]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i)
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind])
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(X)
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(10, 15)) # set size
ax = dendrogram(linkage_matrix, orientation="right")
Classifying
# classification
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X, labels)
from sklearn.svm import SVC
svm = SVC()
svm.fit(X, labels)
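To sanity-check these classifiers, a quick hedged sketch of a held-out evaluation on the same LSA features (split ratio and random_state are arbitrary):
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
clf = SVC().fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))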
Similarity Queries and Summarisation
- Vectorised text opens the door to similarity or distance measures
Similarity Metrics
Similarity Queries
- extract the most similar documents for an input query - simply index each doc, then compute the distance between each corpus doc and the query, and return the doc with the lowest distance
# make sure to have appropriate gensim installations and imports done
texts = [['bank','river','shore','water'],
['river','water','flow','fast','tree'],
['bank','water','fall','flow'],
['bank','bank','water','rain','river'],
['river','water','mud','tree'],
['money','transaction','bank','finance'],
['bank','borrow','money'],
['bank','finance'],
['finance','money','sell','bank'],
['borrow','sell'],
['bank', 'loan', 'sell']]
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, ldamodel
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = TfidfModel(corpus)
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)
model.show_topics()
doc_water = ['river', 'water', 'shore']
doc_finance = ['finance', 'money', 'sell']
doc_bank = ['finance', 'bank', 'tree', 'water']
bow_water = model.id2word.doc2bow(doc_water)
bow_finance = model.id2word.doc2bow(doc_finance)
bow_bank = model.id2word.doc2bow(doc_bank)
lda_bow_water = model[bow_water]
lda_bow_finance = model[bow_finance]
lda_bow_bank = model[bow_bank]
tfidf_bow_water = tfidf[bow_water]
tfidf_bow_finance = tfidf[bow_finance]
tfidf_bow_bank = tfidf[bow_bank]
from gensim.matutils import kullback_leibler, jaccard, hellinger
hellinger(lda_bow_water, lda_bow_finance)
hellinger(lda_bow_finance, lda_bow_bank)
hellinger(lda_bow_bank, lda_bow_water)
hellinger(lda_bow_finance, lda_bow_water)
kullback_leibler(lda_bow_water, lda_bow_bank)
kullback_leibler(lda_bow_bank, lda_bow_water)
jaccard(bow_water, bow_bank)
jaccard(doc_water, doc_bank)
jaccard(['word'], ['word'])
def make_topics_bow(topic):
# takes the string returned by model.show_topics()
# split on strings to get topics and the probabilities
topic = topic.split('+')
# list to store topic bows
topic_bow = []
for word in topic:
# split probability and word
prob, word = word.split('*')
# get rid of spaces
word = word.replace(" ","")
# convert to word_type
word = model.id2word.doc2bow([word])[0][0]
topic_bow.append((word, float(prob)))
return topic_bow
topic_water, topic_finance = model.show_topics()
finance_distribution = make_topics_bow(topic_finance[1])
water_distribution = make_topics_bow(topic_water[1])
hellinger(water_distribution, finance_distribution)
from gensim import similarities
index = similarities.MatrixSimilarity(model[corpus])
sims = index[lda_bow_finance]
print(list(enumerate(sims)))
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, similarity in sims:
    print(texts[doc_id], similarity)
from gensim.summarization import summarize
print (summarize(text))
print (summarize(text, word_count=50))
from gensim.summarization import keywords
print (keywords(text))
from gensim.summarization import mz_keywords
mz_keywords(text,scores=True,weighted=False,threshold=1.0)
- Resources
Summarising Text
- GENSIM's summarisation algorithm is TextRank, from Mihalcea et al.
- Improved BM25 Ranking Function
- Montemurro and Zanette's MZ entropy-based keyword extraction algorithm
Word2Vec, Doc2Vec in GENSIM
- The word-embedding magic of W2V is how it manages to capture the semantic representation of words in a vector, as documented in many papers
- V(King) - V(Man) + V(Woman) ≈ V(Queen), or V(Vietnam) + V(Capital) ≈ V(Hanoi)
- Concept
W2V slides a window over the text, attempting to identify the conditional probability of observing an output word given its adjacent words
- Two Methods for W2V training
- Continuous BOW (CBOW)
- Skip Gram - Word2Vec Tutorial
- The amazing power of word vectors
- Resources Page: while it remains the most POPULAR word vectoriser, it was not the first attempt and will not be the last - others follow
W2V with GENSIM
- Code history
- Online Interactive Tutorial
- Importance of the word2vec class and KeyedVectors, on which tuning relies heavily
- List of params for word2vec.Word2Vec:
1. SG defines the algorithm: default=0 uses CBOW, =1 uses Skip-gram
2. SIZE: dimensionality of the feature vectors
3. WINDOW: max distance between the current and predicted word within a sentence
4. ALPHA: initial learning rate (linearly drops to min_alpha)
5. SEED: for the random number generator; initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). NOTE for a fully deterministically reproducible run, you must also LIMIT the model to a SINGLE WORKER THREAD, eliminating ordering jitter from OS thread scheduling
6. MIN_COUNT: ignore all words with a total frequency lower than this
7. MAX_VOCAB_SIZE: limit RAM during vocab building; if there are more unique words, prune the infrequent ones. Every 10m word types need about 1 GB of RAM (None for no limit, the default)
8. SAMPLE: threshold for configuring which higher-frequency words are randomly downsampled; default 1e-3, useful range (0, 1e-5)
9. WORKERS: use this many threads to train (faster with multicore)
10. HS: if 1, hierarchical softmax is used; if 0 and negative is non-zero, negative sampling is used
11. NEGATIVE: if > 0, negative sampling is used with this many noise words
12. CBOW_MEAN: if 0, use the sum of the context word vectors; if 1, use the mean
13. HASHFXN: hash function used to randomly initialise weights, for increased training reproducibility - the default is Python's rudimentary hash function
14. ITER: number of iterations or epochs over the corpus, default 5
15. TRIM_RULE: vocab trimming rule - discard if word count < min_count (if None, min_count is used; or a callable accepting params like word, count and min_count, returning utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT). NOTE if given, it is only used to prune during build_vocab and is not stored as part of the model
16. SORTED_VOCAB: if 1 (default), sort descending by frequency before assigning word indexes
17. BATCH_WORDS: target size in words of batches passed to worker threads, default 10k
- Notebook
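A minimal sketch of passing a few of these (sentences is assumed to be an iterable of token lists; values are illustrative):
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, sg=1, size=200, window=5, min_count=5,
                          negative=10, workers=1, seed=42)
# workers=1 plus a fixed seed keeps the run deterministically reproducible, as noted above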
- GENSIM Tutorial
- Training on a generic corpus is preferable - Text8 from Wikipedia
- GENSIM provides a similar API to download models using other word EMBEDDINGS
- Equipped to train, load models, use word embeddings to conduct experiments
# be sure to make appropriate imports and installations
from gensim.models import word2vec
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200, hs=1)
print(model)
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)[0]
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
model.wv['computer']
model.save("text8_model")
model = word2vec.Word2Vec.load("text8_model")
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
model.wv.similarity('woman', 'man')
model.wv.similarity('woman', 'cereal')
model.wv.distance('man', 'woman')
word_vectors = model.wv
del model
word_vectors.evaluate_word_pairs(os.path.join(module_path, 'test_data','wordsim353.tsv'))
word_vectors.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))
from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
Doc2Vec
- Extends Word2Vec with another vector, the paragraph ID
- One major difference in GENSIM is that it does not expect a simple corpus as input - the algorithm expects TAGS or LABELS as part of the input
gensim.models.doc2vec.LabeledSentence
# alternatively
gensim.models.doc2vec.TaggedDocument
sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
# in case of error, try
sentence = LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
> Here `sentence` is an example of what the input resembles
# LEE corpus
import os
import gensim
import smart_open
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'
def read_corpus(file_name, tokens_only=False):
with smart_open.smart_open(file_name) as f:
for i, line in enumerate(f):
if tokens_only:
yield gensim.utils.simple_preprocess(line)
else:
# For training data, add tags
yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
models = [
# PV-DBOW
Doc2Vec(dm=0, dbow_words=1, vector_size=200, window=8, min_count=10, epochs=50),
# PV-DM w/average
Doc2Vec(dm=1, dm_mean=1, vector_size=200, window=8, min_count=10, epochs =50),
]
models[0].build_vocab(documents)
models[1].reset_from(models[0])
for model in models:
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
new_model = ConcatenatedDoc2Vec((models[0], models[1]))
inferred_vector = model.infer_vector(train_corpus[0].words)
sims = model.docvecs.most_similar([inferred_vector])
print(sims)
> In practice, there is no need to test for the most similar vectors on the training set - this is just to illustrate
1. Note that in the list of docs most similar to doc 0, ID 0 shows up first - it is more interesting to check the 48th or 255th doc
Context is captured well by Doc2Vec when simply searching for the most similar doc - imagine the power it brings when used in tandem with clustering and classifying docs, instead of TF-IDF or TM as previously presented !!
Such is vectorisation with SEMANTIC understanding of both words and documents
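A rough sketch of that tandem use - feeding inferred document vectors to a classifier (assuming the trained Doc2Vec model and train_corpus/test_corpus from above, plus a matching list of labels, which is hypothetical here):
from sklearn.linear_model import LogisticRegression
train_vectors = [model.infer_vector(doc.words) for doc in train_corpus]
clf = LogisticRegression().fit(train_vectors, labels)
print(clf.predict([model.infer_vector(test_corpus[0])]))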
Other Word Embeddings
- GENSIM wraps most of the popular methods: WORDRANK, VAREMBED, FASTTEXT, POINCARE EMBEDDINGS
- A neat script to use GloVe embeddings, useful for comparing between different kinds of embeddings
- The KeyedVectors class is the base for using all word embeddings
- Key to note: run word_vectors = model.wv AFTER you are done training the model
- Then continue using word_vectors for all tasks - most similar words, most dissimilar, and running tests for word embeddings - see the source code of KeyedVectors.py
GloVe
- Training done on aggregated global word-word co-occurrence stats from a corpus - like Word2Vec, using context to decipher and create word representations
- Developed by the Stanford NLP Group; the paper is worth reading as it illustrates some of the pitfalls of LSA and Word2Vec
- Many implementations exist, including in Python - here we are not training but using it (for training you need to tweak glove_python or glove, or look at the source)
- GENSIM
- download or train GloVe vectors - save - convert the format to Word2Vec for further usage in the GENSIM API
- Download page
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
from gensim.models import KeyedVectors
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
FastText
- Developed by Facebook AI research; fast and efficient due to learning morphological details
- Unique in deriving word vectors for unknown words from the morphological characteristics of words, creating word vectors for unseen words
- Intriguing in some languages; in English e.g. the suffix 'ly':
embedding(strang) - embedding(strangely) ~= embedding(charming) - embedding(charmingly)
- Performs better for structural or syntactic tasks, while Word2Vec does better for semantic tasks
- Notebook and Doc
- Possible to use C++ via wrapper
- Notebook1
from gensim.models.wrappers.fasttext import FastText
# Set FastText home to the path to the FastText executable
ft_home = '/home/bhargav/Gensim/fastText/fasttext'
# train the model
model_wrapper = FastText.train(ft_home, train_file)
print('dog' in model_wrapper.wv.vocab)
print('dogs' in model_wrapper.wv.vocab)
print('dog' in model_wrapper)
print('dogs' in model_wrapper)
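# a hedged aside: because of the character n-grams, the trained model can also compose a
# vector for a word even if that exact form was never seen in the training vocab
print(model_wrapper['dogs'])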
WordRank
- Embedding as ranking - similar to GloVe in using global co-occurrences of words to generate vectors
- code and github
- GENSIM API - beware that the dump_period and iter params need to be kept in sync, as it dumps a file at the start of the next iteration; Tutorial - CAVEAT: a window size of 15 gave optimum results, and 100 epochs is better than 500, which takes quite long
- GOOD COMPARISON between FastText, Word2Vec and WordRank: blog and Notebook
Varembed
- Like FastText it takes morphological info to generate word vectors
- Similar to GloVe, you cannot update the model with new words and need to train a new model: code
from gensim.models.wrappers import varembed
varembed_vectors = '../../gensim/test/test_data/varembed_leecorpus_vectors.pkl'
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_vectors)
morfessors = '../../gensim/test/test_data/varembed_leecorpus_morfessor.bin'
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_vectors, morfessor_model=morfessors)
Poincare
- Also developed by Facebook - using a graph representation of words to decipher relationships between words and generate vectors
- Also captures hierarchical info, computed in hyperbolic space rather than with the traditional L2 norm, allowing hierarchy to be encoded
- Poincaré Embeddings for Learning Hierarchical Representations
import os
poincare_directory = os.path.join(os.getcwd(), 'docs', 'notebooks', 'poincare')
data_directory = os.path.join(poincare_directory, 'data')
wordnet_mammal_file = os.path.join(data_directory, 'wordnet_mammal_hypernyms.tsv')
# Training process
from gensim.models.poincare import PoincareModel, PoincareKeyedVectors, PoincareRelations
relations = PoincareRelations(file_path=wordnet_mammal_file, delimiter='\t')
model = PoincareModel(train_data=relations, size=2, burn_in=0)
model.train(epochs=1, print_every=500)
# Also use own iterable of relations to train model
# each relation is just a pair of nodes
# GENSIM also has pre-trained models as follows
models_directory = os.path.join(poincare_directory, 'models')
test_model_path = os.path.join(models_directory, 'gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50')
model = PoincareModel.load(test_model_path)
Deep Learning for Text
Generating Text
import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import numpy as np
# Any text source as input based on what kind of data to generate
# BE CREATIVE HERE - e.g. an RNN to write poetry if given enough data
# Need to generate a MAPPING of all distinct characters in the book (this LSTM is a char-level model)
filename = 'data/source_data.txt'
data = open(filename).read()
data = data.lower()
# Find all the unique characters
chars = sorted(list(set(data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
ix_to_char = dict((i, c) for i, c in enumerate(chars))
vocab_size = len(chars)
# The 2 dicts map chars to ints for the model and back again when generating
# The RNN accepts sequences of chars as input and outputs similar sequences - so break the text up into sequences
seq_length = 100
list_X = [ ]
list_Y = [ ]
for i in range(0, len(data) - seq_length, 1):
    seq_in = data[i:i + seq_length]
    seq_out = data[i + seq_length]
    list_X.append([char_to_int[char] for char in seq_in])
    list_Y.append(char_to_int[seq_out])
n_patterns = len(list_X)
X = np.reshape(list_X, (n_patterns, seq_length, 1))
# Encode output as one-hot vector
Y = np_utils.to_categorical(list_Y)
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Dropout to control overfitting
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(X, Y, epochs=20, batch_size=128, callbacks=callbacks_list)
# the callback saves the weights to a file whenever the loss improves
# OR transfer learning
filename = "weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
# generate text start text randomly
start = np.random.randint(0, len(X) - 1)
pattern = np.ravel(X[start]).tolist()
output = []
for i in range(250):
x = np.reshape(pattern, (1, len(pattern), 1))
x = x / float(vocab_size)
prediction = model.predict(x, verbose = 0)
index = np.argmax(prediction)
result = index
output.append(result)
pattern.append(index)
pattern = pattern[1 : len(pattern)]
print ("\"", ''.join([ix_to_char[value] for value in output]), "\"")
# IDEA: based on the X input, choose the highest-probability next char using argmax, convert that index back to a char, append it to the output list; the number of loop iterations = length of the generated output
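Argmax tends to loop on the same characters; a common alternative (a sketch, not part of the original code) is to sample the next index from the predicted distribution with a temperature:
def sample_with_temperature(prediction, temperature=0.8):
    # rescale the predicted char probabilities, then sample instead of taking argmax
    logits = np.log(prediction + 1e-8) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)
index = sample_with_temperature(model.predict(x, verbose=0)[0])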
Keras and SpaCy for DL
Keras and SpaCy
- Keras Submodule Text Preprocess
- KERAS KEN
SpaCy's TextCategorizer trains similarly to the other components such as POS and NER; it also integrates with other word embeddings such as GENSIM Word2Vec or GloVe, and plugs into Keras models. Combining spaCy and Keras gives a powerful classification machine
Classification with Keras
- On a small dataset such as IMDB you might get better results using simply BOW + SVM
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
Notes
1. Not using the text-preprocessing modules, as the IMDB dataset is already cleaned
2. An LSTM (a variant of RNN) is used for classification
3. The LSTM is merely a Layer inside the Sequential model
max_features = 20000
maxlen = 80 # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
max_features refers to the top words we wish to use, limited to 20k words; similar to getting rid of the least-used words
maxlen is used to fix the length, as the NN accepts a FIXED-LENGTH input
batch_size is used later to specify the batches trained on
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
Set up the Sequential model, stacking a word-embedding layer (20k features, projected down to 128 dimensions; other embedders could be used) and an LSTM with 128 dimensions
# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=15,
validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
- CNN need a few more params to tune
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb
# Convolution
kernel_size = 5
filters = 64
pool_size = 4
# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128
# LSTM
lstm_output_size = 70
# Training
batch_size = 30
epochs = 2
The above params affect training heavily and are empirically derived after experiments
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(filters,
kernel_size,
padding='valid',
activation='relu',
strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))
7 layers; the pooling layer progressively reduces the spatial size to reduce params, hence controlling overfitting
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
print('Train...')
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
Below, pretrained word embeddings are used in the classifier to improve the results
import os
import numpy as np
BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
# using the preceding vars/args to load the word embeddings
print('Indexing word vectors.')
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
for line in f:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))
# Simple loop through files
print('Preparing embedding matrix.')
# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
if i >= MAX_NUM_WORDS:
continue
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
# words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
# Make sure to set the trainable argument to False so the word vectors are used as-is
embedding_layer = Embedding(num_words,
EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=False)
print('Training model.')
# train a 1D convnet with global maxpooling
from keras.layers import Input, GlobalMaxPooling1D
from keras.models import Model
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
# Layers stacked differently with x var holding each layer
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
model.fit(x_train, y_train,
batch_size=128,
epochs=10,
validation_data=(x_val, y_val))
Here a different measure is used to calculate the loss; the above illustrates a basic LSTM, a CNN, and a CNN using pretrained word embeddings
1. See the progressive rise in performance of each network
2. Pretrained embeddings are especially useful when there is not much data
3. The CNNs generally perform better than the plain LSTM model; those using pretrained word embeddings do even better
4. Useful to train and compare with a non-NN model such as NB or SVM (see the sketch below)
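A hedged sketch of that non-NN baseline on the same Keras IMDB data (the integer sequences are turned into multi-hot bag-of-words vectors; none of this is from the original notebooks):
import numpy as np
from scipy.sparse import lil_matrix
from keras.datasets import imdb
from sklearn.naive_bayes import MultinomialNB
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)
def to_bow(sequences, num_words=20000):
    # 1 if the word index occurs in the review, 0 otherwise
    bow = lil_matrix((len(sequences), num_words), dtype='float32')
    for i, seq in enumerate(sequences):
        for j in set(seq):
            bow[i, j] = 1.0
    return bow.tocsr()
nb = MultinomialNB().fit(to_bow(x_train), y_train)
print(nb.score(to_bow(x_test), y_test))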
Classification with SpaCy
- While keras works esp. well in standalone text classification, it might be useuful to use Keras plus spaCy
- 2 Ways to Text Classification in SpaCy
- Own NN library THINC
- Keras
- Example 1 code
Set up
1. This example shows how to use an LSTM sentiment classification model trained using Keras in spaCy. spaCy splits the document into sentences, and each sentence is classified using the LSTM. The scores for the sentences are then aggregated to give the document score.
2. This kind of hierarchical model is quite difficult in “pure” Keras or TensorFlow, but it’s very effective. The Keras example on this dataset performs quite poorly, because it cuts off the documents so that they’re a fixed size. This hurts review accuracy a lot, because people often summarise their rating in the final sentence.
3. Prerequisite: spacy download en_vectors_web_lg / pip install keras==2.0.9 / Compatible with: spaCy v2.0.0+
import plac
import random
import pathlib
import cytoolz
import numpy
from keras.models import Sequential, model_from_json
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.layers import TimeDistributed
from keras.optimizers import Adam
import thinc.extra.datasets
from spacy.compat import pickle
import spacy
class SentimentAnalyser(object):
@classmethod
def load(cls, path, nlp, max_length=100):
with (path / 'config.json').open() as file_:
model = model_from_json(file_.read())
with (path / 'model').open('rb') as file_:
lstm_weights = pickle.load(file_)
embeddings = get_embeddings(nlp.vocab)
model.set_weights([embeddings] + lstm_weights)
return cls(model, max_length=max_length)
def __init__(self, model, max_length=100):
self._model = model
self.max_length = max_length
def __call__(self, doc):
X = get_features([doc], self.max_length)
y = self._model.predict(X)
self.set_sentiment(doc, y)
# set up class and how to load model and embedding weights
# INIT model, max length, instructions to predict
# Load method returns model to use in eval
# call gets features and pred
def pipe(self, docs, batch_size=1000, n_threads=2):
for minibatch in cytoolz.partition_all(batch_size, docs):
minibatch = list(minibatch)
sentences = []
for doc in minibatch:
sentences.extend(doc.sents)
Xs = get_features(sentences, self.max_length)
ys = self._model.predict(Xs)
for sent, label in zip(sentences, ys):
sent.doc.sentiment += label - 0.5
for doc in minibatch:
yield doc
def set_sentiment(self, doc, y):
doc.sentiment = float(y[0])
# Sentiment has a native slot for a single float.
# For arbitrary data storage, there's:
# doc.user_data['my_data'] = y
#
def get_labelled_sentences(docs, doc_labels):
labels = []
sentences = []
for doc, y in zip(docs, doc_labels):
for sent in doc.sents:
sentences.append(sent)
labels.append(y)
return sentences, numpy.asarray(labels, dtype='int32')
def get_features(docs, max_length):
docs = list(docs)
Xs = numpy.zeros((len(docs), max_length), dtype='int32')
for i, doc in enumerate(docs):
j = 0
for token in doc:
vector_id = token.vocab.vectors.find(key=token.orth)
if vector_id >= 0:
Xs[i, j] = vector_id
else:
Xs[i, j] = 0
j += 1
if j >= max_length:
break
return Xs
# Below is where all the heavy lifting takes place - NOTE the lines involving spaCy's sentencizer pipeline component
def train(train_texts, train_labels, dev_texts, dev_labels,
lstm_shape, lstm_settings, lstm_optimizer, batch_size=100,
nb_epoch=5, by_sentence=True):
print("Loading spaCy")
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
embeddings = get_embeddings(nlp.vocab)
model = compile_lstm(embeddings, lstm_shape, lstm_settings)
print("Parsing texts...")
train_docs = list(nlp.pipe(train_texts))
dev_docs = list(nlp.pipe(dev_texts))
if by_sentence:
train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
train_X = get_features(train_docs, lstm_shape['max_length'])
dev_X = get_features(dev_docs, lstm_shape['max_length'])
model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
nb_epoch=nb_epoch, batch_size=batch_size)
return model
# Like previously, set up each layer and stack them up
# Any Keras model would do; here a bidirectional LSTM
def compile_lstm(embeddings, shape, settings):
model = Sequential()
model.add(
Embedding(
embeddings.shape[0],
embeddings.shape[1],
input_length=shape['max_length'],
trainable=False,
weights=[embeddings],
mask_zero=True
)
)
model.add(TimeDistributed(Dense(shape['nr_hidden'], use_bias=False)))
model.add(Bidirectional(LSTM(shape['nr_hidden'],
recurrent_dropout=settings['dropout'],
dropout=settings['dropout'])))
model.add(Dense(shape['nr_class'], activation='sigmoid'))
model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy',
metrics=['accuracy'])
return model
# The evaluate method returns a score of how well the model performed; it checks the assigned sentiment score against the label of the document
def get_embeddings(vocab):
return vocab.vectors.data
def evaluate(model_dir, texts, labels, max_length=100):
def create_pipeline(nlp):
'''
This could be a lambda, but named functions are easier to read in Python.
'''
return [nlp.tagger, nlp.parser, SentimentAnalyser.load(model_dir, nlp,
max_length=max_length)]
nlp = spacy.load('en')
nlp.pipeline = create_pipeline(nlp)
correct = 0
i = 0
for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
i += 1
return float(correct) / i
# Using the IMDB sentiment analysis dataset; this method is an API to access the data
def read_data(data_dir, limit=0):
examples = []
for subdir, label in (('pos', 1), ('neg', 0)):
for filename in (data_dir / subdir).iterdir():
with filename.open() as file_:
text = file_.read()
examples.append((text, label))
random.shuffle(examples)
if limit >= 1:
examples = examples[:limit]
return zip(*examples) # Unzips into two lists
# Annotations set up options setting various model directories, runtime, and params
@plac.annotations(
train_dir=("Location of training file or directory"),
dev_dir=("Location of development file or directory"),
model_dir=("Location of output model directory",),
is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
nr_hidden=("Number of hidden units", "option", "H", int),
max_length=("Maximum sentence length", "option", "L", int),
dropout=("Dropout", "option", "d", float),
learn_rate=("Learn rate", "option", "e", float),
nb_epoch=("Number of training epochs", "option", "i", int),
batch_size=("Size of minibatches for training LSTM", "option", "b", int),
nr_examples=("Limit to N examples", "option", "n", int)
)
# Now the main functions
def main(model_dir=None, train_dir=None, dev_dir=None,
is_runtime=False,
nr_hidden=64, max_length=100, # Shape
dropout=0.5, learn_rate=0.001, # General NN config
nb_epoch=5, batch_size=100, nr_examples=-1): # Training params
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if train_dir is None or dev_dir is None:
imdb_data = thinc.extra.datasets.imdb()
if is_runtime:
if dev_dir is None:
dev_texts, dev_labels = zip(*imdb_data[1])
else:
dev_texts, dev_labels = read_data(dev_dir)
acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length)
print(acc)
else:
if train_dir is None:
train_texts, train_labels = zip(*imdb_data[0])
else:
print("Read data")
train_texts, train_labels = read_data(train_dir, limit=nr_examples)
if dev_dir is None:
dev_texts, dev_labels = zip(*imdb_data[1])
else:
dev_texts, dev_labels = read_data(dev_dir, imdb_data, limit=nr_examples)
train_labels = numpy.asarray(train_labels, dtype='int32')
dev_labels = numpy.asarray(dev_labels, dtype='int32')
lstm = train(train_texts, train_labels, dev_texts, dev_labels,
{'nr_hidden': nr_hidden, 'max_length': max_length, 'nr_class': 1},
{'dropout': dropout, 'lr': learn_rate},
{},
nb_epoch=nb_epoch, batch_size=batch_size)
weights = lstm.get_weights()
if model_dir is not None:
with (model_dir / 'model').open('wb') as file_:
pickle.dump(weights[1:], file_)
with (model_dir / 'config.json').open('wb') as file_:
file_.write(lstm.to_json())
if __name__ == '__main__':
plac.call(main)
The first few lines set up the model folder and load the dataset; the code then checks whether to print run-time info, and if not, proceeds to train and save the model
1. Running, saving and using the model in pipelines is the big motif behind combining Keras and spaCy in this way
2. The KEY here is updating the sentiment attribute for each doc (how is optional)
3. spaCy is GOOD at not removing or truncating input - users tend to sum up a review in the last sentence of the document, with a lot of sentiment inferred there
4. HOW to USE? the model adds one more attribute to the doc, doc.sentiment, capturing the score
5. Verify by loading the saved model and running any document through the pipeline the same way as the previous POS, NER and dependency-parsing examples: doc = nlp(document)
6. nlp is the pipeline object of the loaded, trained model, and document is any unicode text you wish to analyse
- Non-NN classifier
- Example below to run at once
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import thinc.extra.datasets
import spacy
from spacy.util import minibatch, compounding
# Not Keras, but SpaCy's THINC
@plac.annotations(
model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
output_dir=("Optional output directory", "option", "o", Path),
n_texts=("Number of texts to train from", "option", "t", int),
n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
textcat = nlp.get_pipe('textcat')
# add label to text classifier
textcat.add_label('POSITIVE')
# load the IMDB dataset
print("Loading IMDB data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
print("Using {} examples ({} training, {} evaluation)"
.format(n_texts, len(train_texts), len(dev_texts)))
train_data = list(zip(train_texts,
[{'cats': cats} for cats in train_cats]))
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes): # only train textcat
optimizer = nlp.begin_training()
print("Training the model...")
print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
for i in range(n_iter):
losses = {}
# batch up the examples using spaCy's minibatch
batches = minibatch(train_data, size=compounding(4., 32., 1.001))
for batch in batches:
texts, annotations = zip(*batch)
nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
losses=losses)
with textcat.model.use_params(optimizer.averages):
# evaluate on the dev data split off in load_data()
scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}' # print a simple table
.format(losses['textcat'], scores['textcat_p'],
scores['textcat_r'], scores['textcat_f']))
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)
if output_dir is not None:
output_dir = Path(output_dir)
if not output_dir.exists():
output_dir.mkdir()
nlp.to_disk(output_dir)
print("Saved model to", output_dir)
# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text)
print(test_text, doc2.cats)
# test the model with the evaluate method, calculating precision, recall and F-score; the last part saves the trained model in the output dir
def load_data(limit=0, split=0.8):
"""Load data from the IMDB dataset."""
# Partition off part of the train data for evaluation
train_data, _ = thinc.extra.datasets.imdb()
random.shuffle(train_data)
train_data = train_data[-limit:]
texts, labels = zip(*train_data)
cats = [{'POSITIVE': bool(y)} for y in labels]
split = int(len(train_data) * split)
return (texts[:split], cats[:split]), (texts[split:], cats[split:])
def evaluate(tokenizer, textcat, texts, cats):
docs = (tokenizer(text) for text in texts)
tp = 1e-8 # True positives
fp = 1e-8 # False positives
fn = 1e-8 # False negatives
tn = 1e-8 # True negatives
for i, doc in enumerate(textcat.pipe(docs)):
gold = cats[i]
for label, score in doc.cats.items():
if label not in gold:
continue
if score >= 0.5 and gold[label] >= 0.5:
tp += 1.
elif score >= 0.5 and gold[label] < 0.5:
fp += 1.
elif score < 0.5 and gold[label] < 0.5:
tn += 1
elif score < 0.5 and gold[label] >= 0.5:
fn += 1
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * (precision * recall) / (precision + recall)
return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
if __name__ == '__main__':
plac.call(main)
The final methods are similar to those used before in the MAIN function; one loads the data, the other evaluates; load_data returns the data appropriately shuffled and split, and the evaluate function counts true negatives, TP, FN and FP to create a confusion matrix
test_text = "This movie disappointed me severely"
doc = nlp(test_text)
print(test_text, doc.cats)
doc.cats gives the result of the classification, i.e. negative sentiment here
Such is the final step - testing the model on a sample. It also shows one of the main pros of spaCy for DL - it fits seamlessly into the PIPELINE, and the classification or sentiment score ends up being another attribute of the document. This is quite different to Keras, whose purpose is EITHER to generate text OR to output probability vectors (vector in, vector out). It is possible to leverage that info as part of a text-analysis pipeline, BUT spaCy does the training under the hood and attaches the learned attributes to the doc, which makes it easy to include the info as part of any text-analysis PIPELINE
Ideas for Project
Reddit Sense2Vec SpaCy
- Code: semantic analysis, modifiable via the source data and semantics, plus a web app for visualisation
Twitter Mining
Chatbot
- A Neural Conversational Model - Vinyal and Lee
- Production-grade API RASA NLU and ChatterBot
- RASA JSON-data Example
Adding more entities and intents, the model learns more context and better deciphers questions - one of the backends is spaCy and SKL
1. Under the hood, Word2Vec is used for intents, spaCy cleans up the text, and SKL builds the models - detail
2. One option involves being able to write your own parts of the bot instead of using the API
3. JSON entries/data to train RASA (see elsewhere)
- Front-end and Tutorial
- Non-AI but Learn-Concept at Chatbot Fundamentals and Brobot
- Recall the Concept of chatbot
- Take Input
- Classify intent (question, statement, greeting)
- If greeting -
- If question (query similar questions from the dataset, do sentence analysis, respond e.g. based on Reddit/Food or Twitter conversations)
- If statement/conversation (generative model)
- Closure
Polyglot NER - polyglot
- Word vectors
- Why? the main reason is language coverage - over 130 languages
- e.g. transliteration
- practice Spanish NER with polyglot
- the language is auto-detected once the Text object is initialised
from polyglot.text import Text
txt = Text(article_uber)
(Note: the import above failed here with ModuleNotFoundError: No module named 'icu' - polyglot needs the PyICU and pycld2 dependencies installed for language detection.)
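Once those dependencies are installed, a small sketch of the auto-detection mentioned above (article_uber is assumed to hold the article text):
from polyglot.text import Text
txt = Text(article_uber)
print(txt.language.code, txt.language.name)  # detected language of the text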
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)
# Print each of the entities found
for ent in txt.entities:
print(ent)
# Print the type of ent
print(type(ent))
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]
# Print entities
print(entities)
Spanish NER
# Initialize the count variable: count
count = 0
# Iterate over all the entities
for ent in txt.entities:
# Check whether the entity contains 'Márquez' or 'Gabo'
if "Márquez" in ent or "Gabo" in ent:
# Increment count
count += 1
# Print count
print(count)
# Calculate the percentage of entities that refer to "Gabo": percentage
percentage = count / len(txt.entities)
print(percentage)
SL with NLP
- Classification of fake news
- Use language instead of features
- Creating train data from text
- BOW or tf-idf as feature
IMDB Movie Example
- Plot text data
- Type of Film MultiCAT
- Target: Predict genre by plot
Possible Features in Text-Classification
- Frequency (BOW or tf-idf)
- Topic (Named Entities)
- Language
df_movie = pd.read_csv('Data_Folder/TxT/fake_or_real_news.csv')
df_movie.head()
df_movie.label[:5]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
y = df_movie.label
X_train, X_test, y_train, y_test = train_test_split(df_movie.text, y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
print(count_vectorizer.get_feature_names()[:10])
0 FAKE
1 FAKE
2 REAL
3 FAKE
4 REAL
Name: label, dtype: object
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']
# similar to sparse CountVectorizer, create tf-idf vectors
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
print(tfidf_vectorizer.get_feature_names()[:10])
print(tfidf_train.A[:5])
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
# some inspection
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
print(count_df.head(), '\n', tfidf_df.head(), '\n')
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference, '\n')
print(count_df.equals(tfidf_df))
00 000 0000 00000031 000035 00006 0001 0001pt 000ft 000km ... \
0 0 0 0 0 0 0 0 0 0 0 ...
1 0 0 0 0 0 0 0 0 0 0 ...
2 0 0 0 0 0 0 0 0 0 0 ...
3 0 0 0 0 0 0 0 0 0 0 ...
4 0 0 0 0 0 0 0 0 0 0 ...
حلب عربي عن لم ما محاولات من هذا والمرضى ยงade
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
[5 rows x 56922 columns]
00 000 0000 00000031 000035 00006 0001 0001pt 000ft 000km ... \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
حلب عربي عن لم ما محاولات من هذا والمرضى ยงade
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
[5 rows x 56922 columns]
set()
False
Testing using Naive Bayes Classifier
- The NB model is commonly used for testing NLP classification problems
- basis in probability
- likelihood estimation
- conditional probability of a token
- Does not work as well with floats like tf-idf (see the toy sketch below)
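A toy sketch of that probability idea - the class score is the prior times the product of per-token conditional probabilities (the counts here are made up, with Laplace smoothing):
# P(class | tokens) is proportional to P(class) * product of P(token | class)
token_counts = {'FAKE': {'aliens': 8, 'election': 2}, 'REAL': {'aliens': 1, 'election': 9}}
priors = {'FAKE': 0.5, 'REAL': 0.5}
vocab = {'aliens', 'election'}
def class_score(tokens, label):
    total = sum(token_counts[label].values())
    score = priors[label]
    for t in tokens:
        score *= (token_counts[label].get(t, 0) + 1) / (total + len(vocab))
    return score
print(class_score(['aliens', 'aliens'], 'FAKE'), class_score(['aliens', 'aliens'], 'REAL'))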
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print(score)
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
0.893352462936394
[[ 865 143]
[ 80 1003]]
# NB on tf-idf
nb_classifier_tfidf = MultinomialNB()
nb_classifier_tfidf.fit(tfidf_train, y_train)
pred_tfidf = nb_classifier_tfidf.predict(tfidf_test)
score_tfidf = metrics.accuracy_score(y_test, pred_tfidf)
print(score_tfidf)
cm_tfidf = metrics.confusion_matrix(y_test, pred_tfidf, labels=['FAKE', 'REAL'])
print(cm_tfidf)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
0.893352462936394
[[ 865 143]
[ 80 1003]]
Simple NLP, Complex Problems
- Translation, grammar in languages
- Word complexity in other languages
- Sentiment shift
- Semantics
- Custom in gender, meaning, etc
alphas = np.arange(0, 1, .1)
def train_and_predict(alpha):
nb_classifier = MultinomialNB(alpha=alpha)
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
return score
for alpha in alphas:
print('Alpha: ', alpha)
print('Score: ', train_and_predict(alpha))
print()
Alpha: 0.0
Score: 0.8813964610234337
Alpha: 0.1
Score: 0.8976566236250598
Alpha: 0.2
Score: 0.8938307030129125
Alpha: 0.30000000000000004
Score: 0.8900047824007652
Alpha: 0.4
/Users/Ocean/anaconda3/lib/python3.6/site-packages/sklearn/naive_bayes.py:472: UserWarning: alpha too small will result in numeric errors, setting alpha = 1.0e-10
'setting alpha = %.1e' % _ALPHA_MIN)
Score: 0.8857006217120995
Alpha: 0.5
Score: 0.8842659014825442
Alpha: 0.6000000000000001
Score: 0.874701099952176
Alpha: 0.7000000000000001
Score: 0.8703969392635102
Alpha: 0.8
Score: 0.8660927785748446
Alpha: 0.9
Score: 0.8589191774270684
class_labels = nb_classifier.classes_
feature_names = tfidf_vectorizer.get_feature_names()
# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))
# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20], '\n')
# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])
FAKE [(-13.817639290604365, '0000'), (-13.817639290604365, '000035'), (-13.817639290604365, '0001'), (-13.817639290604365, '0001pt'), (-13.817639290604365, '000km'), (-13.817639290604365, '0011'), (-13.817639290604365, '006s'), (-13.817639290604365, '007'), (-13.817639290604365, '007s'), (-13.817639290604365, '008s'), (-13.817639290604365, '0099'), (-13.817639290604365, '00am'), (-13.817639290604365, '00p'), (-13.817639290604365, '00pm'), (-13.817639290604365, '014'), (-13.817639290604365, '015'), (-13.817639290604365, '018'), (-13.817639290604365, '01am'), (-13.817639290604365, '020'), (-13.817639290604365, '023')]
REAL [(-6.172241591175732, 'republicans'), (-6.126896127062493, 'percent'), (-6.115534950553315, 'political'), (-6.067024557833956, 'house'), (-5.9903983888515535, 'like'), (-5.986816295469049, 'just'), (-5.97418288622825, 'time'), (-5.964034477506528, 'states'), (-5.949002396420198, 'sanders'), (-5.844483857160232, 'party'), (-5.728156816243612, 'republican'), (-5.63452121120962, 'campaign'), (-5.5727798946931095, 'new'), (-5.515621480853161, 'state'), (-5.511414074572205, 'obama'), (-5.482207812723569, 'president'), (-5.455931002028523, 'people'), (-4.98170150128453, 'clinton'), (-4.5936919152219655, 'trump'), (-4.477148234163137, 'said')]
ChatBot
bot_template = "BOT : {0}"
user_template = "USER : {0}"
# Define a function that responds to a user's message: respond
def respond(message):
    # Concatenate the user's message to the end of a standard bot response
    bot_message = "I can hear you! You said: " + message
    # Return the result
    return bot_message
import time
# Define a function that sends a message to the bot: send_message
def send_message(message):
# Print user_template including the user_message
print(user_template.format(message))
# Get the bot's response to the message
response = respond(message)
    time.sleep(1.5)  # artificial delay mimicking natural typing
print(bot_template.format(response))
# Send a message to the bot
send_message("hello")
send_message("What did I say?")
USER : hello
BOT : I can hear you! You said: hello
USER : What did I say?
BOT : I can hear you! You said: What did I say?
Personification
- chatbot not command line
- fun and use
- python module Smaltalk
- Simple: dict{key:response} , exact match-only
- Variable: response = dict{options}
- Asking Questions: input
import random
name = "Greg"
weather = "cloudy"
# Define a dictionary containing a list of responses for each message
responses = {
"what's your name?": [
"my name is {0}".format(name),
"they call me {0}".format(name),
"I go by {0}".format(name)
],
"what's today's weather?": [
"the weather is {0}".format(weather),
"it's {0} today".format(weather)
],
"default": ["default message"]
}
# Use random.choice() to choose a matching response
def respond(message):
# Check if the message is in the responses
if message in responses:
# Return a random matching response
bot_message = random.choice(responses[message])
else:
# Return a random "default" response
bot_message = random.choice(responses["default"])
return bot_message
import random
responses_question = {'question': ["I don't know :(",
'you tell me!'],
'statement':
['tell me more!', 'why do you think that?',
'how long have you felt this way?',
'I find that extremely interesting',
'can you back that up?','oh wow!',
':)']}
def respond(message):
# Check for a question mark
if message.endswith("?"):
# Return a random question
return random.choice(responses_question["question"])
# Return a random statement
return random.choice(responses_question["statement"])
# Send messages ending in a question mark
send_message("what's today's weather?")
send_message("what's today's weather?")
# Send messages which don't end with a question mark
send_message("I love building chatbots")
send_message("I love building chatbots")
USER : what's today's weather?
BOT : I don't know :(
USER : what's today's weather?
BOT : I don't know :(
USER : I love building chatbots
BOT : I find that extremely interesting
USER : I love building chatbots
BOT : tell me more!
Using Regex to Match Pattern and Respond
import re
rules = {'I want (.*)':
['What would it mean if you got {0}',
'Why do you want {0}',
"What's stopping you from getting {0}"],
'do you remember (.*)':
['Did you think I would forget {0}',
"Why haven't you been able to forget {0}",
'What about {0}',
'Yes .. and?'],
'do you think (.*)':
['if {0}? Absolutely.', 'No chance'],
'if (.*)':
["Do you really think it's likely that {0}",
'Do you wish that {0}',
'What do you think about {0}',
'Really--if {0}']}
def match_rule(rules, message):
response, phrase = "default", None
for pattern, responses in rules.items():
match = re.search(pattern, message)
if match is not None:
response = random.choice(responses)
if '{0}' in response:
phrase = match.group(1)
return response, phrase
print(match_rule(rules, "nice"))
('default', None)
# Grammar, Pronouns change
def replace_pronouns(message):
message = message.lower()
if 'me' in message:
return re.sub('me', 'you', message)
if 'my' in message:
return re.sub('my', 'your', message)
if 'your' in message:
return re.sub('your', 'my', message)
if 'you' in message:
return re.sub('you', 'me', message)
return message
print(replace_pronouns("my last birthday"))
print(replace_pronouns("when you went to Florida"))
print(replace_pronouns("I had my own castle"))
your last birthday
when me went to florida
i had your own castle
# Combining previous two functions
def respond(message):
response, phrase = match_rule(rules, message)
if '{0}' in response:
phrase = replace_pronouns(phrase)
# Include the phrase in the response
response = response.format(phrase)
return response
send_message("do you remember your last birthday")
send_message("do you think humans should be worried about AI")
send_message("I want a robot friend")
send_message("what if you could be anything you wanted")
USER : do you remember your last birthday
BOT : What about my last birthday
USER : do you think humans should be worried about AI
BOT : if humans should be worried about ai? Absolutely.
USER : I want a robot friend
BOT : What's stopping you from getting a robot friend
USER : what if you could be anything you wanted
BOT : Really--if me could be anything me wanted
NLU - a subfield of NLP
- the logic loop:
    - user: "I'm looking for a Mexican restaurant in the center of town"
    - NLU model
        - intent - restaurant_search
        - entities - cuisine: Mexican, area: center
    - database lookup
    - bot: "Sure! What about Pepe's Burritos on Main St?"
keywords = {'goodbye': ['bye', 'farewell'],
'greet': ['hello', 'hi', 'hey'],
'thankyou': ['thank', 'thx']}
patterns = {}
for intent, keys in keywords.items():
patterns[intent] = re.compile('|'.join(keys))
print(patterns)
{'goodbye': re.compile('bye|farewell'), 'greet': re.compile('hello|hi|hey'), 'thankyou': re.compile('thank|thx')}
# Define a function to find the intent of a message
def match_intent(message):
matched_intent = None
for intent, pattern in patterns.items():
if pattern.search(message):
matched_intent = intent
return matched_intent
responses_intent = {'default': 'default message',
'goodbye': 'goodbye for now',
'greet': 'Hello you! :)',
'thankyou': 'you are very welcome'}
def respond(message):
intent = match_intent(message)
key = "default"
if intent in responses_intent:
key = intent
return responses_intent[key]
send_message("hello!")
send_message("bye byeee")
send_message("thanks very much!")
USER : hello!
BOT : Hello you! :)
USER : bye byeee
BOT : goodbye for now
USER : thanks very much!
BOT : you are very welcome
def find_name(message):
name = None
# Create a pattern for checking if the keywords occur
name_keyword = re.compile("name|call")
# Create a pattern for finding capitalized words
name_pattern = re.compile('[A-Z]{1}[a-z]*')
if name_keyword.search(message):
# Get the matching words in the string
name_words = name_pattern.findall(message)
if len(name_words) > 0:
# Return the name if the keywords are present
name = ' '.join(name_words)
return name
def respond(message):
name = find_name(message)
if name is None:
return "Hi there!"
else:
return "Hello, {0}!".format(name)
send_message("my name is David Copperfield")
send_message("call me Ishmael")
send_message("People call me Cassandra")
USER : my name is David Copperfield
BOT : Hello, David Copperfield!
USER : call me Ishmael
BOT : Hello, Ishmael!
USER : People call me Cassandra
BOT : Hello, People Cassandra!
ML in Chatbots
- predict the intent of a message
- many ways of vectorising text
- here: word vectors, which give words that appear in similar contexts similar representations
- high-quality pre-trained word vectors are available open source
spaCy module
- each word vector has hundreds of elements
- similarity is measured by the angle between vectors, not their distance
Cosine Similarity (sketched below)
- 1 if the vectors point in the same direction
- 0 if orthogonal
- -1 if opposite
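A minimal sketch of cosine similarity between spaCy word vectors (assumes the en_core_web_lg model used later in these notes is installed; cosine_sim is an illustrative helper, not part of the spaCy API):

import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

def cosine_sim(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# word vectors for a few tokens (300-dimensional in en_core_web_lg)
cat, dog, can = nlp("cat dog can")
print(cosine_sim(cat.vector, dog.vector))  # relatively high: similar contexts
print(cosine_sim(cat.vector, can.vector))  # lower: unrelated words

spaCy tokens also expose token.similarity(), which computes the same quantity from their vectors.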
# dataset: flight booking system interaction from ATIS
sentences_demo = [' i want to fly from boston at 838 am and arrive in denver at 1110 in the morning',
' what flights are available from pittsburgh to baltimore on thursday morning',
' what is the arrival time in san francisco for the 755 am flight leaving washington',
' cheapest airfare from tacoma to orlando',
' round trip fares from pittsburgh to philadelphia under 1000 dollars',
' i need a flight tomorrow from columbus to minneapolis',
' what kind of aircraft is used on a flight from cleveland to dallas',
' show me the flights from pittsburgh to los angeles on thursday',
' all flights from boston to washington',
' what kind of ground transportation is available in denver',
' show me the flights from dallas to san francisco',
' show me the flights from san diego to newark by way of houston',
' what is the cheapest flight from boston to bwi',
' all flights to baltimore after 6 pm',
' show me the first class fares from boston to denver',
' show me the ground transportation in denver',
' all flights from denver to pittsburgh leaving after 6 pm and before 7 pm',
' i need information on flights for tuesday leaving baltimore for dallas dallas to boston and boston to baltimore',
' please give me the flights from boston to pittsburgh on thursday of next week',
' i would like to fly from denver to pittsburgh on united airlines',
' show me the flights from san diego to newark',
' please list all first class flights on united from denver to baltimore',
' what kinds of planes are used by american airlines',
" i'd like to have some information on a ticket from denver to pittsburgh and atlanta",
" i'd like to book a flight from atlanta to denver",
' which airline serves denver pittsburgh and atlanta',
" show me all flights from boston to pittsburgh on wednesday of next week which leave boston after 2 o'clock pm",
' atlanta ground transportation',
' i also need service from dallas to boston arriving by noon',
' show me the cheapest round trip fare from baltimore to dallas']
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')  # large English model with word vectors (~1 GB)
n_sentences_demo = len(sentences_demo)
embedding_dim = nlp.vocab.vectors_length
X = np.zeros((n_sentences_demo, embedding_dim))
for idx, sentence in enumerate(sentences_demo):
doc = nlp(sentence) # pass each sent to nlp obj
X[idx, :] = doc.vector # pass .vector attr to row in X
From Word Vectors to ML Intent Classification
- map a message's sentence vector to an intent label
- nearest neighbour (kNN) is the simplest approach; a support vector machine usually does better
import pandas as pd

df_intent_train = pd.read_csv('Data_Folder/atis/atis_intents_train.csv', header=None)
df_intent_test = pd.read_csv('Data_Folder/atis/atis_intents_test.csv', header=None)
intent_X_train, intent_y_train = df_intent_train[:][1], df_intent_train[:][0]
intent_X_test, intent_y_test = df_intent_test[:][1], df_intent_test[:][0]
X_train = np.zeros((len(intent_X_train), nlp.vocab.vectors_length))
for idx, sent in enumerate(intent_X_train):
    X_train[idx, :] = nlp(sent).vector
# vectorising every training sentence takes a long time...
# simple method: label the test sentence with the intent of its most similar training sentence (cosine similarity of sentence vectors)
from sklearn.metrics.pairwise import cosine_similarity
X_test = nlp(intent_X_test[0]).vector
scores = [cosine_similarity(X_train[i, :].reshape(1, -1), X_test.reshape(1, -1)) for i in range(len(intent_X_train))]
intent_y_train[np.argmax(scores)]
'atis_flight'
# using Full X_test vectors for SVM
X_test = np.zeros((len(intent_X_test), nlp.vocab.vectors_length))
for idx, sent in enumerate(intent_X_test):
X_test[idx,:] = nlp(sent).vector
# SVC
from sklearn.svm import SVC
clf = SVC(C=1)
clf.fit(X_train, intent_y_train)
y_pred = clf.predict(X_test)
n_correct = 0
for i in range(len(intent_y_test)):
if y_pred[i] == intent_y_test[i]:
n_correct += 1
print("Predicted {0} correctly out of {1} test examples".format(n_correct, len(intent_y_test)))
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Predicted 689 correctly out of 800 test examples
Entity Extraction
- recognising unseen entities is tricky
- generalisation relies on patterns and contextual cues:
    - spelling
    - capitalisation
    - the sequence of surrounding words
- Pre-built NER
    - places, dates, orgs, etc.
- Roles
    - the same entity type can play different roles, e.g. "... from X to Y" vs "... to Y from X"
re.compile('.* from (.*) to (.*)')
re.compile('.* to (.*) from (.*)')
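A quick sketch of how these two patterns could assign origin/destination roles (assign_roles and the example messages are illustrative, not from the course):

import re

pattern_1 = re.compile('.* from (.*) to (.*)')
pattern_2 = re.compile('.* to (.*) from (.*)')

def assign_roles(message):
    # try each pattern; they capture the same entities in different orders
    for pattern in (pattern_1, pattern_2):
        match = pattern.match(message)
        if match:
            if pattern is pattern_1:
                origin, destination = match.groups()
            else:
                destination, origin = match.groups()
            return {'origin': origin, 'destination': destination}
    return {}

print(assign_roles("show me flights from boston to denver"))  # {'origin': 'boston', 'destination': 'denver'}
print(assign_roles("show me flights to denver from boston"))  # {'origin': 'boston', 'destination': 'denver'}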
Dependency Parsing is a complex topic
- spaCy exposes it through token attributes
- token.ancestors iterates over the parent tokens of a word in the parse tree
# Define included entities
include_entities = ['DATE', 'ORG', 'PERSON']
# Define extract_entities()
def extract_entities(message):
ents = dict.fromkeys(include_entities) # useful dict.fromkeys func
doc = nlp(message)
for ent in doc.ents:
if ent.label_ in include_entities:
# Save interesting entities
ents[ent.label_] = ent.text
return ents
print(extract_entities('friends called Mary who have worked at Google since 2010'))
print(extract_entities('people who graduated from MIT in 1999'))
{'DATE': '2010', 'ORG': 'Google', 'PERSON': 'Mary'}
{'DATE': '1999', 'ORG': 'MIT', 'PERSON': None}
# SpaCy's powerful syntax parser to assign ROLES to entities in message
doc = nlp("let's see that jacket in red and some blue jeans")
colors = ['black', 'red', 'blue']
items = ['shoes', 'handbag', 'jacket', 'jeans']
def entity_type(word):
_type = None
if word.text in colors:
_type = "color"
elif word.text in items:
_type = "item"
return _type
# Iterate over parents in parse tree until an item entity is found
def find_parent_item(word):
# Iterate over the word's ancestors
for parent in word.ancestors:
# Check for an "item" entity
if entity_type(parent) == "item":
return parent.text
return None
# For all color entities, find their parent item
def assign_colors(doc):
# Iterate over the document
for word in doc:
# Check for "color" entities
if entity_type(word) == "color":
# Find the parent
item = find_parent_item(word)
print("item: {0} has color : {1}".format(item, word))
# Assign the colors
assign_colors(doc)
item: jacket has color : red
item: jeans has color : blue
Robust NLU with Rasa
- high-level API for intent recognition & entity extraction
- built on spaCy, scikit-learn, and other libraries
- built-in support for chatbot-specific tasks
- training data supplied as a JSON file
Special case - handling typos and unseen words
- the 'intent_featurizer_ngrams' component adds character n-gram (sub-word) features, which can be predictive of intent
- only add it when the context calls for it, i.e. there is enough training data for the n-grams to be learned
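A hedged sketch of where such a component would sit in a custom pipeline, using the old rasa_nlu 0.x component names these notes rely on (verify the names against the installed version):

from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

pipeline = [
    "nlp_spacy",
    "tokenizer_spacy",
    "intent_featurizer_spacy",
    "intent_featurizer_ngrams",     # sub-word features for typo robustness - needs enough data
    "intent_classifier_sklearn"
]
config = RasaNLUConfig(cmdline_args={"pipeline": pipeline})
trainer = Trainer(config)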
# Import necessary modules
from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer
# Create args dictionary
args = {"pipeline": "spacy_sklearn"}
# Create a configuration and trainer
config = RasaNLUConfig(cmdline_args=args)
trainer = Trainer(config)
# Load the training data
training_data = load_data("./training_data.json")
# Create an interpreter by training the model
interpreter = trainer.train(training_data)
# Try it out
print(interpreter.parse("I'm looking for a Mexican restaurant in the North of town"))
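The parse result is a dict whose "intent" and "entities" fields are used throughout the rest of these notes; it is roughly of this shape (the confidence and character offsets below are illustrative, not actual output):

{
    "text": "I'm looking for a Mexican restaurant in the North of town",
    "intent": {"name": "restaurant_search", "confidence": 0.9},
    "entities": [
        {"entity": "cuisine", "value": "Mexican", "start": 18, "end": 25},
        {"entity": "location", "value": "North", "start": 44, "end": 49}
    ]
}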
Rasa
# Import necessary modules
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer
pipeline = [
"nlp_spacy",
"tokenizer_spacy",
"ner_crf"
]
# Create a config that uses this pipeline
config = RasaNLUConfig(cmdline_args={"pipeline": pipeline})
# Create a trainer that uses this config
trainer = Trainer(config)
# Create an interpreter by training the model
interpreter = trainer.train(training_data)
# Parse some messages
print(interpreter.parse("show me Chinese food in the centre of town"))
print(interpreter.parse("I want an Indian restaurant in the west"))
print(interpreter.parse("are there any good pizza places in the center?"))
Virtual Assistant
- accessing a SQL database
- bad practice: building queries with Python string formatting (e.g. "...{}".format(value)) - it opens the door to SQL injection
- good practice: parameterised queries, e.g. t = (area, price); c.execute("SELECT ... WHERE area=? AND price=?", t)
import sqlite3
conn = sqlite3.connect('Data_Folder/hotels.db')
conn
<sqlite3.Connection at 0x12a847c70>
c = conn.cursor()
c.execute("SELECT * FROM hotels WHERE area='south' and price='hi'")
<sqlite3.Cursor at 0x12a8339d0>
c.fetchall()
[('Grand Hotel', 'hi', 'south', 5)]
area, price = "south", "hi"
t = (area, price)
# key
c.execute('SELECT * FROM hotels WHERE area=? AND price=?', t)
print(c.fetchall())
<sqlite3.Cursor at 0x12a8339d0>
[('Grand Hotel', 'hi', 'south', 5)]
Exploring the DB with Natural Language
- Logic
    - use the trained Rasa interpreter to parse the message
    - the result contains a list of entities
    - store them as key-value pairs in a params = {} dict
    - build the query by adding a filter clause for each parameter
    - join the conditions with AND (" and ".join(filters))
- Responses: a default message when nothing matches; otherwise phrase the results with "and", "or", "but", etc.
# Define find_hotels()
def find_hotels(params):
# Create the base query
query = 'SELECT * FROM hotels'
# Add filter clauses for each of the parameters
if len(params) > 0:
filters = ["{}=?".format(k) for k in params]
query += " WHERE " + " and ".join(filters)
# Create the tuple of values
t = tuple(params.values())
# Open connection to DB
conn = sqlite3.connect('hotels.db')
# Create a cursor
c = conn.cursor()
# Execute the query
c.execute(query, t)
# Return the results
return c.fetchall()
The function above finds hotels matching any combination of the given parameters.
# Create the dictionary of column names and values
params = {"area": "south", "price":"lo"}
# Find the hotels that match the parameters
print(find_hotels(params))
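respond() below formats one of a responses list of templates that isn't reproduced in these notes, indexed by the number of results (capped at 3); a plausible stand-in (the wording is hypothetical):

responses = [
    "I'm sorry :( I couldn't find anything like that",  # 0 results
    "{0} is a great hotel!",                            # 1 result
    "{0} or {1} would work!",                           # 2 results
    "{0} is one option, but I know others too :)"       # 3 or more results
]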
# Define respond()
def respond(message):
# Extract the entities
entities = interpreter.parse(message)["entities"]
# Initialize an empty params dictionary
params = {}
# Fill the dictionary with entities
for ent in entities:
params[ent["entity"]] = str(ent["value"])
# Find hotels that match the dictionary
results = find_hotels(params)
# Get the names of the hotels and index of the response
names = [r[0] for r in results]
n = min(len(results),3)
# Select the nth element of the responses array
return responses[n].format(*names)
print(respond("I want an expensive hotel in the south of town"))
Incremental Slot Filling and Negation
- fill slots incrementally: remember the params collected so far and add to them on each turn
- Basic memory - save the params outside respond() and pass them back in
- Negation - filter the results by what the user has ruled out (or is unsure about)
    - a tricky topic
    - negated entities: a negation word ("no", "not", "n't", ...) preceding a named entity
    - e.g. 'not sushi, maybe pizza?'
# Define a respond function, taking the message and existing params as input
def respond(message, params):
# Extract the entities
entities = interpreter.parse(message)["entities"]
# Fill the dictionary with entities
for ent in entities:
params[ent["entity"]] = str(ent["value"])
# Find the hotels
results = find_hotels(params)
names = [r[0] for r in results]
n = min(len(results), 3)
# Return the appropriate response
return responses[n].format(*names), params
# Initialize params dictionary
params = {}
# Pass the messages to the bot
for message in ["I want an expensive hotel", "in the north of town"]:
print("USER: {}".format(message))
response, params = respond(message, params)
print("BOT: {}".format(response))
Basic Negation
# Define negated_ents()
def negated_ents(phrase):
# Extract the entities using keyword matching
ents = [e for e in ["south", "north"] if e in phrase]
# Find the index of the final character of each entity
ends = sorted([phrase.index(e) + len(e) for e in ents])
# Initialise a list to store sentence chunks
chunks = []
    # Take slices of the sentence up to and including each entity
start = 0
for end in ends:
chunks.append(phrase[start:end])
start = end
result = {}
# Iterate over the chunks and look for entities
for chunk in chunks:
for ent in ents:
if ent in chunk:
                # If the entity is preceded by a negation, give it the key False
if "not" in chunk or "n't" in chunk:
result[ent] = False
else:
result[ent] = True
return result
# Check that the entities are correctly assigned as True or False
for test in tests:
print(negated_ents(test[0]) == test[1])
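The respond() below calls two-argument versions of negated_ents() and find_hotels() that aren't reproduced in these notes: negated_ents(phrase, ent_vals) presumably generalises the hard-coded keyword list above to the extracted entity values, and find_hotels(params, neg_params) presumably adds exclusion clauses for the negated entities. A rough sketch of the first under those assumptions - note that respond() treats True as "negated", the opposite of the convention used just above:

def negated_ents(phrase, ent_vals):
    # find where each extracted entity value ends in the phrase
    ends = sorted([phrase.index(e) + len(e) for e in ent_vals if e in phrase])
    # split the phrase into chunks, each ending at an entity
    chunks, start = [], 0
    for end in ends:
        chunks.append(phrase[start:end])
        start = end
    result = {}
    # True means "negated", matching how respond() below uses the result
    for chunk in chunks:
        for ent in ent_vals:
            if ent in chunk:
                result[ent] = ("not" in chunk) or ("n't" in chunk)
    return result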
# Define the respond function
def respond(message, params, neg_params):
# Extract the entities
entities = interpreter.parse(message)["entities"]
ent_vals = [e["value"] for e in entities]
# Look for negated entities
negated = negated_ents(message, ent_vals)
for ent in entities:
if ent["value"] in negated and negated[ent["value"]]:
neg_params[ent["entity"]] = str(ent["value"])
else:
params[ent["entity"]] = str(ent["value"])
# Find the hotels
results = find_hotels(params, neg_params)
names = [r[0] for r in results]
n = min(len(results),3)
# Return the correct response
return responses[n].format(*names), params, neg_params
# Initialize params and neg_params
params = {}
neg_params = {}
# Pass the messages to the bot
for message in ["I want a cheap hotel", "but not in the north of town"]:
print("USER: {}".format(message))
response, params, neg_params = respond(message, params, neg_params)
print("BOT: {}".format(response))
Statefulness
- a policy maps (state, user input) to an action
- sophistication comes from making the selected action depend on context
- that requires extra memory: a state machine, e.g. a traffic light (3 states)
- state-dependent behaviour
    - e-commerce: browsing, getting info, completing an order, asking questions
- integers are used to represent the states
- example: coffee ordering
    - INIT state - an "order" intent moves to the CHOOSE_COFFEE state
    - ORDERED - reached once a coffee is specified from CHOOSE_COFFEE
    - the policy is a dict keyed by (state, intent), as below
# Define the INIT state
INIT = 0
# Define the CHOOSE_COFFEE state
CHOOSE_COFFEE = 1
# Define the ORDERED state
ORDERED = 2
# Define the policy rules
policy = {
(INIT, "order"): (CHOOSE_COFFEE, "ok, Columbian or Kenyan?"),
(INIT, "none"): (INIT, "I'm sorry - I'm not sure how to help you"),
(CHOOSE_COFFEE, "specify_coffee"): (ORDERED, "perfect, the beans are on their way!"),
(CHOOSE_COFFEE, "none"): (CHOOSE_COFFEE, "I'm sorry - would you like Colombian or Kenyan?"),
}
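send_message() and interpret() used below are not reproduced in these notes; a minimal sketch, assuming interpret() maps a message to one of the intent strings used in the policy keys (the keyword rules here are toy stand-ins for the NLU model):

def interpret(message):
    msg = message.lower()
    if 'order' in msg or 'coffee' in msg:
        return 'order'
    if 'kenyan' in msg or 'colombian' in msg:
        return 'specify_coffee'
    return 'none'

def send_message(policy, state, message):
    print("USER : {}".format(message))
    # look up the (new state, response) pair for the current state and recognised intent
    new_state, response = policy[(state, interpret(message))]
    print("BOT : {}".format(response))
    return new_state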
# Create the list of messages
messages = [
"I'd like to become a professional dancer",
"well then I'd like to order some coffee",
"my favourite animal is a zebra",
"kenyan"
]
# Call send_message() for each message
state = INIT
for message in messages:
state = send_message(policy, state, message)
# Define the states
INIT=0
CHOOSE_COFFEE=1
ORDERED=2
# Define the policy rules dictionary
policy_rules = {
(INIT, "ask_explanation"): (INIT, "I'm a bot to help you order coffee beans"),
(INIT, "order"): (CHOOSE_COFFEE, "ok, Columbian or Kenyan?"),
(CHOOSE_COFFEE, "specify_coffee"): (ORDERED, "perfect, the beans are on their way!"),
(CHOOSE_COFFEE, "ask_explanation"): (CHOOSE_COFFEE, "We have two kinds of coffee beans - the Kenyan ones make a slightly sweeter coffee, and cost $6. The Brazilian beans make a nutty coffee and cost $5.")
}
# Define send_messages()
def send_messages(messages):
state = INIT
for msg in messages:
state = send_message(state, msg)
# Send the messages
send_messages([
"what can you do for me?",
"well then I'd like to order some coffee",
"what do you mean by that?",
"kenyan"
])
# Define respond()
def respond(message, params, prev_suggestions, excluded):
# Interpret the message
parse_data = interpret(message)
# Extract the intent
intent = parse_data["intent"]["name"]
# Extract the entities
entities = parse_data["entities"]
# Add the suggestion to the excluded list if intent is "deny"
if intent == "deny":
excluded.extend(prev_suggestions)
# Fill the dictionary with entities
for ent in entities:
params[ent["entity"]] = str(ent["value"])
# Find matching hotels
results = [
r
for r in find_hotels(params, excluded)
if r[0] not in excluded
]
# Extract the suggestions
names = [r[0] for r in results]
n = min(len(results), 3)
suggestions = names[:2]
return responses[n].format(*names), params, suggestions, excluded
# Initialize the empty dictionary and lists
params, suggestions, excluded = {}, [], []
# Send the messages
for message in ["I want a mid range hotel", "no that doesn't work for me"]:
print("USER: {}".format(message))
response, params, suggestions, excluded = respond(message, params, suggestions, excluded)
print("BOT: {}".format(response))
Questions and Queuing Answers
- state machines build up memory and rules
- complexity reduction is needed
    - e.g. offering a coffee filter as an add-on during a sale
    - you could add extra states with policies for handling "yes" and "no", but every new state adds complexity
- solution: pending actions - the policy returns (action, pending_action), with pending_action possibly None
    - the pending action is saved in the outer scope
    - if the next intent is 'affirm', the pending action is executed; otherwise it is dropped
- pending state transitions - e.g. authentication: request the user's details -> transition to the order state -> complete the order
# Define policy()
def policy(intent):
# Return "do_pending" if the intent is "affirm"
if intent == "affirm":
return "do_pending", None
# Return "Ok" if the intent is "deny"
if intent == "deny":
return "Ok", None
if intent == "order":
return "Unfortunately, the Kenyan coffee is currently out of stock, would you like to order the Brazilian beans?", "Alright, I've ordered that for you!"
Incorporate it into the send_message() function:
# Define send_message()
def send_message(pending, message):
print("USER : {}".format(message))
action, pending_action = policy(interpret(message))
if action == 'do_pending' and pending is not None:
print("BOT : {}".format(pending))
else:
print("BOT : {}".format(action))
return pending_action
# Define send_messages()
def send_messages(messages):
pending = None
for msg in messages:
pending = send_message(pending, msg)
# Send the messages
send_messages([
"I'd like to order some coffee",
"ok yes please"
])
# Define the states
INIT=0
AUTHED=1
CHOOSE_COFFEE=2
ORDERED=3
# Define the policy rules
policy_rules = {
(INIT, "order"): (INIT, "you'll have to log in first, what's your phone number?", AUTHED),
(INIT, "number"): (AUTHED, "perfect, welcome back!", None),
(AUTHED, "order"): (CHOOSE_COFFEE, "would you like Columbian or Kenyan?", None),
(CHOOSE_COFFEE, "specify_coffee"): (ORDERED, "perfect, the beans are on their way!", None)
}
# Define send_messages()
def send_messages(messages):
state = INIT
pending = None
for msg in messages:
state, pending = send_message(state, pending, msg)
# Send the messages
send_messages([
"I'd like to order some coffee",
"555-12345",
"kenyan"
])
# Define chitchat_response()
def chitchat_response(message):
# Call match_rule()
response, phrase = match_rule(eliza_rules, message)
    # Return None if the response is "default"
if response == "default":
return None
if '{0}' in response:
# Replace the pronouns of phrase
phrase = replace_pronouns(phrase)
# Calculate the response
response = response.format(phrase)
return response
# Define send_message()
def send_message(state, pending, message):
print("USER : {}".format(message))
response = chitchat_response(message)
if response is not None:
print("BOT : {}".format(response))
return state, None
# Calculate the new_state, response, and pending_state
new_state, response, pending_state = policy_rules[(state, interpret(message))]
print("BOT : {}".format(response))
if pending is not None:
new_state, response, pending_state = policy_rules[pending]
print("BOT : {}".format(response))
if pending_state is not None:
pending = (pending_state, interpret(message))
return new_state, pending
# Define send_messages()
def send_messages(messages):
state = INIT
pending = None
for msg in messages:
state, pending = send_message(state, pending, msg)
# Send the messages
send_messages([
"I'd like to order some coffee",
"555-12345",
"do you remember when I ordered 1000 kilos by accident?",
"kenyan"
])
Frontiers of dialogue tech
- seq2seq has many applications, but is not yet practical for production chatbots
- it needs large amounts of context-relevant conversation data
- Neural Conversational Model (seq2seq)
    - borrowed from machine translation
    - an encoder network reads the message, building up a hidden vector that represents its meaning
    - a decoder then runs the process in reverse to generate the output sequence
    - a totally different approach: no specified intents or entities, entirely data-driven
- not easy to integrate DB lookups and API logic
    - the previous approaches are hand-crafted; seq2seq is data-driven
- ML-based components of a full system:
    - NLU
    - dialogue state manager
    - API logic (connector to the real world / DB)
    - NL response generator
- 'Wizard of Oz' technique: a human pretends to be the bot to collect realistic training conversations
- RL: the bot receives a reward for successful conversations and improves over time
Language generation
- in practice, free-form generation is not recommended for bots - responses are better crafted than generated
- but it is a fun topic
- a neural network trained on text from a given domain can generate new text in that style, e.g. Simpsons scripts
# Feed the 'seed' text into the neural network
seed = "i'm gonna punch lenny in the back of the"
# Iterate over the different temperature values
for temperature in [0.2, 0.5, 1.0, 1.2]:
print("\nGenerating text with riskiness : {}\n".format(temperature))
# Call the sample_text function
print(sample_text(seed, temperature))
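sample_text() and the trained network are not shown in these notes; the key idea is that the temperature ("riskiness") rescales the predicted next-character probabilities before sampling. A minimal sketch of that rescaling step, assuming preds is the model's probability vector over the next character:

import numpy as np

def sample_next(preds, temperature=1.0):
    # rescale log-probabilities by the temperature and renormalise
    preds = np.log(np.asarray(preds, dtype=np.float64) + 1e-10) / temperature
    exp_preds = np.exp(preds)
    probs = exp_preds / np.sum(exp_preds)
    # low temperature -> near-greedy, repetitive text; high temperature -> riskier, more varied text
    return np.random.choice(len(probs), p=probs)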
https://www.datacamp.com/community/tutorials/facebook-chatbot-python-deploy