# Distributional Semantics


For this notebook, we'll be using the 500-document Brown corpus included in NLTK.

In [1]:
from nltk.corpus import brown

This notebook is divided into two independent parts: the first uses pointwise mutual information (PMI) to distinguish good collocations, and the second builds a vector space model for document retrieval.

For the PMI portion, we'll use a function that extracts the information we need for a particular two-word collocation (the count of each word individually, the count of the collocation, and the total number of word tokens in the corpus) and then calculates the PMI:

In [2]:
import math

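def get_PMI_for_collocation_brown(word1, word2):
    """Return the base-2 PMI of word1 immediately followed by word2 in Brown."""
    word1_count = 0
    word2_count = 0
    both_count = 0
    total_count = 0.0 # so that division results in a float
    for sent in brown.sents():
        sent = [word.lower() for word in sent]
        for i in range(len(sent)):
            total_count += 1
            if sent[i] == word1:
                word1_count += 1
                # count the collocation only when word2 directly follows word1
                if i < len(sent) - 1 and sent[i + 1] == word2:
                    both_count += 1
            # a separate `if` (not `elif`) so that word1 == word2 is also counted
            if sent[i] == word2:
                word2_count += 1
    # note: raises an exception if either word or the collocation never occurs
    return math.log((both_count/total_count)/((word1_count/total_count)*(word2_count/total_count)), 2)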
                
        

Note that in a typical use case we probably wouldn't do it this way: we'd likely want to calculate PMI for many different word pairs, and the statistics for all of them can be collected in a single pass over the corpus, with the PMI then calculated in a separate function. Anyway, let's compare the PMI for two phrases, "hard work" and "some work":

In [3]:
print(get_PMI_for_collocation_brown("hard","work"))
print(get_PMI_for_collocation_brown("some","work"))

5.237244531670497
1.9135320271049516


Based on PMI, "hard work" appears to be a much better collocation than "some work", which matches our intuition. Go ahead and try this out with some other collocations.
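
The single-pass approach mentioned earlier could look roughly like the sketch below. It is only an illustration: the helper names `collect_counts` and `pmi` are made up for this example, and a few toy sentences stand in for `brown.sents()` so that the block runs on its own.

```python
import math
from collections import Counter

def collect_counts(sents):
    """Count unigrams and adjacent-word bigrams in one pass over tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sents:
        words = [w.lower() for w in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # pairs never cross sentence boundaries
    return unigrams, bigrams

def pmi(word1, word2, unigrams, bigrams):
    """Base-2 PMI of word1 immediately followed by word2; None if the pair is unseen."""
    total = sum(unigrams.values())
    both = bigrams[(word1, word2)]
    if both == 0:
        return None
    return math.log2((both / total) / ((unigrams[word1] / total) * (unigrams[word2] / total)))

# Toy sentences (hypothetical data) standing in for brown.sents()
toy_sents = [["Hard", "work", "pays"], ["some", "work", "is", "hard"], ["hard", "work"]]
unigrams, bigrams = collect_counts(toy_sents)
print(pmi("hard", "work", unigrams, bigrams))
print(pmi("some", "work", unigrams, bigrams))
```

Once the two counters are built, the PMI of any pair is a dictionary lookup rather than another pass over the corpus.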

For the second part of the notebook, let's create a sparse document-term matrix using scikit-learn. We build a document-term matrix rather than a term-document matrix because we will be performing SVD dimensionality reduction to produce dense document representations for document retrieval. Note that this is actually identical to creating a BOW feature representation for each document; the difference lies in how we use the representation.

In [4]:
from sklearn.feature_extraction import DictVectorizer

def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word.lower()] = BOW.get(word.lower(),0) + 1
    return BOW

texts = []
for fileid in brown.fileids():
    texts.append(get_BOW(brown.words(fileid)))

vectorizer = DictVectorizer()
brown_matrix = vectorizer.fit_transform(texts)
print(brown_matrix)


  (0, 49)	1.0
  (0, 58)	1.0
  (0, 169)	1.0
  (0, 181)	1.0
  (0, 205)	1.0
  (0, 238)	1.0
  (0, 322)	33.0
  (0, 373)	3.0
  (0, 374)	3.0
  (0, 393)	87.0
  (0, 395)	4.0
  (0, 405)	88.0
  (0, 454)	4.0
  (0, 465)	1.0
  (0, 695)	1.0
  (0, 720)	1.0
  (0, 939)	1.0
  (0, 1087)	1.0
  (0, 1103)	1.0
  (0, 1123)	1.0
  (0, 1159)	1.0
  (0, 1170)	1.0
  (0, 1173)	1.0
  (0, 1200)	3.0
  (0, 1451)	1.0
  :	:
  (499, 49161)	1.0
  (499, 49164)	1.0
  (499, 49242)	1.0
  (499, 49253)	1.0
  (499, 49275)	1.0
  (499, 49301)	1.0
  (499, 49313)	1.0
  (499, 49369)	1.0
  (499, 49385)	1.0
  (499, 49386)	4.0
  (499, 49390)	2.0
  (499, 49410)	2.0
  (499, 49446)	1.0
  (499, 49576)	1.0
  (499, 49590)	1.0
  (499, 49613)	3.0
  (499, 49691)	42.0
  (499, 49694)	3.0
  (499, 49697)	3.0
  (499, 49698)	1.0
  (499, 49707)	17.0
  (499, 49708)	1.0
  (499, 49710)	4.0
  (499, 49711)	1.0
  (499, 49797)	1.0


Our matrix is sparse: for instance, columns 0-48 in row 0 are empty and are simply left out. Only the (row, column) positions with non-zero values are displayed.
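
To make the printed format concrete, here is a small standard-library sketch that lists only the non-zero cells of a toy document-term matrix as (row, column) value entries. The toy documents and the choice of a sorted vocabulary for the column order are assumptions made for this illustration, not a description of scikit-learn's internals.

```python
from collections import Counter

# Two tiny tokenised documents (hypothetical data)
toy_docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]

# Columns follow the sorted vocabulary (an assumption made for this sketch)
vocab = sorted({w for doc in toy_docs for w in doc})
col = {w: j for j, w in enumerate(vocab)}

# Keep only the non-zero cells as (row, column, count) triples
triples = []
for i, doc in enumerate(toy_docs):
    for word, count in sorted(Counter(doc).items()):
        triples.append((i, col[word], float(count)))

for row, column, value in triples:
    print(f"  ({row}, {column})\t{value}")

n_cells = len(toy_docs) * len(vocab)
print(f"{len(triples)} of {n_cells} cells stored")
```

Even in this tiny case some cells are never stored; with roughly 50,000 columns per document, as in the output above, the saving is much larger.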

Rather than removing stopwords as we did for text classification, let's add some idf weighting to this matrix. Scikit-learn has a built-in tf-idf transformer for just this purpose.

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False,norm=None)

brown_matrix_tfidf = transformer.fit_transform(brown_matrix)

print(brown_matrix_tfidf)

  (0, 49646)	1.72981116493
  (0, 49613)	1.36816932336
  (0, 49596)	3.70663186543
  (0, 49386)	9.98833379406
  (0, 49378)	8.73162901565
  (0, 49313)	2.62964061975
  (0, 49301)	7.37407593121
  (0, 49292)	2.18417017703
  (0, 49224)	3.38596670193
  (0, 49147)	6.0
  (0, 49041)	3.40794560865
  (0, 49003)	22.2100968809
  (0, 49001)	5.74160535314
  (0, 48990)	16.8467729363
  (0, 48951)	4.72970144863
  (0, 48950)	4.93935194012
  (0, 48932)	3.9565115604
  (0, 48867)	7.04612032287
  (0, 48777)	1.41855034766
  (0, 48771)	13.6942100975
  (0, 48769)	6.23642898412
  (0, 48753)	1.29571424415
  (0, 48749)	3.19841940751
  (0, 48720)	1.16487464319
  (0, 48670)	2.19743194588
  :	:
  (499, 2710)	3.1202635362
  (499, 2688)	2.04412410338
  (499, 2670)	3.9565115604
  (499, 2611)	4.27016911926
  (499, 2468)	6.52146091786
  (499, 2439)	4.1700856607
  (499, 2415)	4.12263300785
  (499, 2413)	2.32033750431
  (499, 2388)	2.09661428601
  (499, 2358)	6.11599580975
  (499, 2290)	61.0
  (499, 2289)	7.55330245138
  (499
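
With `smooth_idf=False` and `norm=None`, scikit-learn documents the weight as the raw count multiplied by $\mathrm{idf}(t) = \ln(n / \mathrm{df}(t)) + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. The following is a minimal sketch of that formula on toy counts; `tfidf_unsmoothed` and the toy data are hypothetical names and values used only for illustration.

```python
import math

def tfidf_unsmoothed(count_rows):
    """Weight raw counts by idf = ln(n_docs / df) + 1, with no row normalisation."""
    n_docs = len(count_rows)
    df = {}
    for row in count_rows:
        for term in row:
            df[term] = df.get(term, 0) + 1  # number of documents containing the term
    idf = {term: math.log(n_docs / d) + 1.0 for term, d in df.items()}
    return [{term: count * idf[term] for term, count in row.items()} for row in count_rows]

# One dict of raw counts per document (hypothetical data)
toy_counts = [{"the": 6, "heaven": 2}, {"the": 3, "matrix": 1}]
weighted = tfidf_unsmoothed(toy_counts)
print(weighted[0])
```

A term that occurs in every document gets an idf of exactly 1 and so keeps its raw count, which would explain whole-number entries such as `6.0` and `61.0` in the output above.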

Next, let's apply SVD. Scikit-learn does not expose the internal details of the decomposition; we just use the [TruncatedSVD class](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) to directly get a matrix with k dimensions. Since the Brown corpus is fairly small, we'll use k=100.

In [6]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
brown_matrix_lowrank = svd.fit_transform(brown_matrix_tfidf)

print(brown_matrix_lowrank.shape)
print(brown_matrix_lowrank)

(500, 100)
[[ 242.5582075    22.94672559   -9.13296898 ...,   -6.23957857
     5.39073838   -6.7454443 ]
 [ 248.38544615   25.02117041  -21.15649044 ...,   -8.00926973
     6.68446001   -4.66650109]
 [ 236.70180717   24.01329017  -11.14248995 ...,   -6.4456885    -8.13839828
     3.96625212]
 ..., 
 [ 258.64888365 -113.70379768   26.56777272 ...,    7.37187006
    -0.96696039  -10.34101781]
 [ 291.34612775   12.99993635  -26.75489983 ...,   -4.27981877    5.24324
     4.38317943]
 [ 273.31546131  -31.90748229  -17.78595109 ...,   -5.03546029
    -6.22586026   -2.21868552]]


The returned matrix corresponds to the transformed documents, $U \Sigma$, from the truncated SVD factorisation $X \approx U \Sigma V^T$, where $X$ is `brown_matrix_tfidf`. Note that the resulting matrix is not sparse.
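
The relationship between $U \Sigma$ and a projection onto $V$ can be checked on a small dense matrix with NumPy's full SVD. This is a sketch under stated assumptions: the random matrix is a stand-in for the tf-idf matrix, and `np.linalg.svd` is used here instead of `TruncatedSVD`, which computes an approximation of the same factors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))      # toy stand-in for the tf-idf matrix: 6 documents, 4 terms
k = 2                       # number of dimensions to keep

U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_k = U[:, :k] * s[:k]   # U_k Sigma_k: one k-dimensional row per document
proj_k = X @ Vt[:k].T       # X projected onto the top-k right singular vectors

print(docs_k.shape)
print(np.allclose(docs_k, proj_k))
```

Because $X V_k = U_k \Sigma_k$, a new row in the original term space, such as a query, can be mapped into the same k-dimensional space by multiplying it by $V_k$.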

The last thing we'll do is build a very simple document retrieval system based on the vector space model we've built: it will take some query input, apply all the transformations we have defined above, then find the Brown document with the highest cosine similarity to the query. Here we compute the similarities directly with NumPy: the dot product of the query with each document vector, divided by that document's norm. The query's own norm is the same for every document, so leaving it out does not change which document scores highest.

In [12]:
import numpy as np

def transform_query(query_text):
    return svd.transform(transformer.transform(vectorizer.transform([get_BOW(query_text.split())])))[0]

def get_best_doc_num(query, m=brown_matrix_lowrank):
    # q . m[i] for every row i, divided (element-wise) by each row's 2-norm, sqrt(m[i] . m[i]);
    # the query's norm is constant across rows, so omitting it does not change the argmax
    sims = np.dot(m, query) / np.sqrt(np.einsum('ij,ij->i', m, m))
    best_doc = np.argmax(sims)
    return best_doc
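
The ranking step above divides by each document's norm but not by the query's. As a quick check that this leaves the ranking unchanged, the sketch below compares it with the full cosine similarity on a few toy vectors. `cosine_similarities` and the toy arrays are hypothetical and serve only as an illustration.

```python
import numpy as np

def cosine_similarities(m, query):
    """Cosine similarity between the query and every row of m."""
    row_norms = np.linalg.norm(m, axis=1)
    return (m @ query) / (row_norms * np.linalg.norm(query))

# Three toy document vectors and one query (hypothetical data)
toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_query = np.array([0.1, 0.3])

full = cosine_similarities(toy_docs, toy_query)
partial = (toy_docs @ toy_query) / np.linalg.norm(toy_docs, axis=1)  # query norm omitted

print(np.argmax(full), np.argmax(partial))
```

The two score vectors differ only by a constant positive factor (the query norm), so they always pick the same document.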


Let's test this out with a couple of sets of keywords, with the idea of getting a religious text in the first example and a mathematics text in the second (the Brown corpus has both). We'll also look at the specific vectors and similarities involved.

In [13]:
def try_query(query_text):
    query = transform_query(query_text)
    doc_num = get_best_doc_num(query)
    doc_vec = brown_matrix_lowrank[doc_num]
    doc_text = brown.words(brown.fileids()[doc_num])
    print("query text")
    print(query_text)
    print("query vector")
    print(query)
    print("best document vector")
    print(doc_vec)
    print("cosine similarity from query to document")
    print(np.dot(doc_vec, query) / np.sqrt(np.dot(doc_vec,doc_vec) * np.dot(query,query)))
    print("best document sample")
    print(doc_text[:50])

try_query("heaven hell devil lord")
try_query("matrix algebra eigenvalue")

query text
heaven hell devil lord
query vector
[  2.54452687e-02  -7.56679974e-02   1.41123579e-02   2.74106732e-05
   4.26817295e-02  -1.00728318e-01  -4.57309759e-02   3.57125996e-02
   5.52501704e-02   4.08454940e-02   2.68765592e-02  -4.91123930e-02
   7.18247112e-02   4.36630663e-03  -4.13951850e-02   4.52565463e-02
  -1.23907966e-01   1.44283934e-02   3.10530881e-03   2.71884851e-02
  -7.96231396e-02   8.05651038e-03   9.33174406e-02  -3.21580624e-02
   7.70688069e-02  -1.50288779e-02   2.54168433e-03  -2.92396752e-02
   1.29587920e-01  -5.56075600e-02  -6.85467199e-02   8.06581653e-03
   5.60641633e-02   3.48097625e-02  -8.53515946e-02   1.19141345e-02
   1.10413528e-02  -1.30118767e-03   2.58580887e-02  -4.56148746e-02
   8.02522735e-02  -1.40909593e-01  -2.94006023e-02  -6.41316680e-02
  -4.85668189e-02  -3.77222269e-02   1.79331027e-02  -5.39532233e-02
  -7.60884593e-02  -9.01020597e-02  -6.36670784e-02   1.41391452e-01
  -4.41859967e-02  -8.52973446e-02   5.05782326e-02  -7.