  {"id":359,"date":"2022-04-03T14:58:42","date_gmt":"2022-04-03T14:58:42","guid":{"rendered":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/?p=359"},"modified":"2022-05-06T14:04:19","modified_gmt":"2022-05-06T14:04:19","slug":"vector-spaces-and-teaching-your-computer-to-read","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/2022\/04\/03\/vector-spaces-and-teaching-your-computer-to-read\/","title":{"rendered":"Vector Spaces and Teaching Your Computer to Read"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<p>The key issue in using text data is the sheer number of words we have to learn about! To make matters worse, we do not have the same amount of information about each word. This is because the relative frequencies of words are incredibly skewed \u2013 in a given corpus, only a small number of words will make up the majority of the word count. We\u2019ll illustrate this with an example corpus: the Sherlock Holmes series. This has a total word count around half a million, with just over 17500 unique words. The sorted frequencies for each word are plotted below, and you can already see that the distribution is heavily skewed towards the more common words. 
In fact, the 100 most common words make up almost 60% of the total word count, while half of the words in the vocabulary appear only once or twice.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"556\" height=\"371\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockwordfrequencies.png\" alt=\"\" class=\"wp-image-360\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockwordfrequencies.png 556w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockwordfrequencies-300x200.png 300w\" sizes=\"auto, (max-width: 556px) 100vw, 556px\" \/><\/figure><\/div>\n\n\n\n<p>This issue persists even for very large text corpora (the Oxford English Corpus has 2 billion words and the top 100 words still make up half of that count), and causes enormous problems for dealing with text data since we have lots of information about a handful of words and only a little about the rest. Modern language models usually handle this problem by being careful about how they choose to represent words. Simple methods treat each word as its own distinct entity: a string of characters or an index in a dictionary. In reality, though, words are related to each other, and a more informative representation would capture similarities and differences in word meanings.<\/p>\n\n\n\n<p> Word embeddings (or word vectors) are representations of words as points in a vector space, where words with similar meanings are represented by points that are close together. 
This reduces the dimensionality of text datasets and makes it possible to transfer knowledge between words with similar meanings, effectively increasing the amount of data we have about each word.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Example &#8211; Food Vectors<\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>To illustrate this, here is an example of how you might represent foods as points (or vectors) in 2D space:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodvectors.png\" alt=\"\" class=\"wp-image-361\" width=\"513\" height=\"429\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodvectors.png 1026w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodvectors-300x251.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodvectors-1024x856.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodvectors-768x642.png 768w\" sizes=\"auto, (max-width: 513px) 100vw, 513px\" \/><\/figure><\/div>\n\n\n\n<p>Here, I\u2019ve decided the two key pieces of information about any foodstuff are temperature (hot\/cold) and state (solid\/liquid). This places meals that you\u2019d consume in similar situations close together in space. If we measure vector similarity by the cosine similarity (the cosine of the angle between them), we can compute a score for how similar certain words are on a scale from 1 (same meaning) to -1 (opposite meanings). 
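Cosine similarity is straightforward to compute directly from the coordinates. Below is a minimal Python sketch; the 2D coordinates for each food are made-up illustrative values (temperature on the first axis, solid/liquid on the second), not values taken from the figure:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u||v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical (temperature, solid/liquid) coordinates.
soup = (0.9, -0.8)   # hot, liquid
stew = (0.8, -0.6)   # hot, liquid, slightly thicker
salad = (-0.9, 0.8)  # cold, solid: exactly opposite to soup here

print(cosine_similarity(soup, stew))   # close to 1
print(cosine_similarity(soup, salad))  # exactly -1 for these coordinates
```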
In our example, similarity(soup, stew) = cos(10\u00b0) \u2248 0.98, while similarity(soup, salad) = cos(180\u00b0) = -1.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"856\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles2-1024x856.png\" alt=\"\" class=\"wp-image-362\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles2-1024x856.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles2-300x251.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles2-768x642.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles2.png 1026w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"856\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles1-1024x856.png\" alt=\"\" class=\"wp-image-363\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles1-1024x856.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles1-300x251.png 300w, 
https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles1-768x642.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodangles1.png 1026w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure><\/div>\n<\/div>\n<\/div>\n\n\n\n<p>This representation also gives rise to some interesting observations, since mathematical operations like addition and subtraction have natural definitions for vectors. For example, considering the words as vectors gives sense to the sum \u201cyoghurt \u2013 cold + hot\u201d, which has the answer\u2026<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodsums-1024x856.png\" alt=\"\" class=\"wp-image-364\" width=\"512\" height=\"428\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodsums-1024x856.png 1024w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodsums-300x251.png 300w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodsums-768x642.png 768w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/foodsums.png 1026w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/figure><\/div>\n\n\n\n<p>soup.<\/p>\n\n\n\n<p>Obviously, this representation is not perfect (is there a meaningful difference between soup and hot yoghurt?), but it\u2019s not hard to imagine that if we increased the number of dimensions &#8211; by adding on extra directions like sweet\/savoury, for example &#8211; it would be good enough to represent most of the meaningful 
differences and similarities between foods. In practice, when we use a model to learn word embeddings, the individual co-ordinates do not correspond to easily understood concepts like they did in the example above. However, we can still find interesting linear relationships between words: relationships like \u201cking \u2013 man + woman = queen\u201d still hold, and directions can be found that correspond to grammatical ideas like tense, so that \u201ceat + &lt;past tense&gt; = ate\u201d.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">word2vec<\/h3>\n\n\n\n<p><\/p>\n\n\n\n<p>The key question now is how to find such a representation! This is usually done by training a model for some classification task that is easy to evaluate, and using the resulting fitted parameters as word embeddings. Perhaps the most famous example is word2vec. In its \u201cskip-gram with negative sampling\u201d variant, the prediction task is to estimate how likely any given two words are to appear near each other in a sentence. The motivation here is that words are likely to have similar meanings if they appear in similar contexts &#8211; if I tell you \u201cI ate phlogiston for dinner\u201d, you\u2019ll be able to tell from context (proximity to the words <em>ate<\/em> and<em> dinner<\/em>) that <em>phlogiston <\/em>is most likely a food and could hazard a guess at how it would be used in other sentences.<\/p>\n\n\n\n<p>We can easily get positive examples from the text by taking pairs of words that did appear near each other, and negative examples by randomly selecting some noise words. 
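A sketch of how such a training set might be generated from a tokenised sentence is below. This is only illustrative: real word2vec implementations draw noise words from a smoothed unigram distribution rather than uniformly, and the toy `skipgram_examples` helper is not part of any library.

```python
import random

def skipgram_examples(tokens, window=2, n_negative=2, seed=0):
    """Build (centre, context, label) pairs: label 1 for words that
    genuinely co-occur within the window, label 0 for sampled noise words."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((centre, tokens[j], 1))  # positive example
            for _ in range(n_negative):              # negative (noise) examples
                examples.append((centre, rng.choice(vocab), 0))
    return examples

pairs = skipgram_examples("i ate phlogiston for dinner".split())
```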
The embeddings are fitted to maximise classification accuracy on this dataset, and the model is designed so that the resulting embeddings have high cosine similarity if the words they represent appear in similar contexts.<\/p>\n\n\n\n<p>We\u2019ll use the Sherlock Holmes corpus again as an example, using the <a href=\"https:\/\/radimrehurek.com\/gensim\/auto_examples\/tutorials\/run_word2vec.html\">Gensim implementation<\/a> for python to learn 100-dimensional word vectors. The results are more easily visualised by projecting them into 2 dimensions with principal component analysis &#8211; the resulting projection preserves some of the structure of the vector space, including some clusters of similar words. See below for a visualisation of 100 common words from the dataset:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"882\" height=\"955\" src=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockvectorsblog.png\" alt=\"\" class=\"wp-image-365\" srcset=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockvectorsblog.png 882w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockvectorsblog-277x300.png 277w, https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/04\/sherlockvectorsblog-768x832.png 768w\" sizes=\"auto, (max-width: 882px) 100vw, 882px\" \/><\/figure><\/div>\n\n\n\n<p>Some clusters of words with high cosine similarity have been highlighted. The model did well at grouping words with similar meanings together \u2013 some examples are listed in the table below. 
The model was also reasonably successful at grouping words by syntactic meaning &#8211; nouns, adjectives, and verbs were usually grouped together, with verbs even grouped by tense as well as meaning.<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th>Word<\/th><th>Most similar to:<\/th><\/tr><\/thead><tbody><tr><td>you<\/td><td>ye (0.85), yourselves (0.83)<\/td><\/tr><tr><td>say<\/td><td>saying (0.87), bet (0.85)<\/td><\/tr><tr><td>said<\/td><td>answered (0.88), cried (0.83)<\/td><\/tr><tr><td>brother<\/td><td>father (0.88), son (0.88)<\/td><\/tr><tr><td>sister<\/td><td>mother (0.93), wife (0.92)<\/td><\/tr><tr><td>coat<\/td><td>overcoat (0.94), waistcoat (0.94)<\/td><\/tr><tr><td>crime<\/td><td>murder (0.87), committed (0.85)<\/td><\/tr><tr><td>holmes<\/td><td>macdonald (0.76), mcmurdo (0.75)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Of course, those were the highlights of this particular set of embeddings \u2013 this corpus is actually on the small side so the semantic similarities found for some words were complete nonsense. The most useful thing about word embeddings, however, is how transferable they are across corpora \u2013 it is common to use word embeddings trained on a big corpus in a language model for a small corpus, and many sets of pre-trained word embeddings are available for this purpose.<\/p>\n\n\n\n<p>An interesting property of word embeddings is that the embedding spaces for different languages often share a similar structure &#8211; embeddings trained with word2vec for different languages have a similar geometric structure, and it is even possible to learn a linear map between the embedding spaces that allows for translation of words. 
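The mechanics of such a translation map can be sketched in a few lines of NumPy. The vectors below are made-up 2D stand-ins for real trained embeddings, and the word pairs form a toy seed dictionary; the point is only how the matrix W is fitted by least squares and then used to translate by nearest cosine neighbour.

```python
import numpy as np

# Hypothetical 2D "embeddings" standing in for real trained vectors.
en = {"one": [1.0, 0.1], "two": [0.9, 0.3], "cat": [0.1, 1.0]}
es = {"uno": [0.1, 1.0], "dos": [0.3, 0.9], "gato": [1.0, 0.1]}

# Seed dictionary of known translations, used to fit the linear map W
# by minimising ||X W - Z||^2 with least squares.
seed_pairs = [("one", "uno"), ("two", "dos"), ("cat", "gato")]
X = np.array([en[a] for a, _ in seed_pairs])
Z = np.array([es[b] for _, b in seed_pairs])
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(word):
    """Project an English vector into the Spanish space, then return
    the nearest Spanish word by cosine similarity."""
    v = np.asarray(en[word]) @ W
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(es, key=lambda w: cos(v, np.asarray(es[w])))

print(translate("cat"))
```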
The matrix for this map can be trained using a list of translations for some common words, and the resulting projections are surprisingly effective at translating words, even allowing for the detection of errors in a translation dictionary.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Further Reading<\/h3>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/web.stanford.edu\/~jurafsky\/slp3\/\">Speech and Language Processing<\/a> (Chapter 6: Vector Semantics and Embeddings) &#8211; Dan Jurafsky and James H. Martin<\/li><li><a href=\"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-content\/uploads\/sites\/40\/2022\/01\/Statistics-and-Data-Science-for-Text-Data-Connie-Trojan.pdf\">Statistics and Data Science for Text Data<\/a> \u2013 Connie Trojan (my dissertation)<\/li><li><a href=\"https:\/\/arxiv.org\/abs\/1310.4546\">Distributed Representations of Words and Phrases and their Compositionality<\/a> &#8211; Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean<\/li><li><a href=\"https:\/\/arxiv.org\/abs\/1309.4168\">Exploiting Similarities among Languages for Machine Translation<\/a> &#8211; Tomas Mikolov, Quoc V. Le, and Ilya Sutskever<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>This post is about finding representations of words as points in a vector space, where words with similar meanings are represented by points that are close together. 
This representation will give mathematical meaning to relationships like \u201cking \u2013 man + woman = queen\u201d.<\/p>\n","protected":false},"author":43,"featured_media":422,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"slim_seo":{"title":"Vector Spaces and Teaching Your Computer to Read - Connie Trojan","description":"This post is about finding representations of words as points in a vector space, where words with similar meanings are represented by points that are close toge"},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-359","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/posts\/359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/users\/43"}],"replies":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/comments?post=359"}],"version-history":[{"count":13,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/posts\/359\/revisions"}],"predecessor-version":[{"id":479,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/posts\/359\/revisions\/479"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/media\/422"}],"wp:attachment":[{"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/media?parent=359"}],"wp:term":[{"taxonomy":
"category","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/categories?post=359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/connie-trojan\/wp-json\/wp\/v2\/tags?post=359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}