Abstract
Tracing semantic patterns over time on the basis of texts is still in its infancy. Most approaches build on a linguistic principle which states that the meanings of words are determined ‘by the company they keep’. In other words, meanings arise from contexts defined as distributions of words, which suggests that we can trace meanings over time by examining changing contexts. Topic modelling is at this moment the only technique based on the principle of word distributions that has gone beyond an experimental stage and has proven its value by achieving results that domain experts (in this case historians not necessarily involved in computer-assisted research) recognize.
This paper discusses a new tool, dubbed the ‘Frame Generator’, aimed at meaningfully reducing a set of (possibly thousands of) Dutch texts to word patterns that cut across the distributions generated by topic modelling, thus providing additional insight into the content of the dataset. The method implemented builds on topic modelling by combining it with two other proven techniques: (1) the automatic extraction of keywords and (2) the identification of collocates. The Python source code of the tool, offering a command line interface, is available for download on GitHub (https://github.com/jlonij/frame-generator). An online demo with a graphical user interface, showcasing the tool’s main functionality for a small dataset, can be found at http://kbresearch.nl/frames/.
The Frame Generator was developed to assist in the investigation of popular perspectives on the concept of ‘Europe’ arising from the KB collection of Dutch historical newspapers. To this end, a dataset was prepared of articles that mentioned the word ‘Europe’ at least once. Two subsets of articles were then selected from this dataset. The first was selected on the basis of (Dutch-language) synonyms for the words ‘unity’ and ‘unification’ (such as ‘integration’, ‘agreement’, ‘settlement’, ‘consensus’, ‘treaty’, ‘harmony’, etc.). This subset was assumed to contain news articles that discuss Europe as a unified political / cultural / economic entity, or as an entity involved in a process of unification. The second subset was based on synonyms for competitions (such as ‘match’, ‘prize’, ‘winner’, ‘cup’, etc.); this subset was assumed to contain articles on sports and other competitions.
The Frame Generator process of analyzing these datasets consists of four stages. The first stage concerns the pre-processing of the dataset. During this stage the dataset is cleaned by normalizing spelling variations and correcting OCR errors on the basis of user-provided lists of regular expressions and their replacements. In addition, the dataset is tokenized, lemmatized and part-of-speech tagged with the Natural Language Processing suite Frog (https://languagemachines.github.io/frog/). The user has the option of splitting larger documents into smaller units of analysis by specifying the maximum number of sentences to be contained in each unit.
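The cleaning and splitting steps of this first stage can be illustrated with a minimal sketch. It is not the Frame Generator's own code: the replacement pairs, the function names `clean` and `split_units`, and the parameter `max_sentences` are hypothetical, and tokenization, lemmatization and tagging with Frog are left out.

```python
import re

# Hypothetical (pattern, replacement) pairs; in the tool these lists are user-provided.
REPLACEMENTS = [
    (r"\bEuropeesche\b", "Europese"),  # older spelling -> modern spelling
    (r"\bEuropeesch\b", "Europees"),   # older spelling -> modern spelling
    (r"\bEur0pa\b", "Europa"),         # OCR confusion of 'o' and '0'
]


def clean(text, replacements=REPLACEMENTS):
    """Apply each regular expression replacement to the text, in list order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


def split_units(sentences, max_sentences):
    """Split a list of sentences into consecutive units of at most max_sentences."""
    if max_sentences < 1:
        raise ValueError("max_sentences must be at least 1")
    return [sentences[i:i + max_sentences]
            for i in range(0, len(sentences), max_sentences)]


cleaned = clean("De Europeesche eenheid in Eur0pa")
units = split_units(["s1", "s2", "s3", "s4", "s5"], max_sentences=2)
```

Applying the replacements in a fixed order keeps the result predictable when patterns overlap, which is why the sketch uses a list of pairs rather than a dictionary of independent rules.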
The second stage in the process is topic modelling, which generates specific, substantive themes or topics based on frequently recurring distributions of words. The Frame Generator offers two methods of topic modelling: one based on Mallet (http://mallet.cs.umass.edu), the other on the Gensim topic modelling library (https://radimrehurek.com/gensim/). The user can control the number of topics generated and the number of words making up each topic by means of command line arguments. This stage also involves the manual, hermeneutic interpretation of the topics based on historical domain knowledge.
The third stage focuses on the extraction of a single, ranked list of keywords from the set of topics resulting from the previous stage. The relevance of each word occurring in the set of topics is determined by taking the sum of the probability scores for the word over all topics in which it occurs. A word is accorded the status of keyword if its score reaches a certain threshold, set at the discretion of the researcher. The Frame Generator can also produce a keyword list on the basis of tf-idf scores, thus allowing the researcher to compare the results of different approaches. The option is available to restrict the candidates for the keyword list to words with specific part-of-speech tags. The keywords thus obtained may be regarded as core elements in a series of thematically uniform texts; their significance arises from the frequency of their occurrence within as well as across topics.
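The scoring rule described above (summing a word's probability over all topics in which it occurs and keeping the words that reach a threshold) can be sketched as follows. The function name `extract_keywords`, the representation of topics as word-to-probability dictionaries, and the example values are assumptions made for illustration; the tf-idf alternative and the part-of-speech filter are not shown.

```python
from collections import defaultdict


def extract_keywords(topics, threshold):
    """Rank words by their summed probability over all topics.

    topics: list of dicts mapping a word to its probability within one topic.
    Returns (word, score) pairs with score >= threshold, highest score first.
    """
    scores = defaultdict(float)
    for topic in topics:
        for word, probability in topic.items():
            scores[word] += probability
    # Sort by descending score; break ties alphabetically for a stable ranking.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(word, score) for word, score in ranked if score >= threshold]


# Made-up topic distributions, for illustration only.
topics = [
    {"verdrag": 0.25, "eenheid": 0.125, "raad": 0.0625},
    {"eenheid": 0.25, "markt": 0.125},
]
keywords = extract_keywords(topics, threshold=0.125)
```

In this toy input ‘eenheid’ ranks first because it occurs in both topics, which reflects the point that a keyword's significance arises from its occurrence across topics as well as within them.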
The fourth and final stage of the analysis process consists of contextualising the keywords by finding collocates in the texts from which they were originally extracted. The user sets a maximum word distance from the keyword as well as the direction (left, right, or both) in which collocates must occur in order to qualify. As with the extraction of keywords, the option to include only specific part-of-speech tags is also provided for collocates. The set of collocates thus gathered for a given keyword is called a ‘frame’. The words appearing in a frame are ordered by the frequency of their co-occurrence with and their distance to the keyword with which they are associated, expressing their significance in framing a specific keyword.
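A minimal sketch of this collocate search is given below. The abstract does not specify how frequency and distance are combined into one ordering, so the sketch assumes a simple rule: higher co-occurrence frequency first, then smaller mean distance. The function name `build_frame` and its parameters are hypothetical, and the part-of-speech filter is omitted.

```python
from collections import defaultdict


def build_frame(tokens, keyword, max_distance, direction="both"):
    """Collect the collocates of keyword within max_distance tokens.

    Returns (collocate, frequency, mean_distance) tuples, ordered by
    descending frequency and then ascending mean distance (an assumed rule).
    """
    if direction not in ("left", "right", "both"):
        raise ValueError("direction must be 'left', 'right' or 'both'")
    distances = defaultdict(list)
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        # Window bounds depend on the direction in which collocates may occur.
        start = i - max_distance if direction in ("left", "both") else i + 1
        end = i + max_distance if direction in ("right", "both") else i - 1
        for j in range(max(start, 0), min(end, len(tokens) - 1) + 1):
            if j != i:
                distances[tokens[j]].append(abs(j - i))
    frame = [(word, len(ds), sum(ds) / len(ds)) for word, ds in distances.items()]
    return sorted(frame, key=lambda row: (-row[1], row[2], row[0]))


# Made-up sequence of lemmas, for illustration only.
tokens = ["europees", "eenheid", "vereisen", "politiek", "eenheid", "europees"]
frame = build_frame(tokens, "eenheid", max_distance=1)
left_frame = build_frame(tokens, "eenheid", max_distance=1, direction="left")
```

Keeping the raw distances per collocate, rather than only a count, leaves room to try other orderings, for example weighting each co-occurrence by the inverse of its distance.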
The results of each of these stages are saved and accessible to the user in the form of comma-separated values (CSV) files. These can, for example, be used to visualise the graph of the keywords and their collocates in an application such as Gephi (https://gephi.org) in order to facilitate the interpretation of the results. By creating such network graphs for the Frame Generator results for a number of different time periods (see Figure 1 for an example) we found that newspaper reporting on ‘European unity’, while showing a remarkable degree of continuity, became less rich rhetorically, less international, and more focused on institutional technocracy than on intra-continental relations over the course of the twentieth century.
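To indicate how keyword and collocate data might be prepared for a network view, the sketch below writes an edge list with the column names `Source`, `Target` and `Weight`, which Gephi's spreadsheet import recognises. The layout of the CSV files the Frame Generator itself produces is not described here, so the function `write_edges` and the structure of `frames` are assumptions.

```python
import csv
import io


def write_edges(frames, handle):
    """Write keyword-collocate pairs as a weighted edge list.

    frames: dict mapping a keyword to a list of (collocate, frequency) pairs.
    handle: any writable text file object.
    """
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for keyword, collocates in frames.items():
        for collocate, frequency in collocates:
            writer.writerow([keyword, collocate, frequency])


# Made-up frame data, for illustration only.
frames = {"eenheid": [("europees", 2), ("politiek", 1)]}
buffer = io.StringIO()
write_edges(frames, buffer)
edges_csv = buffer.getvalue()
```

Writing to a file instead only requires passing a handle opened with `open(path, "w", newline="", encoding="utf-8")`.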
This paper hypothesises that the Frame Generator, by laying bare the fundamental patterns in sets of thematically coherent texts, enables historians to better determine continuities and discontinuities in expressions of public opinion. The Frame Generator’s performance depends on that of its constituent tools (such as topic modelling), which have been described in the literature. Its advantages include its adaptability to other languages (given the availability of part-of-speech tagging), its flexibility (the user can set all variables) and its ‘all-in-one’ packaging (it requires no programming skills while generating not just frames but also keywords and topics). For domain experts (historians) the proof of the pudding will be in the eating: does this particular combination of tools – topic modelling, keyword extraction and identification of keyword collocates – offer useful results? The question can only be answered by running the tool on a variety of relatively homogeneous datasets.
Original language | English |
---|---|
Title of host publication | Conference DH Benelux 2017 |
Publication status | Published - 2017 |