DemocraSci

A research platform for Data-Driven Democracy Studies in Switzerland

Abstract

Both the social sciences and the humanities are currently shifting from classical research methodologies (such as surveys or close reading) to data science techniques. However, the emerging research areas of social data science and digital humanities are still impeded by a lack of easily accessible, structured data. At the same time, large amounts of valuable records are held in archives and libraries, but often in formats that are not suitable for data-driven research. Efforts to digitize and structure these records are often undertaken in an improvised and isolated way; in other words, the wheel is reinvented for every such project. One example is the archive of the Swiss parliament. It compiles all parliamentary proceedings from 1890 to the present, and therefore constitutes an extremely valuable corpus for political scientists. However, although the documents are digitized, it is still quite difficult for researchers to extract comprehensive information from them.

Read the article about this project on our blog:

A Trip Through Swiss Politics and History

Description

Problem:

  • Develop a scalable and re-usable data processing chain to extract structured information from archival records.

  • Apply it to a large corpus of scanned proceedings of the Swiss parliament spanning 125 years of Swiss history, which is made available by the Swiss Federal Archive.

  • Develop user-friendly, interactive data analysis and visualization tools to promote the use of the resulting data set by political scientists and the public.

Solution:

A three-step workflow that tackles the following problems:

  • Data preprocessing: from layout analysis to entity extraction. The product will be a structured database of the parliamentary proceedings.

  • Natural language analysis: for topic modeling, named entity disambiguation, etc.

  • Knowledge graph construction: all the previous results will enable the construction of a knowledge graph which, in the context of the parliamentary proceedings addressed in this project, links entities such as members of parliament, political parties and parliamentary groups, committees, Swiss cantons and cities, policy topics, and legislative processes. This will allow political researchers to parse the information more easily, analyze network dynamics, make predictions on the graph, etc.

More details on each of these steps are given in the section Detailed overview of the project.

Impact:

The resulting research platform will be of great value to political scientists, historians, social scientists, and computer scientists. It will create new avenues for data-driven research on topics like political polarization, party cohesion, government formation, strategic behavior, political representation, and party formation. It will allow historians to reconstruct a quantitative account of Swiss political history over the last 125 years. It will enable sociologists to link changing fault lines in the Federal Assembly to shifts in socioeconomic factors. It will provide resources for data-driven journalists. Finally, it will give computer scientists a multi-lingual ground truth dataset, with possible applications in opinion mining and machine translation. In addition, the data processing chain developed to extract structured information from unstructured, scanned records is of interest beyond political science. We see great potential, for example, in the processing and analysis of medical records in health applications and in the mining of historical documents in digital humanities. Such methods are of growing importance for researchers in the ETH domain, and the project will thus foster the SDSC’s attractiveness for those researchers.

Detailed overview of the project

In this project, we first aim to process the records of the Swiss parliament into a format that is amenable to further research. To tackle this task, we have developed a pipeline comprising the following steps, as illustrated in Figure 1:

  1. A preprocessing step to clean the pages, correctly order the text lines, identify separating lines for further layout analysis, etc.
  2. A general methodology for element classification that assists the annotator while labelling the training set. By labelling only a small subset of all the pages in the corpus (more than 200,000), this methodology yields a prediction for every element (text box) of the dataset, as required for the complete extraction of the structured information. The implemented classification dashboard extracts features from the XML files associated with the PDF documents and from the text of each text box. The classification method uses these features in two ways: first, to suggest labels during annotation in order to accelerate the process; and second, once training is finished, to predict the labels of all elements.
  3. A post-processing step that uses the predicted labels of all elements (text boxes) on all pages of the corpus to group paragraphs into three types of blocks: speeches, laws and votes. These are the three main entities that make up our structured dataset.
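
The last two steps above can be illustrated with a minimal sketch. It is not the project's implementation: the three features per text box (font size, bold flag, left margin), the example values, and the nearest-centroid classifier are all assumptions chosen for brevity. The source only states that features come from the XML files and the text, and that labelled elements are grouped into speeches, laws and votes.

```python
from itertools import groupby
from math import dist

# Hypothetical annotated text boxes: (font size, bold flag, left margin), label.
LABELLED = [
    ((10.0, 0.0, 50.0), "speech"),
    ((10.0, 0.0, 52.0), "speech"),
    ((9.0, 1.0, 50.0), "law"),
    ((9.0, 1.0, 48.0), "law"),
    ((8.0, 0.0, 120.0), "vote"),
    ((8.0, 0.0, 118.0), "vote"),
]


def fit_centroids(examples):
    """Average the feature vectors of each label (nearest-centroid model)."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(value / counts[label] for value in acc)
        for label, acc in sums.items()
    }


def suggest_label(features, centroids):
    """Suggest the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))


def group_blocks(boxes, centroids):
    """Merge consecutive text boxes that share a predicted label into one block."""
    predicted = [(suggest_label(features, centroids), text) for features, text in boxes]
    return [
        (label, " ".join(text for _, text in run))
        for label, run in groupby(predicted, key=lambda item: item[0])
    ]


centroids = fit_centroids(LABELLED)
page = [
    ((10.0, 0.0, 51.0), "First paragraph of a speech."),
    ((10.0, 0.0, 50.0), "Second paragraph."),
    ((8.0, 0.0, 119.0), "Result of the vote."),
]
blocks = group_blocks(page, centroids)
```

In this sketch the same function serves both purposes described in step 2: during annotation its output would be shown as a suggestion, and after training it would be applied to every remaining text box.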

The information extracted from the documents, grouped into the three aforementioned entities, is fed into a graph database together with additional metadata such as the bill being discussed, demographic information on parliament members (party, canton, age, gender, …), legislative year, chamber, etc. The knowledge graph (KG) associated with the graph database enables a more flexible representation of the data, as the information can easily be updated when new entities are extracted from the associated text through different types of analyses. In addition, by using the query language Cypher, stakeholders such as political and social scientists, journalists, etc., can parse the information more easily, perform network dynamics analysis and/or predictions on the graph, check relevant historical information, etc.
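
To give a flavour of such a query, the sketch below builds a parameterised Cypher statement in Python. The graph schema is not described here, so every node label (MP, Speech, Bill), relationship type (GAVE, ABOUT) and property name (party, year, topic) is a hypothetical placeholder, as is the function name.

```python
def speeches_per_party_query(topic, start_year, end_year):
    """Return a parameterised Cypher query and its parameters.

    All labels, relationship types and property names are hypothetical;
    they stand in for whatever schema the actual knowledge graph uses.
    """
    query = (
        "MATCH (m:MP)-[:GAVE]->(s:Speech)-[:ABOUT]->(b:Bill) "
        "WHERE b.topic = $topic "
        "AND s.year >= $start_year AND s.year <= $end_year "
        "RETURN m.party AS party, count(s) AS speeches "
        "ORDER BY speeches DESC"
    )
    # Values are passed separately instead of being pasted into the query text.
    params = {"topic": topic, "start_year": start_year, "end_year": end_year}
    return query, params


query, params = speeches_per_party_query("army", 1950, 1960)
```

The query text and the parameter dictionary would then be handed to whichever Cypher-capable client is used to reach the graph database. Keeping the values in parameters rather than in the query string avoids quoting problems and lets the same query be reused for other topics and periods.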

Extracting the information from the PDF documents is one of the main aims of the current project, as this alone already yields a database that is unique in its depth and time span and in the possibilities it offers for historical and time series analyses. Nevertheless, once the information is structured and fed into the KG, several additional analyses will allow us to answer a variety of research questions:

  • By performing topic modelling, through methods such as dynamic latent Dirichlet allocation, we obtain a meaningful set of detailed topics and a description of each document as a mixture of specific topics. We also capture the temporal evolution of the topics discussed, their composition, how different terms gain or lose importance over time, etc. By integrating this information into the knowledge graph, linked to specific documents and/or bills, we intend to answer the following research questions: How does the relation of specific political parties to concrete topics (e.g. army, ecology, international relations, etc.) change over time? What is the profile of the members of parliament (MPs) who support specific bills, depending on the bills' topics? Which historical events have had the largest influence on the course of the Swiss parliament?
  • Techniques for named entity recognition will further complement the KG with additional entities related to locations and organizations. It will then be possible to pose interesting research questions such as: What is the relation of specific political parties or MPs to different companies and/or sectors? Which Swiss cantons are cited most often in the chambers, and by which parties?
  • By applying different graph data science methods to the KG, we can also explore a varied set of questions. For example, using community detection methods, we can understand the main features that characterize different parties or MPs, and the main traits that do or do not cluster them together. Through an analysis of the relations in the graph, networks of political support could be extracted, helping to analyze how MPs from different political parties in the Swiss parliament support each other on different bills, depending on their interests.
  • The analysis of the speeches associated with each MP and bill will enable the investigation of a varied set of research questions. So far, we have explored the extraction of a populism index from the speeches, using semantic role labelling. This has allowed us to identify excerpts from politicians' speeches that can clearly be tagged as populist. These results are quite interesting, as populism is mostly considered a phenomenon of recent years. Moreover, this is just one example of how fine-grained information can be extracted from the speeches and then used to further enrich the graph and answer specific research questions.
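
As an illustration of the support networks mentioned in the list above, the following sketch counts how often two MPs from different parties voted in favour of the same bill. It is only one possible reading of "support", assumed here for simplicity, and the vote records, names and function are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical vote records: (bill id, MP, party, vote).
VOTES = [
    ("bill-1", "MP A", "Party X", "yes"),
    ("bill-1", "MP B", "Party Y", "yes"),
    ("bill-1", "MP C", "Party Y", "no"),
    ("bill-2", "MP A", "Party X", "yes"),
    ("bill-2", "MP B", "Party Y", "yes"),
    ("bill-2", "MP C", "Party Y", "yes"),
]


def cross_party_support(votes):
    """Count, for each pair of MPs from different parties, the bills both supported."""
    supporters = {}
    for bill, mp, party, vote in votes:
        if vote == "yes":
            supporters.setdefault(bill, []).append((mp, party))

    edges = Counter()
    for members in supporters.values():
        # Sorting gives each pair a stable orientation, so (A, B) and (B, A) coincide.
        for (mp1, party1), (mp2, party2) in combinations(sorted(members), 2):
            if party1 != party2:
                edges[(mp1, mp2)] += 1
    return edges


network = cross_party_support(VOTES)
```

Each key of the resulting counter is an edge between two MPs and each value is its weight, which is the form a community detection method or a graph database import would typically expect.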

These are just some examples of the analyses that could be carried out thanks to the depth, extent and richness of the generated structured dataset. Other natural language tools could be applied to study other phenomena, making the dataset a valuable source for researchers of historical, political and social phenomena. In addition, by structuring the data and releasing it publicly, we continue the effort to provide easy access to what the Swiss parliament is doing and how it relates to historical decisions, which helps citizens be better informed and thereby strengthens democracy.

Publications

  • Luis Salamanca, Laurence Brandenberger, Fernando Pérez-Cruz, Frank Schweitzer. (2021) Towards a Parliamentary Database—Processing over 40.000 PDF Documents and Structuring their Content. In preparation

  • Daria Izzo, Luis Salamanca, Laurence Brandenberger, Fernando Pérez-Cruz, Frank Schweitzer. (2021) Analysis of Populism in the Swiss Parliament throughout history. In preparation