DemocraSci

A research platform for Data-Driven Democracy Studies in Switzerland

Abstract

Both the social sciences and the humanities are currently shifting from classical research methodologies (such as surveys or close reading) to the adoption of data science techniques. However, the emerging research areas of social data science and digital humanities are still impeded by a lack of easily accessible, structured data. At the same time, large amounts of valuable records are stored in archives and libraries, but often in formats that are not suitable for data-driven research. Efforts to digitize and structure these records are typically undertaken in an improvised and isolated way; in other words, the wheel is reinvented for every such project. One example is the Swiss parliament archives, which compile all parliamentary proceedings from 1890 to the present and therefore constitute an extremely valuable corpus of information for political scientists. However, even though the documents are digitized, it is still quite difficult for researchers to extract comprehensive and exhaustive information from them.

Read the article about this project on our blog:

A Trip Through Swiss Politics and History

Description

Problem:

  • Develop a scalable and reusable data processing chain to extract structured information from archival records.

  • Apply it to a large corpus of scanned proceedings of the Swiss parliament spanning 125 years of Swiss history, made available by the Swiss Federal Archives.

  • Develop user-friendly, interactive data analysis and visualization tools to promote the use of the resulting data set by political scientists and the public.

Proposed approach:

A three-step workflow in which we tackle the following problems:

  • Data preprocessing: from layout analysis to entity extraction. The product of this step will be a structured database of the parliamentary proceedings.

  • Natural language analysis: for topic modeling, named entity disambiguation, etc.

  • Knowledge graph construction: all the previous results will enable the construction of a knowledge graph which, in the context of the parliamentary proceedings addressed in this project, links entities such as members of parliament, political parties and parliamentary groups, committees, Swiss cantons and cities, policy topics, and legislative processes. This will allow political researchers to better navigate the information, analyze network dynamics, make predictions on the graph, etc.

More details on each of these steps are given in the section State of the project.

Impact:

The resulting research platform will be of great value to political scientists, historians, social scientists, and computer scientists. It will create new avenues for data-driven research on topics like political polarization, party cohesion, government formation, strategic behavior, political representation, and party formation. It will allow historians to reconstruct a quantitative account of Swiss political history over the last 125 years. It will enable sociologists to link changing fault lines in the Federal Assembly to shifts in socioeconomic factors. It will provide resources for data-driven journalists. And it will give computer scientists a multilingual ground-truth dataset, with possible applications in opinion mining and machine translation. Moreover, the data processing chain developed to extract structured information from unstructured, scanned records is of interest beyond political science. We see great potential, for example, in the processing and analysis of medical records in health applications and in the mining of historical documents in the digital humanities. Such methods are of growing importance for researchers in the ETH domain, and the project will thus foster the SDSC's attractiveness for those researchers.

State of the project

This section provides more details on the steps tackled so far and on the methods and techniques used.

Data preprocessing

This step comprises the processing of the “Amtliches Bulletin”, the main document released by the Parliament archives, in which most of the speeches are compiled. In mid-2019 we discovered that, from 1921 to 1970, not all speeches can be found there; many appear instead in the “Additional protocols” documents, which required some extra steps to also integrate this information into the corpus.

  • Text extraction: from the original PDF, obtain an XML file containing the extracted text. Since the documents had already been OCRed, this could easily be done with the Python package “pdfminer.six” (see the sketch after this list). Still, we found some errors in these XML files, especially in the early years, which led us to a more detailed preprocessing of these files.

  • XML enrichment: starting from the XML files extracted from the original PDF documents, we proceed with several correction and enrichment steps. We reorder sentences (textlines) correctly and group them into text boxes. We use different methods to look for margins and for horizontal and vertical lines used as dividers. In addition, we implement some first simple rules to differentiate between elements on the page, such as footnotes, text in the first or second column, headers, etc. The result of this step is another XML file, enriched and corrected, which is the one used in subsequent steps. We implemented custom Python scripts to perform these tasks; a minimal reordering sketch is given after this list.

  • Layout analysis: initially we focused on the extraction of the speeches, as they form the majority of the text in the corpus and we needed them to start investigating the next steps. Currently, we are developing a supervised approach that uses different features extracted from the XML file (font size and type, position of text, number of characters in a text box, etc.), together with a set of hand-labeled data, to better identify the different elements in the document (see the classifier sketch after this list). WORK IN PROGRESS

  • Entity identification: in principle, we mainly need to extract names from the text, essentially those that do not belong to politicians and hence cannot be extracted by matching against the list of members of parliament. In addition, the extraction of locations and organizations would be of interest at some point and might constitute another layer of information in the envisaged knowledge graph. For this step we are using BERT, which is based on recent Transformer architectures and for which both multilingual and German language models are available. We are using data from the Europeana Newspapers and CoNLL-2003 datasets to first train the model on German text only, but we still need a large enough set of labels from the DemocraSci dataset in order to properly fine-tune the model (see the NER sketch after this list). Once the method provides sufficiently accurate results, we will extend the problem to the multilingual setup. WORK IN PROGRESS

  • Integration of additional protocols into the main dataset: first efforts to integrate all the information in both the Additional protocols and the Amtliches Bulletin, in order to finally gather together all speeches from politicians since 1921 (otherwise many of the speeches would be lost, as only those given during the discussion of laws that went into the referendum stage were compiled in the Bulletin). The additional protocols are much less structured documents, which requires further processing to properly match the different pieces of text with each of the specific laws being discussed. WORK IN PROGRESS
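
Below are a few small, self-contained sketches of the steps above. First, the text-extraction step: a minimal example of dumping the OCRed text layer of a PDF as XML with pdfminer.six (the file names are hypothetical):

```python
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def pdf_to_xml(pdf_path, xml_path):
    # Dump the embedded (OCRed) text layer of a PDF as pdfminer's XML
    # representation, keeping per-character coordinates and font info.
    with open(pdf_path, "rb") as inf, open(xml_path, "wb") as outf:
        extract_text_to_fp(inf, outf, output_type="xml", laparams=LAParams())

# Hypothetical file names, for illustration only.
pdf_to_xml("amtliches_bulletin_1925.pdf", "amtliches_bulletin_1925.xml")
```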
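
For the XML enrichment step, a minimal sketch of reordering textlines into reading order on a two-column page. The fixed `column_split` coordinate is an assumption for illustration; the actual pipeline detects margins and divider lines per page:

```python
import xml.etree.ElementTree as ET

def bbox(el):
    # pdfminer bboxes are "x0,y0,x1,y1" with the origin at the bottom left.
    x0, y0, x1, y1 = (float(v) for v in el.get("bbox").split(","))
    return x0, y0, x1, y1

def reorder_textlines(page, column_split=300.0):
    # Reading order: left column before right column,
    # top to bottom within each column (y grows upwards).
    def key(line):
        x0, _, _, y1 = bbox(line)
        return (0 if x0 < column_split else 1, -y1)
    return sorted(page.iter("textline"), key=key)

tree = ET.parse("amtliches_bulletin_1925.xml")  # output of the previous step
for page in tree.getroot().iter("page"):
    ordered_lines = reorder_textlines(page)
```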
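
For the layout-analysis step, a sketch of the supervised approach: each text box is described by simple features and classified with, e.g., a random forest. The feature values, labels, and the choice of scikit-learn are illustrative assumptions, not the project's final setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per text box: [font size, x0, y1, number of characters],
# i.e. the kind of features mentioned above. Values and labels are
# toy stand-ins for the hand-labeled data
# (0 = speech text, 1 = header, 2 = footnote).
X_train = np.array([
    [10.0,  60.0, 700.0, 420.0],   # body text, first column
    [10.0, 320.0, 650.0, 390.0],   # body text, second column
    [12.0, 200.0, 780.0,  30.0],   # page header
    [ 8.0,  60.0,  40.0,  80.0],   # footnote
])
y_train = np.array([0, 0, 1, 2])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Classify a newly extracted text box.
print(clf.predict([[10.0, 65.0, 500.0, 350.0]]))
```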
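
For the entity-identification step, a sketch using the Hugging Face transformers token-classification pipeline with a BERT-style model. The checkpoint path is a placeholder for the model being fine-tuned on Europeana Newspapers / CoNLL-2003 data, and the example sentence is illustrative:

```python
from transformers import pipeline

# Placeholder path: stands in for the German BERT NER model
# fine-tuned on Europeana Newspapers / CoNLL-2003 data.
ner = pipeline(
    "ner",
    model="path/to/german-bert-ner-checkpoint",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

text = "Herr Motta antwortet im Namen des Bundesrates."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```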

Natural language analysis

  • Preprocessing of text: standard NLP preprocessing such as tokenisation, POS tagging, lemmatisation, etc., mainly carried out with the Python packages NLTK and Tmtoolkit (see the sketch after this list).

  • Topic modeling: we first investigated the usability of Latent Dirichlet Allocation (LDA) for topic modeling. We had to assess the results qualitatively in order to tune the method to the current corpus. We performed this for two legislative periods, and additionally for longer time spans, aiming in the latter case at capturing historical events that were shaping the discussions of politicians in parliament. The results led to a meaningful set of detailed topics and to an explanation of each document as a mixture of specific topics. In addition, we have made use of dynamic LDA models to capture the temporal evolution of the different topics discussed, their composition, how different terms gain or lose importance over time, etc. In both the static and the dynamic scenarios we still need to run these methods on the full dataset, because so far we have only processed the German text (see the LDA sketch after this list). TO BE RESUMED IN THE FUTURE

  • Named entity disambiguation: we are waiting for more and better results from the entity identification step before proceeding with this task.

  • Multilingualism: the corpus contains text in German, French and Italian, with German being by far the most predominant. However, in order to perform many of the analyses, such as topic modeling or sentiment analysis (likely foreseen for a second stage), we need all the text in a common language. Although we initially considered language embedding approaches, which map all vocabularies into a common latent space in order to properly capture semantics, we finally decided to use neural machine translation (NMT) methods, and more specifically Fairseq, an up-to-date NMT framework with some pre-trained models. Still, we could not find the required models for translating from IT to DE and from FR to DE, and therefore trained these models ourselves using different parallel corpora. We have already obtained some first satisfactory results, but we halted this task and will resume it in the future to obtain better-performing models (by training more complex architectures for longer); a loading sketch is given after this list. TO BE RESUMED IN THE FUTURE
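
A minimal sketch of the text preprocessing with NLTK (tokenization plus stop-word removal; POS tagging and lemmatization are handled analogously, e.g. via Tmtoolkit). The example sentence is illustrative:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models (one-time download)
nltk.download("stopwords")  # stop-word lists, including German

text = "Der Nationalrat berät die Vorlage in der Wintersession."
tokens = [t.lower() for t in word_tokenize(text, language="german")]
tokens = [t for t in tokens
          if t.isalpha() and t not in stopwords.words("german")]
print(tokens)
```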
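
A sketch of the static LDA topic modeling; for illustration we use gensim, whose LdaSeqModel also covers the dynamic case. The toy corpus and parameter values are assumptions, not the project's actual configuration:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `docs` would be the preprocessed token lists, one per speech;
# here a tiny toy corpus stands in for the real data.
docs = [
    ["eisenbahn", "bund", "konzession"],
    ["landwirtschaft", "zoll", "getreide"],
    ["eisenbahn", "tarif", "bund"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

print(lda.print_topics())                  # the learned topics
print(lda.get_document_topics(corpus[0]))  # a document as a topic mixture
```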
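
A sketch of loading and applying one of the custom-trained Fairseq translation models (FR to DE; IT to DE is analogous). All paths, the BPE settings, and the example sentence are placeholders:

```python
from fairseq.models.transformer import TransformerModel

# Hypothetical checkpoint and data paths for the custom FR->DE model.
fr2de = TransformerModel.from_pretrained(
    "checkpoints/fr-de",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/fr-de",
    bpe="subword_nmt",
    bpe_codes="data-bin/fr-de/bpecodes",
)

print(fr2de.translate("Le Conseil national adopte le projet de loi."))
```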

Knowledge graph construction

  • Population of the knowledge graph: populate the database with all currently extracted entities. The knowledge graph is being constructed with Neo4j, where we can easily run queries using the Cypher query language. Moreover, Neo4j can communicate with Python through Py2neo, which also allows the extraction of the queried graphs (see the sketch below). We will continue enriching the graph with all newly extracted entities.
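
A minimal sketch of populating and querying the graph through Py2neo. The connection details, labels, and example entities are illustrative assumptions:

```python
from py2neo import Graph, Node, Relationship

# Connect to a local Neo4j instance (placeholder credentials).
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Merge (create-if-absent) two entities and a relationship between them.
mp = Node("MemberOfParliament", name="Giuseppe Motta", canton="TI")
party = Node("Party", name="KVP")
graph.merge(mp, "MemberOfParliament", "name")
graph.merge(party, "Party", "name")
graph.merge(Relationship(mp, "MEMBER_OF", party))

# Queries are written in Cypher, e.g. all MPs of a given party:
rows = graph.run(
    "MATCH (m:MemberOfParliament)-[:MEMBER_OF]->(p:Party {name: $p}) "
    "RETURN m.name AS name", p="KVP").data()
print(rows)
```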

Publications

  • Luis Salamanca, Laurence Brandenberger, Fernando Pérez-Cruz, Frank Schweitzer. (2021) Towards a Parliamentary Database—Processing over 40,000 PDF Documents and Structuring their Content. In preparation

  • Daria Izzo, Luis Salamanca, Laurence Brandenberger, Fernando Pérez-Cruz, Frank Schweitzer. (2021) Analysis of Populism in the Swiss Parliament throughout history. In preparation