
Thursday, 9 May 2024

Thursday, 9 May 2024: The German National Library will be closed at both locations. The exhibitions of the German Museum of Books and Writing will open from 10:00 to 18:00.

Wednesday, 22 May 2024

Wednesday, 22 May 2024: The German National Library in Leipzig will be closed due to a staff outing. The exhibitions of the German Museum of Books and Writing will open from 10:00 to 18:00.

Call for Participation: Twitter Datasprint

Datasprint

Do you have research questions for which you are keen to analyse large volumes of German-language tweets? Are Twitter data of interest for your research in the humanities or social, natural or life sciences? Or do you have a passion for visualising social media data?

On 21 and 22 March 2024, the German National Library in Frankfurt am Main will be offering a remarkable opportunity: in cooperation with GESIS and the NFDI consortia BERD@NFDI, KonsortSWD, NFDI4Culture, NFDI4Data Science, NFDI4Memory, Text+ and KDH UB HU Berlin, we would like to invite you to a two-day data sprint for which we will be providing three unique, extensive corpora of Twitter data.

These corpora lend themselves to many kinds of methodological and analytical work. You will find a few examples in the GESIS blog.

Where the legal situation permits, the organising institutions will disseminate the results through their websites and mailing lists, thus making them visible in the relevant communities. Scenarios in which the results are reused by the organisers are also possible and desirable. Small prizes will be awarded for all projects.

Application

Anyone who already works with social media data for research purposes or is planning to do so can apply to take part, as can creatives, developers, librarians, archivists and (media) artists. Please provide a brief description of your specific research question, what you are planning to do (e.g. catalogue data, create derivatives, topic models for specific hashtags), your motivation for taking part, your expertise or professional background, and the relevant skills you can contribute. You are requested to use the application form for this purpose.

The data sprint will start with an ideas pitching session followed by team-building. Attendees can also decide on the spot whether to take part in projects being realised by other participants.

Please complete the application form by 14 December 2023.

You will be informed whether your application has been accepted by 22 December 2023.

Structure

Thursday, 21 March 2024

10:00 Welcome

10:15 Ideas pitching

10:45 Team Building

12:00 Lunch break

13:00 Start Datasprint
(until 22:00 at the latest)

Friday, 22 March 2024

from 9:00 Continue Datasprint

12:00 Lunch break

13:00 Presentation of results & discussion

15:00 End of event

The data

Two of the corpora contain German-language Twitter data dating from 2006 – 2011 and 2014 – 2023, while the third corpus consists of a one-percent random sample of tweets spanning a 10-year period:

Corpus 1: 2006 – 2011. The corpus encompasses approx. 220 million tweets posted between March 2006 (platform launch) and June 2011. The data were collected using a search function that filtered all tweets labelled by Twitter as German-language. The corpus contains all the metadata that were available for each tweet via the Twitter API. The data are stored in multiple files in JSONL (line-oriented JSON, one tweet per line).
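As a rough illustration of how a JSONL file of this kind could be processed, the following Python sketch reads one tweet per line and counts tweets per year. It is not the organisers' tooling. The field name `created_at` and its date format follow the classic Twitter API v1.1 convention and are assumptions about the corpus; the sample lines are invented for the example.

```python
import io
import json
from collections import Counter
from datetime import datetime

# Assumption: dates look like "Tue Mar 21 20:50:14 +0000 2006" (Twitter API v1.1 style).
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def iter_tweets(lines):
    """Yield one dict per non-empty JSONL line; skip lines that are not valid JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def count_by_year(tweets, date_field="created_at"):
    """Count tweets per posting year; tweets without a parseable date are ignored."""
    counts = Counter()
    for tweet in tweets:
        try:
            posted = datetime.strptime(tweet[date_field], DATE_FORMAT)
        except (KeyError, TypeError, ValueError):
            continue
        counts[posted.year] += 1
    return counts


# Invented sample data standing in for one corpus file.
SAMPLE_JSONL = (
    '{"id_str": "1", "text": "Hallo Welt", "created_at": "Tue Mar 21 20:50:14 +0000 2006"}\n'
    '\n'
    'this line is not JSON\n'
    '{"id_str": "2", "text": "Guten Morgen", "created_at": "Wed Jun 01 08:00:00 +0000 2011"}\n'
    '{"id_str": "3", "text": "Gute Nacht", "created_at": "Wed Jun 01 22:30:00 +0000 2011"}\n'
)

if __name__ == "__main__":
    # With a real file, io.StringIO(...) would be replaced by open(path, encoding="utf-8").
    print(count_by_year(iter_tweets(io.StringIO(SAMPLE_JSONL))))
```

Because the file is read line by line, memory use stays flat regardless of file size, which matters for a corpus of this volume.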

Corpus 2: 2014 – 2023. This corpus contains approx. 2 billion German-language tweets that were collected in real time with no content filtering. The data were collected using the Scheffler criteria (2014), i.e. tweets were kept if they contain German function words ('und', 'sie', 'dass') and pass through a language filter. Besides the text, the corpus contains only a few metadata fields: the tweet and user ID, the date and time of posting, the reply-to ID and (for the majority of the data) the geographic coordinates. The corpus thus constitutes a representative cross-section of all German-language tweets between July 2014 and mid-March 2023. The data are available in CSV files (one tweet per line, metadata in columns).
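To make the function-word idea concrete, here is a minimal Python sketch that reads CSV rows and keeps those whose text contains at least one of the three function words named above. It is only an illustration, not the original collection pipeline: the real word list is longer, the subsequent language filter is not reproduced, and the header row with the column names `tweet_id` and `text` is a hypothetical layout.

```python
import csv
import io
import re

# Only the three examples mentioned in the text; the actual list is an unknown here.
FUNCTION_WORDS = {"und", "sie", "dass"}

# \w is Unicode-aware for str patterns in Python 3, so umlauts stay inside tokens.
WORD_RE = re.compile(r"\w+")


def has_function_word(text):
    """Return True if any lower-cased token of the text is a listed function word."""
    tokens = {token.lower() for token in WORD_RE.findall(text)}
    return not FUNCTION_WORDS.isdisjoint(tokens)


def filter_rows(rows, text_column="text"):
    """Yield only the rows whose text column contains a function word."""
    for row in rows:
        if has_function_word(row.get(text_column) or ""):
            yield row


# Invented sample data; column names are assumptions, not the corpus schema.
SAMPLE_CSV = (
    "tweet_id,text\n"
    '1,"Kaffee und Kuchen, bitte"\n'
    "2,Good morning everyone\n"
    '3,"Ich glaube, dass es regnet"\n'
)

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    for row in filter_rows(reader):
        print(row["tweet_id"], row["text"])
```

Matching on whole tokens rather than substrings avoids false hits such as 'und' inside 'Hund'.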

Corpus 3: 2013 – 2023. TweetsKB is a Twitter archive based on Twitter's 1% random sample API that contains a total of 14 billion tweets together with their metadata. Along with the texts and metadata, which are available in JSON format, the archive also contains annotated features such as entities and sentiments.

The institutions organising the event will gladly compile data subsets that are tailored to your research questions. It is also possible to create specific derivatives or pre-processing steps for the data (e.g. tokenisation, n-grams) as well as compilations of tweets (e.g. for one or more hashtags, a list of accounts, extraction of hashtags, links etc.). Please note relevant requirements in the application form.
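For orientation, the kinds of derivatives mentioned here (hashtag extraction, tokenisation, n-grams) can be sketched in a few lines of Python. This is a deliberately simple, regex-based illustration with hypothetical function names; it says nothing about the tools the organisers will actually use, and a serious analysis would likely need a tweet-aware tokeniser.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
TOKEN_RE = re.compile(r"\w+")


def extract_hashtags(text):
    """Return the hashtags in a tweet, lower-cased and without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def tokenize(text):
    """Split text into lower-cased word tokens; punctuation and '#' are dropped."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def ngrams(tokens, n):
    """Return all contiguous n-token tuples; empty if there are fewer than n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tweet = "Heute #Buchmesse in #Frankfurt"  # invented example text
    print(extract_hashtags(tweet))
    print(ngrams(tokenize(tweet), 2))
```

Since the working computers have no internet connection, a sketch like this relies only on the Python standard library; additional packages would need to be requested in advance.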

Mentors with in-depth knowledge of the datasets and various programming languages will supervise and assist participants during the event. The number of participants is limited.

Conditions for participation

  • Because of the legal framework, work on and with the corpora may only be carried out in the rooms and on the equipment of the German National Library. These limitations may mean that certain results obtained during the data sprint cannot be published (to be determined in each individual case).
  • 14 computer terminals will be available for the data sprint, and up to two participants can work together at each terminal. Please note in the application form whether you are applying as a team or are willing to work with a team.
  • Any open source software and computing capacity required for this work can be provided subject to availability. Please specify your requirements in as much detail as possible in the application form so that the appropriate resources can be pre-installed as necessary.
  • Virtual Linux machines will be made available. Wi-Fi will be available for the participants' own devices; for legal reasons, however, the computers used to work with the Twitter data will have no internet connection.
  • Participants who do not have their own travel funding may apply for up to 300 euros to cover travel and accommodation costs.

Last changes: 30.11.2023
