# Overview

This repository contains several datasets accompanying the paper (https://arxiv.org/pdf/1812.00474.pdf):

Lemmerich, Florian, Diego Sáez-Trumper, Robert West, and Leila Zia. "Why the World Reads Wikipedia: Beyond English Speakers." Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (ACM WSDM 2019).

It includes the following datasets and figures:

* **country_plots.zip**: Full set of plots akin to Figure 4 in the paper, showing the correlation between the Human Development Index (HDI) and survey responses.
* **country_plots_en.zip**: HDI vs. survey response plots, restricted to English-speaking countries.
* **country_plots_es.zip**: HDI vs. survey response plots, restricted to Spanish-speaking countries.
* **country-level_overview.csv**: For each country-language pair with sufficient data, the distribution of survey responses and the country-specific features used in modeling.
* **es_topics.zip**: Plots of HDI and other socio-economic factors vs. each LDA topic, restricted to Spanish-speaking countries.
* **full_results.csv**: Full table of weighted survey response proportions for each language.
* **pattern_plots.zip**: All pairs for Figure 3 of the paper, showing the relationships between Wikipedia usage patterns and survey answers.
* **responses.zip**: 50% sample of raw survey responses for all languages, with associated article data from the respondent's session. Session data (e.g., number of pages, day of week, time of day) is not included, to protect user privacy.
* **topics.zip**: For every language, the topics from our LDA models. This includes each topic along with its top keywords and the top and random articles associated with it.
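
To get a first look at these files, something like the following sketch may help. It uses only the Python standard library and makes no assumptions about column names or archive layout, since neither is documented here: it prints whatever header row and member names it finds. The helper names `peek_csv` and `list_archive` are illustrative and not part of this repository, and UTF-8 encoding for the CSV files is an assumption.

```python
import csv
import zipfile
from itertools import islice
from pathlib import Path


def list_archive(zip_path):
    """Return the member names of a zip archive without extracting it."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


def peek_csv(csv_path, n_rows=5, encoding="utf-8"):
    """Return (header, first n_rows data rows) of a CSV file.

    The encoding default is an assumption; adjust it if decoding fails.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    # File names are taken from the list above; each is skipped if absent.
    for name in ("full_results.csv", "country-level_overview.csv"):
        if Path(name).exists():
            header, rows = peek_csv(name)
            print(name, "columns:", header)
            for row in rows:
                print("  ", row)

    for name in ("topics.zip", "responses.zip"):
        if Path(name).exists():
            members = list_archive(name)
            print(name, "contains", len(members), "entries, e.g.", members[:10])
```

Run it from the repository root after downloading the files. Listing archive members before extracting is a cheap way to see how the per-language files are organized.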