## Preprocessed graphs

We provide several versions of the preprocessed `MAG-Scholar` graph. They are stored in a compressed `.npz` format using `numpy.savez_compressed`. If disk space is not an issue, you can speed up loading by re-saving the graph in a non-compressed `.npz` format using `numpy.savez`.

We provide two versions of the graph, both of which we use in the paper:

* `mag_coarse.npz`, which has 8 coarse-grained labels `['bio', 'bus', 'chm', 'eng', 'hum', 'med', 'phy', 'soc']` corresponding to 8 broad subject areas.
* `mag_fine.npz`, which has 253 fine-grained labels `['eng_datamininganalysis', 'phy_discretemathematics', 'med_audiologyspeechlanguagepathology', ...]` corresponding to more fine-grained subject areas.

Both versions of the graph have been standardized, i.e. we only keep nodes that belong to the largest connected component, and we make the graph undirected. Non-standardized versions can be provided upon request.

Some topics are listed under two broad subject areas by Google Scholar, e.g. `eng_architecture` and `soc_architecture`. Since we could not reliably decide which subject area to assign them to, these topics are excluded from the coarse-grained version of the dataset. Therefore, `mag_fine.npz` has a slightly larger number of nodes (around `12.4 M`) compared to `mag_coarse.npz` (around `10.5 M`).

The `class_names` field of the `SparseGraph` object contains the mapping from each label ID to its name, e.g. `[1->chm, 2->hum]`.

## Preprocessing script

You can use the `mag_preprocessing.ipynb` notebook to preprocess the raw MAG data into an attributed graph with "ground-truth" node labels. The labels are obtained by mapping publishing venues (i.e. journals and conferences) to topics / subject areas based on the categorization by Google Scholar. The list of topics, which we extracted manually, is stored in `google_scholar_topics.txt` (see the notebook for more details).
Note that some of the topics might have changed since the data was originally processed (they may have a different ranking and/or no longer exist), which can affect the end result.

## Raw MAG data

We downloaded the raw MAG data directly from Microsoft. See [this link](https://docs.microsoft.com/en-us/academic-services/graph/get-started-setup-provisioning) for more details. This data and the derivatives which we obtain are available under the Open Data License: [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).

The MAG repository is continuously updated, and we obtained a snapshot of the data on 25.01.2019. The raw data contains several files which serve as input to the preprocessing script. We do not provide these files. Use the above link to download the latest version, or contact Aleksandar Bojchevski (`a.bojchevski@in.tum.de`) if you would like to obtain access to the exact snapshot that we used.

The relevant files are:

* `Journals.txt` and `ConferenceSeries.txt` contain data about the publishing venues. This data is matched to Google Scholar.
* `Papers.txt` contains data about each paper and its venue.
* `PaperReferences.txt` contains the citation data (i.e. edges in the graph). We use it to construct the sparse adjacency matrix.
* `PaperAbstractsInvertedIndex.txt` contains a mapping between paper IDs and the words in a paper's abstract. We use it to construct the sparse attribute matrix.

## Citation

Please cite the following paper if you use this data:

```
@inproceedings{bojchevski2020pprgo,
  author    = {Aleksandar Bojchevski and Johannes Klicpera and Bryan Perozzi and Amol Kapoor and Martin Blais and Benedek R{\'{o}}zemberczki and Michal Lukasik and Stephan G{\"{u}}nnemann},
  title     = {Scaling Graph Neural Networks with Approximate PageRank},
  booktitle = {Proceedings of the 26th {ACM} {SIGKDD} International Conference on Knowledge Discovery and Data Mining},
  year      = {2020},
  publisher = {{ACM}},
  url       = {https://arxiv.org/abs/2007.01570},
}
```
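
## Example: re-saving a graph without compression

The "Preprocessed graphs" section mentions that loading can be sped up by re-saving a graph with `numpy.savez` instead of `numpy.savez_compressed`. The following is a minimal sketch of that step, not part of the original code base. It assumes the whole archive fits in memory. The helper name `resave_uncompressed` and the `allow_pickle` switch are introduced here for illustration, and the sketch makes no assumption about which arrays the archive contains: it simply copies every stored array.

```python
import os
import tempfile

import numpy as np


def resave_uncompressed(src_path, dst_path, allow_pickle=False):
    """Copy every array of a .npz archive into a non-compressed .npz archive.

    Returns the sorted list of array names that were copied.
    """
    # allow_pickle=False is the safe default. Setting it to True is only
    # needed if the archive stores object arrays, and only for trusted files.
    with np.load(src_path, allow_pickle=allow_pickle) as archive:
        # Indexing the archive reads each array fully into memory.
        arrays = {name: archive[name] for name in archive.files}

    # numpy.savez stores the arrays without compression.
    np.savez(dst_path, **arrays)
    return sorted(arrays)


if __name__ == "__main__":
    # Toy demonstration on a small synthetic archive (not the real graph).
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "toy_compressed.npz")
        dst = os.path.join(tmp, "toy_uncompressed.npz")
        np.savez_compressed(src, values=np.zeros(10000), labels=np.arange(8))

        names = resave_uncompressed(src, dst)
        print("copied arrays:", names)
        print("compressed size (bytes):  ", os.path.getsize(src))
        print("uncompressed size (bytes):", os.path.getsize(dst))
```

For the provided graphs, a call could look like `resave_uncompressed("mag_coarse.npz", "mag_coarse_uncompressed.npz")`, where the output file name is a hypothetical choice. The uncompressed file takes more disk space, which is the trade-off for faster loading.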