| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| Readme.md | 2024-04-02 | 2.3 kB | |
| EpiBank-Pak.csv | 2024-04-02 | 677.5 kB | |
| EpiBank-Geo.csv | 2024-04-02 | 656.9 kB | |
| EpiBank.csv | 2024-04-02 | 484.0 kB | |
| Totals: 4 Items | | 1.8 MB | 0 |
EpiBank Dataset
The repository contains the EpiBank dataset, developed for epidemic surveillance of six diseases: COVID19, Flu, Hepatitis, Dengue, Malaria, and HIV/AIDS. Overall, the dataset contains 271 million English tweets related to these diseases, collected over a period of six months (March 2020 to August 2020). We also derive two datasets, EpiBank-Global and EpiBank-Pak, from EpiBank by mapping the geo-location of tweets. EpiBank-Global contains 86 million tweets mapped to 190 countries, while EpiBank-Pak consists of 2.6 million tweets mapped to 346 cities and locations in Pakistan.
The repository contains three data files: EpiBank.csv, EpiBank-Global (listed as EpiBank-Geo.csv), and EpiBank-Pak (EpiBank-Pak.csv). EpiBank contains the following fields:
- Tweet_id: The ID assigned to the tweet by Twitter.
- DiseaseCategory: The names of the diseases associated with the tweet. A tweet can carry multiple disease labels.
- Hashtags: The list of hashtags mentioned in the text of the tweet.
- UserCategory: The bot or human label assigned to the user. A value of 0 represents a human account and 1 represents a bot account.
- Sentiment: The overall sentiment of the tweet, computed using the Valence Aware Dictionary and sEntiment Reasoner (VADER) algorithm.
- Fakeness: The fake or real label assigned to the tweet. A value of 0 represents a real tweet and 1 represents a fake tweet. Note that the fakeness value is assigned only to tweets related to COVID19.
- GeoLocation: The list of geo-locations assigned to the tweet. This field is available only in the EpiBank-Global and EpiBank-Pak datasets.
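The field descriptions above can be put to work with a short filter. The following is only a minimal sketch: it assumes the CSV header names match the field names listed above exactly, that UserCategory is stored as the literal text 0 or 1, and that multiple disease labels sit together in one text cell (the exact encoding is not documented here, so a case-insensitive substring test is used). The function name `filter_rows` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def filter_rows(csv_file, disease, humans_only=True):
    """Yield rows whose DiseaseCategory mentions `disease`."""
    for row in csv.DictReader(csv_file):
        # Assumption: all disease labels of a tweet share one text cell,
        # so match by case-insensitive substring instead of parsing it.
        if disease.lower() not in row["DiseaseCategory"].lower():
            continue
        # Per the field description: 0 = human account, 1 = bot account.
        if humans_only and row["UserCategory"].strip() != "0":
            continue
        yield row


# Made-up sample rows for illustration only; not real dataset content.
SAMPLE = """Tweet_id,DiseaseCategory,Hashtags,UserCategory,Sentiment,Fakeness
1,"['COVID19']","['#covid']",0,0.42,0
2,"['Flu', 'COVID19']","[]",1,-0.10,1
3,"['Malaria']","['#malaria']",0,0.00,
"""

rows = list(filter_rows(io.StringIO(SAMPLE), "covid19"))
print([r["Tweet_id"] for r in rows])
```

To run the same filter on a downloaded file, pass an open file handle instead of the in-memory sample, for example `open("EpiBank.csv", newline="", encoding="utf-8")`. Because the generator reads one row at a time, the whole file never has to be held in memory.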
Due to Twitter's terms of service, only tweet IDs are shared publicly. Tweet IDs can be hydrated, that is, expanded back into full tweet objects, using tools such as the following:
- DocNow Hydrator: https://github.com/DocNow/hydrator
- CrisisNLP Tool: https://crisisnlp.qcri.org/#resource8
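Hydration tools generally take a plain text file with one tweet ID per line as input. As a rough sketch of preparing such a file, the snippet below copies the Tweet_id column out of a CSV. It assumes the column header is exactly `Tweet_id`; the helper name `write_tweet_ids` and the sample IDs are hypothetical. IDs are kept as strings throughout, since tweet IDs are large integers that can lose precision if converted to floating point.

```python
import csv
import io


def write_tweet_ids(csv_file, out_file):
    """Write the Tweet_id column to out_file, one ID per line.

    Returns the number of IDs written.
    """
    count = 0
    for row in csv.DictReader(csv_file):
        # Keep the ID as text; never convert it to a float.
        tweet_id = row["Tweet_id"].strip()
        if tweet_id:  # skip rows with an empty ID cell
            out_file.write(tweet_id + "\n")
            count += 1
    return count


# Made-up IDs for illustration only; not real tweet IDs.
sample = io.StringIO("Tweet_id,DiseaseCategory\n1001,COVID19\n1002,Flu\n")
out = io.StringIO()
n = write_tweet_ids(sample, out)
print(n)
```

With real data, replace the two in-memory buffers with file handles opened on the dataset CSV and on an output text file, then load that text file into the hydration tool of your choice.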
Citation
@article{bilaltahir,
title={Leveraging Social Computing for Epidemic Surveillance: A Case Study},
author={Bilal Tahir and Muhammad Amir Mehmood},
journal={Journal of Big Data Research},
year={2024}
}