Datasets

Every project needs a rich dataset for learning and implementing data-driven solutions. Below are some examples of open-source datasets that could be used for the course project.

Strategies & Resources for Finding Research Datasets

There are many resources for finding high-quality open-source datasets for your course project. One strategy is to search for recent research papers (e.g., within the last 3 years) on your topic of interest published in Nature venues such as Nature Biomedical Engineering, Nature Medicine, and npj Digital Health. Given the data sharing policy of these journals, most papers are required to share a link to their dataset under the "Data Availability" section of the paper. This strategy can help you find recent high-quality datasets for interesting and impactful research.

Other example resources are:

  1. IEEE DataPort

  2. Kaggle Datasets

  3. NIH Data Sharing Resources

Note: When you find a dataset that seems to fit your research, it is your responsibility as the researcher to evaluate its quality right away. Do not assume that the dataset is high-quality; that assumption is often incorrect. Some important criteria for evaluating datasets include:

  1. Volume: The dataset should not be too small, because a small dataset limits the types of questions you can ask and answer.

  2. Sparsity: The dataset should not be too sparse, because sparse data makes it challenging to build robust learning models.

  3. Age: The dataset should not be too old (e.g., over 5 or 10 years old), because other researchers have likely already exhausted its usefulness (i.e., it may be difficult for you to find new questions to ask and answer with such a dataset).
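The three criteria above can be checked quickly before committing to a dataset. The following is a minimal Python sketch, not a prescribed course method: the function name `summarize`, the set of markers treated as missing, and the tiny sample table are all hypothetical, and sparsity is approximated here as the fraction of missing cells. The course does not define numeric thresholds, so the sketch only reports the numbers and leaves the judgment to you.

```python
import csv
import io
from datetime import date

# Values treated as "missing"; this set is an assumption, adjust per dataset.
MISSING = {"", "NA", "NaN", "null", "None"}


def summarize(rows, release_year, today=None):
    """Return rough volume, sparsity, and age figures for a tabular dataset.

    rows: list of dicts, e.g. from csv.DictReader.
    release_year: year the dataset was published (looked up by hand).
    """
    today = today or date.today()
    columns = list(rows[0].keys()) if rows else []
    total_cells = len(rows) * len(columns)
    # Count cells that are absent or hold one of the missing markers.
    missing_cells = sum(
        1
        for row in rows
        for col in columns
        if row.get(col) is None or str(row[col]).strip() in MISSING
    )
    return {
        "n_rows": len(rows),                      # volume
        "n_columns": len(columns),
        "missing_fraction": (                     # sparsity proxy
            missing_cells / total_cells if total_cells else 0.0
        ),
        "age_years": today.year - release_year,   # age
    }


# Tiny made-up sample so the sketch runs without any external file.
sample_csv = """patient_id,heart_rate,glucose
1,72,5.4
2,,6.1
3,80,NA
4,65,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
report = summarize(rows, release_year=2021, today=date(2024, 1, 1))
print(report)
```

For a real dataset, replace `sample_csv` with the opened file and set `release_year` from the paper or the repository page. A high `missing_fraction` or a small `n_rows` is a prompt to look closer, not an automatic rejection.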


Example Datasets

  • A starting list of example datasets can be found here.

    • Students can choose from this list or find their own preferred dataset for the course project.