SML: Resources

STATISTICS 241B, COMPUTER SCIENCE C281B

Datasets

Here's a very incomplete and short list of datasets. This is really just to get you started and I encourage you to think beyond the scope of pre-made datasets.

  • Yahoo Webscope datasets. Plenty of them are free to download. However, you need to sign up for each one individually since the datasets typically come with noncommercial restrictions.

  • Netflix challenge data is not officially available any more. However, a quick web search will help you locate it.

  • IMDB data

  • Twitter gardenhose

  • AOL query log

  • GigaDB bioinformatics database. Try e.g. searching for homo sapiens.

  • TREC datasets (text retrieval).

  • Linguistic Data Consortium homepage

  • Stanford Social Networks datasets

  • Frequent itemset mining data

  • Wikipedia

Some problems

  • Design a streaming algorithm to find frequent items. Note that the distribution might change over time. A possible strategy is to modify the a-priori algorithm.

  • Use secondary information to improve collaborative filtering, e.g. for the Netflix problem you could incorporate IMDB and Wikipedia.

  • Financial forecasting as a high-dimensional multivariate regression problem. E.g. you could try predicting the prices of a very large number of securities at the same time, possibly using news, tweets, and financial data releases to improve the estimates beyond a simple technical analysis.

  • Detect trends, e.g. in the Twitter stream. Forecast tomorrow's keywords today. How quickly can you detect new events (earthquakes, assassinations, elections)?

  • Nonlinear function classes. Can you find efficient sets of basis functions that are both fast to compute and sufficiently nonlinear to address a large set of estimation problems?

  • Parallel decision trees. Can you design a data-parallel decision tree / boosted decision tree algorithm? The published results are essentially sequential in the construction of the trees. One suggestion would be to take the Random Forests algorithm, re-interpret it as a Pitman estimator sampling from the version space of consistent trees, and then extend it to other objectives.
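
As a possible starting point for the streaming frequent items problem above, here is a minimal sketch of the Misra-Gries summary. Note that this is not the a-priori modification suggested in the problem statement but a different, standard baseline, and it assumes a stationary stream: it does nothing about a distribution that changes over time. The function name misra_gries and the parameter k are chosen here for illustration only.

```python
def misra_gries(stream, k):
    """One-pass summary using at most k - 1 counters.

    Every item that occurs more than n / k times in a stream of
    length n is guaranteed to appear among the returned keys.
    The returned counts are lower bounds, and some returned keys
    may be false positives.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # "a" makes up more than half of this toy stream.
    print(misra_gries(["a", "b", "a", "c", "a", "d", "a"], k=2))
```

A second pass over the data (or a held-out window) is needed to confirm which candidates are truly frequent. To cope with drift you might, for instance, run the summary over a sliding window or decay the counters over time; how best to do that is part of the problem.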