Resources

Introduction to Machine Learning - 10-701/15-781

Datasets

Here's a very incomplete and short list of datasets. This is really just to get you started and I encourage you to think beyond the scope of pre-made datasets.

Some problems

  • Design a streaming algorithm to find frequent items. Note that the distribution might change over time. A possible strategy is to modify the Apriori algorithm.

  • Use secondary information to improve collaborative filtering, e.g. for the Netflix problem you could incorporate IMDB and Wikipedia.

  • Financial forecasting as a high-dimensional multivariate regression problem. E.g. you could try predicting the prices of a very large number of securities at the same time, possibly using news, tweets, and financial data releases to improve the estimates beyond a simple technical analysis.

  • Detect trends, e.g. in the Twitter stream. Forecast tomorrow's keywords today. How quickly can you detect new events (earthquakes, assassinations, elections)?

  • Nonlinear function classes. Can you find efficient sets of basis functions that are both fast to compute and sufficiently nonlinear to address a large set of estimation problems?

  • Parallel decision trees. Can you design a data-parallel decision tree / boosted decision tree algorithm? The published results are essentially sequential in the construction of the trees. One suggestion would be to take the Random Forests algorithm, re-interpret it as a Pitman estimator sampling from the version space of consistent trees, and then extend it to other objectives.
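
For the first problem above (frequent items in a stream), one possible starting point is a counter-based summary. The sketch below uses the Misra-Gries algorithm rather than the Apriori modification suggested in the list. The function name `misra_gries` and the parameter `k` are illustrative choices, not part of the course material. It assumes a stationary stream and does not handle a distribution that changes over time; a sliding window or decayed counts would be one way to extend it.

```python
def misra_gries(stream, k):
    """Return candidate frequent items using at most k - 1 counters.

    Any item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The stored counts are
    lower bounds on the true counts, and the result may also contain
    items that are not actually frequent.
    """
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count it.
            counters[item] += 1
        elif len(counters) < k - 1:
            # A free counter is available: start tracking the item.
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop the
            # ones that reach zero. The new item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the data (when that is possible) can be used to compute exact counts for the returned candidates and discard false positives.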