Resources
Datasets
Here’s a very incomplete and short list of datasets. This is really just to get you started and I encourage you to think beyond the scope of pre-made datasets.
- Yahoo Webscope datasets. There are plenty of them, free to download. However, you need to sign up for each one individually since the datasets typically come with noncommercial restrictions.
- Netflix challenge data on Kaggle.
- IMDB data
- Twitter gardenhose
- AOL query log
- GigaDB bioinformatics database. Try e.g. searching for Homo sapiens.
- TREC datasets (text retrieval).
- Linguistic Data Consortium homepage
- Stanford Social Networks datasets
- Frequent itemset mining data
- Wikipedia dump
- Amazon AWS public datasets
Some problems
- Design a streaming algorithm to find frequent items. Note that the distribution might change over time. A possible strategy is to modify the a-priori algorithm.
- Use secondary information to improve collaborative filtering, e.g. for the Netflix problem you could incorporate IMDB and Wikipedia.
- Financial forecasting as a high-dimensional multivariate regression problem. E.g. you could try predicting the prices of a very large number of securities at the same time, possibly using news, tweets, and financial data releases to improve the estimates beyond a simple technical analysis.
- Detect trends, e.g. in the tweet stream. Forecast tomorrow’s keywords today. How quickly can you detect new events (earthquakes, assassinations, elections)?
- Nonlinear function classes. Can you find efficient sets of basis functions that are both fast to compute and sufficiently nonlinear to address a large set of estimation problems?
- Parallel decision trees. Can you design a data-parallel decision tree / boosted decision tree algorithm? The published results are essentially sequential in the construction of the trees. One suggestion would be to take the Random Forests algorithm, re-interpret it as a Pitman estimator sampling from the version space of consistent trees, and then extend it to other objectives.
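As a possible starting point for the first problem above (streaming frequent items), here is a minimal sketch of a one-pass counter-based summary. Note that this is the Misra-Gries algorithm, not the a-priori modification suggested in the problem statement, and that it does not handle a distribution that changes over time; a sliding window or decayed counts would be one way to extend it. The function name `misra_gries` and the parameter `k` are illustrative choices, not part of the original text.

```python
def misra_gries(stream, k):
    """One-pass summary using at most k - 1 counters.

    Every item occurring more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The returned counts are
    lower bounds on the true frequencies, not exact values.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count it.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Free slot available: start tracking the item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: "a" makes up half of the 12 items.
toy_stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
candidates = misra_gries(toy_stream, k=3)
print(candidates)
```

The returned keys are only candidates. If exact frequencies matter, a second pass over the data (where that is possible) can verify them.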