Resources

Datasets

Here’s a very incomplete and short list of datasets. This is really just to get you started and I encourage you to think beyond the scope of pre-made datasets.

Yahoo webscope datasets. There are plenty of them free for download. However, you need to sign up individually since the datasets typically come with noncommercial restrictions.
Netflix challenge data on Kaggle.
IMDB data
Twitter gardenhose
AOL query log
GigaDB bioinformatics database. Try e.g. searching for homo sapiens.
TREC datasets (text retrieval).
Linguistic Data Consortium homepage
Stanford Social Networks datasets
Frequent itemset mining data
Wikipedia dump
Amazon AWS public datasets

Some problems

Design a streaming algorithm to find frequent items. Note that the distribution might change over time. A possible strategy is to modify the a-priori algorithm.
Use secondary information to improve collaborative filtering, e.g. for the Netflix problem you could incorporate IMDB and Wikipedia.
Financial forecasting as a high-dimensional multivariate regression problem. E.g. you could try predicting the prices of a very large number of securities at the same time, possibly using news, tweets, and financial data releases to improve the estimates beyond a simple technical analysis.
Detect trends, e.g. in the Twitter stream. Forecast tomorrow’s keywords today. How quickly can you detect new events (earthquakes, assassinations, elections)?
Nonlinear function classes. Can you find efficient sets of basis functions that are both fast to compute and sufficiently nonlinear to address a large set of estimation problems?
Parallel decision trees. Can you design a data-parallel decision tree / boosted decision tree algorithm? The published results are essentially sequential in the construction of the trees. One suggestion would be to take the Random Forests algorithm, re-interpret it as a Pitman estimator sampling from the version space of consistent trees, and then extend it to other objectives.
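
As a starting point for the frequent items problem above, here is a minimal sketch of a one-pass summary. Note that it uses the Misra-Gries algorithm, not a modification of a-priori as suggested in the problem statement, and it does not yet deal with a distribution that changes over time. The function name misra_gries and the parameter k are chosen here for illustration only.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters.

    Any item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The stored counts are
    lower bounds: each underestimates the true count by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop the ones
            # that reach zero. The new item itself is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # Toy stream (made up for this example): "a" makes up half the items.
    stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
    print(misra_gries(stream, k=3))
```

The returned keys are only candidates, so a second pass (or a held-out sample) is needed to confirm which of them are truly frequent. One possible way to handle drift, left open here, would be to restart or down-weight the counters over time windows.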