SML: Resources
STATISTICS 241B,
COMPUTER SCIENCE C281B
Datasets
Here's a very incomplete and short list of datasets. This is really
just to get you started and I encourage you to think beyond the scope
of premade datasets.
Some problems
Design a streaming algorithm to find frequent items. Note that the
distribution might change over time. A possible strategy is to
modify the Apriori algorithm.
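As a starting point for the frequent-items problem, here is a minimal sketch of a one-pass summary. It does not modify Apriori; it swaps in the Misra-Gries summary, which keeps at most k - 1 counters and retains every item that occurs more than n/k times in a stream of length n. The function name and the toy stream are illustrative assumptions, and the sketch does not handle a distribution that changes over time; decaying or periodically resetting the counters would be one direction to explore.

```python
def misra_gries(stream, k):
    """One-pass summary using at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n
    is guaranteed to be a key of the returned dict. Stored counts are
    lower bounds on the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream (assumed data): "a" occurs 9 times out of n = 12.
stream = ["a"] * 6 + ["b", "c", "d"] + ["a"] * 3
summary = misra_gries(stream, k=3)
```

The summary may also contain infrequent items, so a second pass (or an exact count over the surviving candidates) is needed when false positives matter.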
Use secondary information to improve collaborative filtering,
e.g. for the Netflix problem you could incorporate IMDB and
Wikipedia.
Financial forecasting as a high-dimensional multivariate regression
problem. E.g. you could try predicting the prices of a very large
number of securities at the same time, possibly using news, tweets,
and financial data releases to improve the estimates beyond a simple
technical analysis.
Detect trends, e.g. in the Tweet stream. Forecast tomorrow's keywords
today. How quickly can you detect new events (earthquakes,
assassinations, elections)?
Nonlinear function classes. Can you find efficient sets of basis
functions that are both fast to compute and sufficiently nonlinear
to address a large set of estimation problems?
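One known candidate to benchmark against is random Fourier features, which approximate a Gaussian kernel with randomly drawn cosine basis functions. The sketch below is only an illustration of that idea, not a suggested answer to the problem: the function names, the feature count, the bandwidth sigma, and the test points are all assumptions.

```python
import math
import random


def make_rff(dim, num_features, sigma, seed=0):
    """Return a feature map z with z(x).z(y) ~ exp(-|x - y|^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, 1/sigma^2), phases uniform on [0, 2*pi].
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return features


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel, used here only to check the approximation."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


features = make_rff(dim=3, num_features=2000, sigma=1.0)
x, y = [0.1, -0.4, 0.7], [0.3, 0.2, 0.5]
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = gaussian_kernel(x, y, sigma=1.0)
```

The cost per point is one matrix-vector product plus a cosine per feature, so the interesting question is whether structured or cheaper constructions reach similar accuracy with less computation.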
Parallel decision trees. Can you design a data-parallel decision
tree / boosted decision tree algorithm? The published results are
essentially sequential in the construction of the trees. One
suggestion would be to take the Random Forests algorithm,
reinterpret it as a Pitman estimator sampling from the version
space of consistent trees, and then extend it to other objectives
