When is there “hidden structure in data” to be discovered?

Apr 4 '13

Michael Collins sent along the following announcement for a talk:

Fast learning algorithms for discovering the hidden structure in data

Daniel Hsu, Microsoft Research

11am, Wednesday April 10th, Interschool lab, 7th floor CEPSR, Columbia University

A major challenge in machine learning is to reliably and automatically discover hidden structure in data with minimal human intervention. For instance, one may be interested in understanding the stratification of a population into subgroups, the thematic make-up of a collection of documents, or the dynamical process governing a complex time series. Many of the core statistical estimation problems for these applications are, in general, provably intractable for both computational and statistical reasons; and therefore progress is made by shifting the focus to realistic instances that rule out the intractable cases. In this talk, I’ll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, Hidden Markov models, Latent Dirichlet Allocation, Probabilistic Context Free Grammars, and several more. The key idea is to exploit the structure of low-order correlations that is present in high-dimensional data. The scope of the new approach extends beyond the purview of previous algorithms; and it leads to both new theoretical guarantees for unsupervised machine learning, as well as fast and practical algorithms for large-scale data analysis.

I won’t have a chance to attend, but I had the following question or thought (beyond the question of why he capitalized the words Hidden, Latent, Allocation, Probabilistic, Context, Free, and Grammars, but not mixture, models, algorithm, or data analysis). My thought had to do with the potential application of these ideas to political science. The key seems to be in the title: “the hidden structure in data.”

Some problems don’t have much hidden structure: for example, if you try to model vote choice or public opinion given individual and group-level variables, you get pretty much what you’d expect: liberals mostly vote for Democrats, conservatives mostly vote for Republicans, demographic variables such as age, education, marital status, income, and religion predict in the way you might imagine, and so forth. And these predictors are related to each other in various expected ways. So I don’t think there’s any structure of low-order correlations to be found. There are some subtleties in the data (for example, the notorious interaction between individual and state income in predicting vote), but it’s hard to imagine that the methods described above would be good tools for studying these patterns.

But other data in political science are indeed characterized by low-dimensional structure. The most prominent example is legislative roll-call voting, but there should be lots of other examples. For instance, anything having to do with international relations or trade will be structured by the patterns of connections between countries. Or personal networks, for example a study of covert terrorist organizations using external data, or a model of hidden links between financial organizations.

I wonder if the methods discussed by Hsu in his talk come with guidelines for when they can most effectively be applied. Or maybe he happens to focus on applications with hidden structure?