2017/07/07 14:30-15:20 An approach for big data variable selection and classification

Title:An approach for big data variable selection and classification

Speaker:Professor Shaw-Hwa Lo (Columbia University)

Time:2017/7/7 (Fri.) 14:30-15:20

Location:Room 427, Assembly Building, NCTU


Current practice in prediction problems generally includes using a significance-based criterion to evaluate variables for a chosen model, and evaluating variables and models simultaneously for prediction using cross-validation or independent test data. Our recent work showed that significant variables are not necessarily predictive, and that strong predictors may not appear statistically significant at all. This leaves an important question: if not through a guideline of statistical significance, how can we find highly predictive variables? In response, we suggest a “Partition Retention (PR)” approach for handling general big data variable selection and classification (prediction) problems. PR alters standard statistical practice in big data analysis, switching from significance-based modeling to seeking variables with high predictivity, a novel parameter of interest. We introduce the I-score, a statistic that can select variable sets with very high prediction rates and is closely related to a very useful lower bound of the predictivity.
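
The abstract does not state the I-score formula, so the following is only an illustrative sketch. It assumes the normalized form commonly given in the partition retention literature: the discrete values of a candidate variable subset partition the n samples into cells, and the score is the sum over cells of n_j^2 (mean of Y in cell j minus overall mean of Y)^2, divided by the total sum of squares of Y. The function name `i_score` and the toy data are hypothetical and chosen for illustration; they are not taken from the talk.

```python
from collections import defaultdict


def i_score(rows, y, subset):
    """Illustrative I-score for the variables at the indices in `subset`.

    ASSUMPTION: uses the form sum_j n_j^2 * (ybar_j - ybar)^2 / sum_i (y_i - ybar)^2,
    where j runs over the partition cells induced by the chosen discrete variables.
    """
    n = len(y)
    y_bar = sum(y) / n
    total_ss = sum((v - y_bar) ** 2 for v in y)
    if total_ss == 0:
        return 0.0  # constant outcome: nothing to predict

    # Group the outcomes by the joint value of the selected variables.
    cells = defaultdict(list)
    for row, v in zip(rows, y):
        cells[tuple(row[k] for k in subset)].append(v)

    score = 0.0
    for values in cells.values():
        n_j = len(values)
        cell_mean = sum(values) / n_j
        score += n_j ** 2 * (cell_mean - y_bar) ** 2
    return score / total_ss


# Toy data (hypothetical): the outcome is the XOR of two binary variables,
# so neither variable is informative on its own but the pair is.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in rows]

single = i_score(rows, y, (0,))    # one variable alone
joint = i_score(rows, y, (0, 1))   # both variables together
```

On this toy data the single-variable score is zero while the joint score is strictly positive, which mirrors the abstract's point that variables showing no marginal signal can still be highly predictive as a set.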

The PR approach would be useful in diverse scientific applications: for example, in formulating predictions about diseases from high-dimensional data such as gene datasets; in the social sciences, for text prediction; and in predicting terrorism, civil war, elections and financial markets. We hope this opens up a new field of work focused on designing new statistics that measure predictivity.

Organizer:NCTU Big Data Research Center

Co-organizer:NCTU Institute of Statistics, NTHU Institute of Statistics