Indiana University Bloomington

COLLOQUIUM


UPCOMING COLLOQUIA
This series is open to all Indiana University faculty and students interested in statistical research.
Time and place: Selected Mondays from 3pm to 4pm in the Indiana Memorial Union.

Please join us for our next talk.
A Reexamination of Fama-French Regressions Using High Frequency Panels

November 30, 2009 - IMU Persimmon Room 3:00 - 4:00 PM

Yoosoon Chang - Indiana University Department of Economics

This paper develops a new framework and tools to reexamine Fama-French regressions. For Fama-French portfolios, we consider a continuous-time factor model with a specific error component structure implied by the underlying asset pricing theory. The model is then analyzed as a continuous-time multivariate regression with a general martingale differential error. Our framework is broad enough to accommodate some of the important common features of the errors in this type of regression. In particular, we allow for time-varying or stochastic volatilities that are persistent and have strong leverage effects. It is well known that such nonstandard features would make the standard inferential procedure invalid. We overcome this difficulty by using samples collected at random intervals, set by a clock running inversely to the market volatility, instead of samples collected at fixed intervals such as monthly or yearly. Under our sampling scheme, Fama-French regressions may simply be regarded as the classical regressions having normal errors with variance given by the averaged quadratic variation of the martingale differential error. Various tests, which have been used to evaluate Fama-French factors, are extended and evaluated in the paper.
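
The abstract does not spell out the sampling rule, so the following is only an illustrative sketch of one way a volatility-driven clock could be realized. It assumes that squared high-frequency returns are used as a proxy for quadratic variation and that an observation is taken each time the variance accumulated since the last sample reaches a threshold; the function name `volatility_time_sample` and the threshold `delta` are hypothetical and do not come from the paper.

```python
def volatility_time_sample(log_prices, delta):
    """Pick sampling indices on a volatility-driven clock (illustrative only).

    A new sample is taken each time the sum of squared returns accumulated
    since the previous sample reaches `delta`. High volatility therefore
    leads to short sampling intervals, and low volatility to long ones.
    """
    sampled = [0]          # always keep the first observation
    accumulated = 0.0      # realized variance since the last sample
    for i in range(1, len(log_prices)):
        r = log_prices[i] - log_prices[i - 1]
        accumulated += r * r
        if accumulated >= delta:
            sampled.append(i)
            accumulated = 0.0
    return sampled


# Toy series: the clock ticks only when enough price variation has built up.
indices = volatility_time_sample([0, 1, 1, 3, 3, 3, 4], delta=1)
```

Under this assumed rule, each sampled interval carries roughly the same amount of accumulated variation, which is the intuition behind treating the resulting regression errors as approximately homoskedastic.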



December 07, 2009 - IMU Persimmon Room 3:00 - 4:00 PM

Chen Yu - Indiana University Department of Psychological and Brain Sciences



 


PAST COLLOQUIA


A Modified Box-Pierce Test for Conditional Goodness-of-Fit
November 16, 2009 -


Zaichao Du - Indiana University Department of Economics

In this paper, we propose a modified Box-Pierce test for conditional goodness-of-fit. Our method is based on the fact that, under the correct specification of the conditional distribution, the generalized errors obtained after the probability integral transformation are iid U[0,1]. Our test explicitly takes into account the parameter estimation effect; as a result, it has a convenient standard chi-square limit distribution. Our test is applicable to a wide class of models, including but not limited to ARMA-GARCH models, the Hansen (1994) skewed t model, and autoregressive conditional duration models. A simulation study shows that our test has satisfactory size and power performance. An empirical application to the Hang Seng Index data highlights the merits of the proposed test.
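
As a rough illustration of the two ingredients named above, the sketch below computes generalized errors by the probability integral transformation and then the classical Box-Pierce statistic on them. It is not the modified test of the talk: it assumes a Gaussian conditional distribution with known mean and standard deviation and applies no correction for parameter estimation. The function names are hypothetical.

```python
from statistics import NormalDist


def pit_normal(values, mean, std):
    """Probability integral transform under an assumed normal distribution."""
    dist = NormalDist(mean, std)
    return [dist.cdf(v) for v in values]


def box_pierce(u, max_lag):
    """Classical Box-Pierce statistic Q = n * sum of squared autocorrelations.

    `u` is a sequence (for example, PIT-transformed errors) and `max_lag`
    is the number of autocorrelation lags included in the sum.
    """
    n = len(u)
    mean = sum(u) / n
    centered = [x - mean for x in u]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, max_lag + 1):
        num = sum(centered[t] * centered[t - k] for t in range(k, n))
        r_k = num / denom          # lag-k sample autocorrelation
        q += r_k * r_k
    return n * q


# Example: transform observations, then measure serial dependence.
u = pit_normal([0.3, -1.2, 0.8, 0.1, -0.4, 1.5], mean=0.0, std=1.0)
q_stat = box_pierce(u, max_lag=2)
```

If the model is correctly specified and its parameters are known, the transformed values should behave like iid U[0,1] draws, so large values of the statistic point to misspecification.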

Data Spectroscopy: Eigenspace of Convolution Operators and Clustering
October 26, 2009 -


Tao Shi - Ohio State University

In this talk, we focus on obtaining clustering information in a distribution when iid data are given. First, we develop theoretical results for understanding and using clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function (with a sufficiently fast tail decay). We study which eigenvectors should be used and when the clustering information for the distribution can be recovered from the data. Second, we use heuristics from these analyses to design the Data Spectroscopic clustering (DaSpec) algorithm. Our findings not only extend and go beyond the intuitions underlying existing spectral techniques (e.g. spectral clustering and Kernel Principal Components Analysis), but also provide insights about their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the promise of our proposed data spectroscopy clustering algorithm relative to k-means and one spectral method. In particular, DaSpec seems to be able to handle unbalanced groups and recover clusters of different shapes better than competing methods. This is joint work with Prof. Mikhail Belkin (Ohio State University) and Prof. Bin Yu (University of California, Berkeley).

Modern Probability Theory and Statistical Problems in Real World Networks
October 12, 2009 -


Shankar Bhamidi - University of North Carolina at Chapel Hill

The last few years have seen an explosion in the amount of data on many real world networks. This has resulted in an interdisciplinary effort in formulating models to understand the data. On the theory side, we shall look at how powerful techniques in modern probability theory can be used in this context via the following three problems:

1. Reconstruction of routing trees: In a number of problems that arise from trying to discover the underlying structure of the Internet, it is often impossible to take direct measurements at the routers. We shall describe progress in trying to reconstruct the "Multicast" tree exactly using only "end-to-end" measurements. Surprisingly, using deep facts from Phylogenetics, we show that this can be done using very few samples.

2. MCMC simulation of exponential random graphs: Exponential random graphs are one of the most used models in social network theory. The basic idea is the following: In social networks we see more triangles, cliques, etc. than we would expect in a random graph, basically because if A is a friend of B and A is a friend of C then it is quite likely that B and C are friends. One way to model such a phenomenon is to attach, for every graph G, a Hamiltonian given by, say,

H(G) = β#E(G) + γ#T(G)

where E(G) and T(G) are the number of edges and triangles respectively and then looking at the Gibbs distribution induced by this Hamiltonian. Simulating from these models is of paramount interest.

Using the modern-day theory of Markov chains, we find, in the ferromagnetic setup, exactly when one can simulate from this model efficiently and when it would take exponentially long to simulate from this model.

3. Spectral distribution of adjacency matrices: How good is the spectral distribution of the adjacency matrix of a network in estimating key features of the network? Given two samples of networks, can we always tell them apart just by looking at their spectral distribution? These sorts of questions lead us into analyzing the asymptotics of the spectral distribution of a number of random tree models.
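
To make the Hamiltonian in problem 2 concrete, here is a minimal sketch that counts edges and triangles of a small undirected graph and evaluates H(G) = β#E(G) + γ#T(G). The unnormalized Gibbs weight is taken to be exp(H(G)); that sign convention, the brute-force triangle count, and all function names are assumptions made for illustration, not details given in the abstract.

```python
import math
from itertools import combinations


def count_edges_and_triangles(n, edges):
    """Count edges and triangles of an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n_edges = sum(len(s) for s in adj.values()) // 2
    # Brute force over vertex triples; fine for small illustrative graphs.
    n_triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    return n_edges, n_triangles


def hamiltonian(n, edges, beta, gamma):
    """H(G) = beta * #E(G) + gamma * #T(G)."""
    e, t = count_edges_and_triangles(n, edges)
    return beta * e + gamma * t


def gibbs_weight(n, edges, beta, gamma):
    """Unnormalized Gibbs weight, assuming p(G) is proportional to exp(H(G))."""
    return math.exp(hamiltonian(n, edges, beta, gamma))


triangle = [(0, 1), (1, 2), (0, 2)]
h = hamiltonian(3, triangle, beta=0.5, gamma=2.0)
```

With γ > 0, graphs containing more triangles receive more weight, which is how the model encodes the "friends of friends" tendency described above. The normalizing constant sums over all graphs on n vertices, which is why simulation methods such as MCMC are used instead of direct evaluation.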

Architecting Scientific Data Systems in the 21st Century
September 28, 2009 -


Daniel Crichton - NASA Jet Propulsion Laboratory

Modern science research is requiring collaboration amongst geographically distributed scientists and their data assets. Because of this, a new era of scientific discovery exists in which science data is shared and validated across research institutions. More than ever, computing infrastructures that span these multiple institutions must work together to support collaborative research. At NASA’s Jet Propulsion Laboratory, we have been working in multiple disciplines to support the movement towards highly distributed scientific data systems that span, end-to-end, the data pipeline from instruments all the way to analysis. JPL developed a software technology called the Object Oriented Data Technology (OODT) framework that provides building blocks for construction of distributed data systems, addressing the challenges of the distributed, data-intensive domain. OODT has been successfully infused into planetary science, earth science and cancer research programs. This talk will explore the commonalities of developing data system infrastructures in these disciplines as well as our experiences, results and plans for data systems in the future.

Wiki-World: Implicit Translation via Graph Embeddings
April 06, 2009 -


David Marchette - Naval Surface Warfare Center Dahlgren Division

Implicit translation is the association of documents in different languages which are on the same topic, in the absence of a translation dictionary. The multilingual Wikipediae will be used as a test-bed to explore some techniques based on the embeddings of the Wikipedia graphs. I will discuss several standard graph embeddings, and a novel random graph embedding. These are applied to a subset of the English and French Wikipediae, showing that using the graph information alone is sufficient to obtain good results. The incorporation of the language information via word-count histograms or related bag-of-words approaches, in conjunction with the graph information, will be discussed briefly.

Flury Lecture - A Statistician Looks At Uncertainty
March 03, 2009 -


David Scott - Noah Harding Professor of Statistics, Rice University

Modern science relies on ever more complex models to understand data. Presenting the confidence of model predictions is a grand challenge. Faced with potentially hundreds or thousands of parameters, scientists often perform sensitivity analyses in order to assess the robustness of model predictions.

Such one-at-a-time calculations are useful but limited. Visualization techniques can provide a fuller picture, but the availability of immersive technologies is still expensive and not commonplace. We examine some simple data and discuss the presentation of uncertainty. Avenues for research are described.


Efficiency of the Maximum Partial Likelihood Estimator for Sampling Designs in Cox Regression Model
February 23, 2009 -


Haimeng Zhang - Mississippi State University

The Cox hazard regression model is a popular method used in epidemiological research to quantify the effects of prognostic factors on survival for a cohort of individuals followed over time. In practice, it is often difficult or expensive to collect complete data when dealing with large cohorts. Therefore, a number of practical sampling designs have been put forward. However, it is not always clear that the estimators in those designs use the given sampled data in the most efficient manner. In this presentation, I will discuss the asymptotic efficiency of the estimators from two popular sampling designs: case-cohort and nested case-control sampling. In addition, by comparing the theoretical lower bound with the limiting distribution of the estimators, I will indicate in what instances the estimators achieve the lower bound, and what situations make for large efficiency losses.

Google, Data, and Statisticians
February 09, 2009 -


Jim Koehler - Google

I'll provide a glimpse into the hidden life of statisticians inside of Google. I'll start by providing a general overview of Google's advertising system and the variety of roles for statisticians. Then I'll describe some specific business problems and how statistical methods contribute to their solutions. Finally, I'll introduce Google's efforts to partner with universities through the Google Online Marketing Challenge (student competition) and the Google University Research Awards.

Modeling, Simulation, and Forecasting of Crime Patterns Using Self-Exciting Point Processes
December 01, 2008 -


George Mohler - University of California Los Angeles

Self-exciting spatial point process methods are well established in fields such as seismology, where the occurrence of an event increases the likelihood of another event nearby in space and time (in the case of earthquakes, aftershocks often follow a large event). This self-exciting behavior gives rise to particular types of data clustering and, surprisingly, such clustering is also observed in crime data (as it turns out, burglars will often return to the same house, or a house nearby, shortly after a prior offense and commit another burglary).

In this talk we show how self-exciting point processes can be used for the purposes of crime pattern modeling, simulation, and forecasting. We will first discuss the application of standard point process models, which involves background intensity estimation, maximum likelihood estimation of parameters, and model evaluation using tests for clustering. Next we will show how behavioral dynamics, present in recent agent-based models of crime, can be incorporated into the point process framework. For this purpose we use state-dependent stochastic differential equations, which can be viewed as generalizations of kernel-based models. We conclude by discussing several practical applications.
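
The self-exciting mechanism can be illustrated with a purely temporal conditional intensity in which each past event temporarily raises the rate of new events. The exponential decay kernel, the parameter names, and the absence of a spatial component are all simplifying assumptions made here for illustration; the talk's models are spatial and are not specified in the abstract.

```python
import math


def conditional_intensity(t, past_events, background, alpha, omega):
    """Intensity at time t: background rate plus decaying boosts from past events.

    lambda(t) = background + sum over events s < t of
                alpha * omega * exp(-omega * (t - s))

    `alpha` is the expected number of follow-on events triggered by one event
    and `omega` controls how quickly the elevated risk decays.
    """
    excitation = sum(
        alpha * omega * math.exp(-omega * (t - s))
        for s in past_events
        if s < t
    )
    return background + excitation


# Risk shortly after an event is elevated and then relaxes to the background.
just_after = conditional_intensity(1.1, [1.0], background=0.5, alpha=0.8, omega=2.0)
much_later = conditional_intensity(9.0, [1.0], background=0.5, alpha=0.8, omega=2.0)
```

In the burglary example from the abstract, an event at a house would play the role of `s`, and the elevated intensity shortly afterwards corresponds to the increased chance of a repeat offense.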

Monitoring the Golden Gate Bridge Using Wireless Sensor Networks
November 17, 2008 -


Guilherme Rocha - Indiana University Department of Statistics

A 51-node Wireless Sensor Network has been installed on the Golden Gate Bridge for Structural Health Monitoring. There is a mismatch between the rate at which the data are collected at each node (~4 Kbps, kilo-bytes/second) and the rate at which they can be transmitted (~0.5 Kbps). The ultimate goal is to develop a data reduction scheme so the WSN can perform real time monitoring of the dynamic properties of the bridge. For the temperature data (~0.80 Kbps/sensor), lossless run length coding achieves a significant reduction in the data rate (~0.04 Kbps/ sensor). For the acceleration data, our strategy is to construct a restricted parametric model for the bridge and continuously adjust it as data become available. The restrictions applied to the model reflect both physical considerations and communication constraints. We report the results of such strategies on simulated data sets. Joint work with David Culler, James Demmel, Gregory Fenves, Sukum Kim, Shamim Pakzad, and Bin Yu.
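
Run length coding suits slowly varying signals such as temperature because long stretches of identical readings collapse into a single (value, count) pair; the reduction quoted above, from about 0.80 to about 0.04 Kbps per sensor, is a factor of roughly 20. The sketch below is a generic lossless run-length encoder and decoder; it makes no claim about the encoding format actually deployed on the bridge, and the sample readings are made up.

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([v, 1])    # start a new run
    return [(v, count) for v, count in runs]


def rle_decode(runs):
    """Invert rle_encode exactly, recovering the original sequence."""
    out = []
    for v, count in runs:
        out.extend([v] * count)
    return out


readings = [21, 21, 21, 22, 22, 21]   # hypothetical quantized temperatures
encoded = rle_encode(readings)
```

Because decoding reproduces the input exactly, no information is lost; the savings depend entirely on how often consecutive readings repeat.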

Efficient Semiparametric Detection of Changes in Trend
November 03, 2008 -


Chuan Goh - University of Toronto

     This paper proposes a test for the correct specification of a dynamic time-series model that is taken to be stationary about a deterministic linear trend function with no more than a finite number of discontinuities in the vector of trend coefficients. The test avoids the consideration of explicit alternatives to the null of trend stability. The proposal also does not involve the detailed modelling of the data-generating process of the stochastic component, which is simply assumed to satisfy a certain strong invariance principle for weakly dependent processes. As such, the resulting inference procedure is effectively an omnibus specification test for segmented linear trend stationarity. The test is of Wald-type, and is based on an asymptotically linear estimator of the vector of total-variation norms of the trend parameters whose influence function coincides with the efficient influence function.

     Simulations illustrate the utility of this procedure to detect discrete breaks or continuous variation in the trend parameter as well as alternatives where the trend coefficients change randomly each period. This paper also includes an application examining the adequacy of a linear trend-stationary specification with infrequent trend breaks for the historical evolution of U.S. real output.

Statistical Analysis of cDNA Microarray Data: Issues of Process Manufacturing, Data Transformations, and Background Noise
October 20, 2008 -


Karen Kafadar - Indiana University Department of Statistics

Microarray technology has made available large data sets that can provide information on gene expression when cells are subjected to various treatments. Before proceeding with a formal statistical analysis, many biological and procedural aspects should be considered. These aspects may guide the analysis and subsequent statistical inference. Several of these issues are discussed in connection with the analysis of oligonucleotide and cDNA microarray experiments. The particular focus in this article is on effects caused by the cDNA slide manufacturing process, appropriate transformations of the data, and on adjustments for background. A prescription for the analysis of microarray data is proposed and demonstrated using data from a cDNA experiment comparing the genetic expressions in two mouse cell lines; a candidate set of genes is identified for further study. The prescription may be modified for oligonucleotide microarray data.

Nonparametric Estimation of Variogram and its Spectrum
October 06, 2008 -


Chunfeng Huang - Indiana University Department of Statistics

In the study of isotropic intrinsically stationary spatial processes, a new nonparametric variogram estimator is proposed through its spectral representation. The spectrum estimation is formulated in terms of solving a regularized inverse problem. A numerical implementation is presented through quadratic programming. We demonstrate our method in a simulation study and a dataset of temperature changes over America.

On Riesz And Wishart Distributions Associated With Decomposable Undirected Graphs
September 15, 2008 -


Steen Andersson - Indiana University Department of Statistics

Classical Wishart distributions on the open convex cones of positive definite matrices and their fundamental features are extended to generalized Riesz and Wishart distributions associated with decomposable undirected graphs using the basic theory of exponential families. The families of these distributions are parameterized by their expectations/natural parameter and multivariate shape parameter and have a non-trivial overlap with the generalized Wishart distributions defined in Andersson and Wojnar (2004a,b). This work also extends the Wishart distributions of type I in Letac and Massam (2007) and, more importantly, presents an alternative point of view on the latter paper.

A variable weighting and selection procedure for K-means cluster analysis
April 28, 2008 -


Douglas L. Steinley, University of Missouri-Columbia

A variance-to-range ratio variable weighting procedure is proposed. The method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the weighting technique. The performance of these procedures is compared to existing methods in the literature.

Birthweight Distribution and Infant Mortality: Thinking Outside the Curve
April 24, 2008 -


Professor Richard Charnigo - University of Kentucky

Greater epidemiologic understanding of the relationships among fetal-infant mortality and its prognostic factors, including birthweight, could have vast public health implications. A key step toward that understanding is a realistic and tractable framework for analyzing birthweight distributions and heterogeneity in fetal-infant mortality. We propose describing a birthweight distribution using a normal mixture model in which the number of components is determined from the data, then estimating birthweight-specific mortality curves within each component of the normal mixture. We emphasize both methodological issues (e.g., How should the number of components be determined?) and interpretive issues (e.g., What do the components represent?). Data from the National Center for Health Statistics Public-Use Perinatal Mortality Data Files are used to compare our analytic framework to existing frameworks as well as to assess the reproducibility across repeated sampling of results obtained through our framework. (This talk is based on work with Lorie Wayne Chesnut, Tony LoBianco, and Russell S. Kirby.)

On the Frequency Polygon Estimator for Random Fields
April 07, 2008 -


Michel Carbon - Rennes University, France

The purpose of this talk is to investigate the Frequency Polygon as a density estimator for stationary random fields indexed by multidimensional lattice points in space. Optimal binwidths which asymptotically minimize integrated mean square errors (IMSE) are derived. Under weak conditions, frequency polygons achieve the optimal uniform rate of convergence. Rates of the a.s. convergence are given too. Finally, asymptotic normality of the frequency polygon estimator is established.
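
For readers unfamiliar with the estimator, a frequency polygon is obtained by linearly interpolating histogram heights between adjacent bin midpoints. The sketch below shows this for one-dimensional data with a user-supplied bin width; it does not address the random-field setting or the optimal bin width derived in the talk, and the function name and arguments are hypothetical.

```python
import math


def frequency_polygon(data, origin, binwidth):
    """Return a density estimate f(x) built from a histogram of `data`.

    Bin k covers [origin + k*binwidth, origin + (k+1)*binwidth). The estimate
    interpolates linearly between the histogram heights at bin midpoints.
    """
    n = len(data)
    counts = {}
    for x in data:
        k = math.floor((x - origin) / binwidth)
        counts[k] = counts.get(k, 0) + 1

    def height(k):
        # Histogram density height of bin k (zero for empty bins).
        return counts.get(k, 0) / (n * binwidth)

    def f(x):
        # Position of x in units of bin widths, measured from bin 0's midpoint.
        s = (x - origin) / binwidth - 0.5
        k = math.floor(s)
        w = s - k                  # interpolation weight in [0, 1)
        return (1 - w) * height(k) + w * height(k + 1)

    return f


f = frequency_polygon([0.5, 0.5, 1.5], origin=0.0, binwidth=1.0)
```

At a bin midpoint the estimate equals the histogram height, and between midpoints it varies linearly, which removes the jumps of the histogram while keeping the same simple binning step.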

Flury Lecture - Current and Future Frontiers in Statistics
March 27, 2008 - RH 100 10am

Professor Peter Hall, University of Melbourne

The availability of powerful computing equipment has had a dramatic impact on statistical methods and thinking, changing forever the way data are analysed. New data types, larger quantities of data, and new classes of research problem are all motivating new statistical methods. We shall give examples of each of these issues, and discuss the current and future directions of frontier problems in statistics.


Predictive Regression Models for Nonstationary Economic and Financial Data
March 20, 2008 -


Professor Zongwu Cai - University of North Carolina at Charlotte and Wang Yanan Institute for Studies in Economics, Xiamen University, China

Motivated by forecasting the inflation rate through nonstationary variables and efficient tests of stock return predictability as well as forecasts of the equity premium, this talk will focus on how to use nonparametric or semiparametric regression techniques to analyze nonstationary time series data. Development of a nonparametric approach to estimate the functionals will be discussed, as well as how the consistency and asymptotic normality of the proposed estimators are obtained. The asymptotic results have shown that the asymptotic bias is the same for all estimators of functionals, but that the convergence rates are totally different for stationary and nonstationary covariates. These findings seem innovative in the literature.

The Additive-Interactive Nonlinear Volatility Model, Its Estimation And Some Testing Issues
March 03, 2008 -


Michael Levine - Purdue University

   We consider a new separable nonparametric volatility model that allows for “interactions” in both the mean and conditional variance (volatility) functions. It can be concisely described as an additive-interactive nonlinear ARCH model. We propose this model as a possible alternative to the generalized additive nonlinear ARCH (GANARCH) model of Kim and Linton (2004), with which it shares a common origin. Unlike the GANARCH model, it does not assume a known link function but instead includes second-order interaction terms in both the mean and variance functions. This ensures a much more data-driven model compared to the GANARCH model of Kim and Linton (2004), since we do not assume that anything is known about the data distribution. This is very beneficial since, in practice, the data distribution has to be selected based on exploratory data analysis, which is very difficult for multivariate data. Thus, the proposed model is much more flexible compared to GANARCH.
   Motivated by the local instrumental variable estimation method (LIVE), also introduced in Kim and Linton (2004), we propose instrumental variable-based estimators of the components of the mean and volatility functions. The estimators are shown to be consistent and asymptotically normal. Explicit expressions for asymptotic means and variances of these estimators are also obtained. Several simulation experiments are conducted that show a very good performance of our algorithm for moderate sample sizes. Finally, the method is applied to the real data set of currency exchange rates where it leads to some interesting conclusions.
   Historically, multiple functional component testing in nonparametric models has been a fairly difficult problem. We introduce a novel F-type approach to testing the significance of the two-way interactive terms in the mean function based on the unbalanced design ANOVA with unequal variances. Simulation studies show that the method performs very well for sample sizes of about 5000, which are easily available in financial applications.

Comparison of Two Approaches for Handling Missing Covariates in Logistic Regression
February 25, 2008 -


Professor Chao-Ying Joanne Peng - Indiana University

For the past 25 years, advances have been made in missing data methods. Most published work has focused on missing data in the dependent variable under various conditions. The present study sought to fill the void by comparing two approaches for handling missing data in categorical covariates in logistic regression. These two approaches were EM method of weights and multiple imputation.
Sample data were first drawn randomly from a population with known characteristics. Missing data on covariates were subsequently simulated under two conditions: missing completely at random and missing at random with different missing rates. A logistic regression model was fit to each sample using either the EM or the MI approach. The performance of these two approaches was assessed on four criteria: bias, efficiency, coverage, and rejection rate.
Results generally favored MI over EM. Practical issues such as implementation, inclusion of continuous covariates, and interactions between covariates were discussed.


Convergence Rates and Asymptotic Standard Errors for MCMC Algorithms for Bayesian Probit Regression
February 14, 2008 -


Vivekananda Roy - University of Florida

We study Markov chain Monte Carlo algorithms for exploring the intractable posterior density that results when a probit regression likelihood is combined with a flat prior on the regression coefficient. We prove that the data augmentation algorithm of Albert and Chib (1993) and the PX-DA algorithm of Liu and Wu (1999) both converge at a geometric rate, which ensures the existence of central limit theorems (CLTs) for ergodic averages under a second moment condition. While these two algorithms are essentially equivalent in terms of computational complexity, we show that the PX-DA algorithm is theoretically more efficient in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than that under Albert and Chib's algorithm. A simple, consistent estimator of the asymptotic variance in the CLT is constructed using regeneration. As an illustration, we apply our results to the lupus data from van Dyk and Meng (2001). In this particular example, the estimated asymptotic relative efficiency of the PX-DA algorithm with respect to Albert and Chib's algorithm is about 65, which demonstrates that huge gains in efficiency are possible by using the PX-DA algorithm.

Applications of Empirical Likelihood to Quantile Estimation and Longitudinal Data
February 04, 2008 -


Jien Chen - University of Georgia

As a non-parametric method, Empirical Likelihood (EL) has been attracting serious attention from researchers in statistics, econometrics, engineering and biostatistics. By defining the estimation equations in EL appropriately, we can extend EL to various data settings, especially those in which parametric likelihoods are absent. In this talk, I will provide two examples of such extensions: quantile estimation and longitudinal data analysis. Quantile estimation for discrete data analysis has not been well studied. For a given 0 < p < 1, the commonly used sample quantile may or may not be consistent for the pth quantile, depending on whether or not the underlying distribution has a plateau at the level of p. I propose an EL-based categorization procedure that not only helps determine the shape of the true distribution at level p, but also provides a way of formulating a new estimator that is consistent in any case. For non-Gaussian longitudinal data, generalized estimating equations (GEE) are a popular class of marginal models. While the GEE estimator is consistent and robust, it may suffer significant loss of efficiency if the working correlation structure is misspecified. I consider the use of EL to select working correlations for GEE models, for which parametric likelihoods are absent and quasi-likelihoods are difficult to construct.

A Multivariate Semiparametric Bayesian Spatial Modeling Framework for Hurricane Surface Wind Fields
January 31, 2008 -


Brian Reich - North Carolina State University

Storm surge, the onshore rush of sea water caused by the high winds and low pressure associated with a hurricane, can compound the effects of inland flooding caused by rainfall, leading to loss of property and loss of life for residents of coastal areas. Numerical ocean models are essential for creating storm surge forecasts for coastal areas. These models are driven primarily by the surface wind forcings. Currently, the gridded wind fields used by ocean models are specified by deterministic formulas that are based on the central pressure and location of the storm center. While these equations incorporate important physical knowledge about the structure of hurricane surface wind fields, they cannot always capture the asymmetric and dynamic nature of a hurricane. A new Bayesian multivariate spatial statistical modeling framework is introduced combining data with physical knowledge about the wind fields to improve the estimation of the wind vectors. Many spatial models assume the data follow a Gaussian distribution. However, this may be overly restrictive for wind field data, which often display erratic behavior, such as sudden changes in time or space. In this paper we develop a semiparametric multivariate spatial model for these data. Our model builds on the stick-breaking prior, which is frequently used in Bayesian modeling to capture uncertainty in the parametric form of an outcome. The stick-breaking prior is extended to the spatial setting by assigning each location a different, unknown distribution, and smoothing the distributions in space with a series of kernel functions. This semiparametric spatial model is shown to improve prediction compared to usual Bayesian Kriging methods for the wind field of Hurricane Ivan.
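
The basic, non-spatial stick-breaking construction can be sketched as follows: a unit-length stick is broken repeatedly, and each broken-off piece becomes a mixture weight. The Beta(1, concentration) break proportions and the finite truncation are assumptions borrowed from the standard Dirichlet process construction; the kernel-based spatial extension described in the talk is not reproduced here, and the function name is hypothetical.

```python
import random


def stick_breaking_weights(num_components, concentration, rng):
    """Draw mixture weights by truncated stick-breaking.

    Each step breaks off a Beta(1, concentration) fraction of the stick that
    remains. The final component receives whatever is left so that the
    weights sum to one.
    """
    weights = []
    remaining = 1.0
    for _ in range(num_components - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)      # truncation: leftover mass
    return weights


weights = stick_breaking_weights(10, concentration=2.0, rng=random.Random(0))
```

Smaller values of the concentration parameter put most of the mass on the first few components, while larger values spread it over many components.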

Designing Penalty Functions for Grouped and Hierarchical Selection
January 28, 2008 -


Guilherme Rocha - University of California, Berkeley

  Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task. Quasi-norms on model parameters are frequently used as a penalty. Classical examples are AIC and BIC where the L0 quasi-norm (model dimension) is used as a penalty.
  More recently, penalization by the L1-norm (lasso) has enjoyed a lot of attention. L1-penalized estimates are cheaper to compute (convex optimization) and lead to more stable model estimates than their L0 counterparts.
  In this talk, I will present the Composite Absolute Penalties (CAP) family of penalties. CAP penalties allow given grouping and hierarchical relationships between the predictors to be expressed. They are built by defining groups of variables and combining the properties of norm penalties at the across group and within group levels. Grouped selection occurs for non-overlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns.
  Under easily verifiable assumptions, CAP penalties are convex: an attractive property from a computational stand-point. Within this subfamily, unbiased estimates of the degrees of freedom (df) exist so the regularization parameter is selected without cross-validation.
  Simulation results show that CAP improves on the predictive performance of the LASSO for cases with p>>n and mis-specified groupings.
  This is joint work with Peng Zhao and Bin Yu.
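
To illustrate how a penalty can combine norms within and across groups, the sketch below evaluates a penalty of the form sum over groups of (within-group norm) raised to an overall power. This form is stated here as an assumption about how such composite penalties are typically written, not as a definition taken from the abstract; the function and argument names are hypothetical.

```python
import math


def cap_penalty(beta, groups, group_norms, overall_power):
    """Sum over groups of (L_g norm of the group's coefficients) ** overall_power.

    `groups` is a list of index lists (they may overlap) and `group_norms`
    gives the norm order g for each group (use math.inf for the max norm).
    """
    total = 0.0
    for indices, g in zip(groups, group_norms):
        if math.isinf(g):
            norm = max(abs(beta[j]) for j in indices)
        else:
            norm = sum(abs(beta[j]) ** g for j in indices) ** (1.0 / g)
        total += norm ** overall_power
    return total


beta = [3.0, 4.0, 0.0]
grouped = cap_penalty(beta, [[0, 1], [2]], [2.0, 2.0], overall_power=1.0)
lasso_like = cap_penalty(beta, [[0], [1], [2]], [1.0, 1.0, 1.0], overall_power=1.0)
```

With singleton groups and an overall power of one the penalty reduces to the L1 norm, and with non-overlapping groups and within-group L2 norms it penalizes whole groups together, which matches the grouped-selection behavior described in the abstract.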

On Large Margin Hierarchical Classification
January 14, 2008 -


Junhui Wang - Columbia University

Hierarchical classification is critical to knowledge and context management as well as knowledge exploration, as in gene function classification and discovery and document categorization. In hierarchical classification, an input is classified by a structured hierarchy. In such a situation, the central issue is how to effectively utilize inter-class relationships to improve on the generalization performance of flat classification, which ignores such dependency. In this talk, a novel large margin method based on constraints characterizing multi-path hierarchy is presented within the framework of regularization. In particular, I will discuss three aspects: (1) the idea and methodology development; (2) computational tools; (3) a statistical learning theory. Numerical examples will be provided to demonstrate the advantage of our proposed methodology against other existing competitors. An application to gene function prediction and discovery will be discussed.

Higher Order Semiparametric Inference Based on the Profile Likelihood
January 10, 2008 -

Guang Cheng - SAMSI

Semiparametric modeling is an excellent framework due to its flexibility to model some features parametrically without making assumptions on the other features. However, the infinite-dimensional nuisance parameter in the semiparametric models generally poses several challenges for making maximum likelihood inference for the parameter of interest at both theoretical and methodological levels. We will consider a series of profile likelihood based semiparametric inference procedures either based on numerical methods, i.e. K-step MLE, or through MCMC sampling, i.e. the Profile Sampler and the Penalized Profile Sampler. All the above profile likelihood based methods avoid evaluation of the infinite-dimensional operator and are easy to implement. Furthermore, we investigate their second order asymptotic behaviors, which are proven to be related to the convergence rate of the nuisance parameter and thus adjustable.

Visualization Databases for Lossless Analysis of Complex Data Sets
November 26, 2007 -


William Cleveland -- Purdue University

Large, complex data sets are ubiquitous, the standard now rather than the exception. They present challenging problems of analysis because of their size and the complexity of their data structures and patterns. One approach is to compute summary statistics at the outset to reduce the complexity, but this expedient risks losing important information in the data. The goal should be lossless analysis: analyze the data at a level of detail and comprehensiveness that does not sacrifice information.
Achieving lossless analysis of complex data today is immensely challenging. New fundamental approaches and methods are needed for each of the different areas that come into play in the analysis of the data: databases, data processing, data structures, statistical models and methods, machine learning algorithms, data visualization, computational algorithms, software environments, and hardware environments. In fact, it has never been harder to achieve lossless analysis because complexity has increased faster than our innovations in these areas.
Nothing serves lossless analysis better than data visualization, the only practical way to absorb large amounts of information in detail. But for today's complex sets we must visualize far larger amounts than in the past. We must be ready to accept large displays each covering tens or even hundreds of screensful (pages). For a single data set it is reasonable to have hundreds of such displays. These displays become a new database produced from the data that is queried and studied. For a display of 500 pages, we might query and study all or just a few of the pages depending on the task.
Producing, querying, and studying a visualization database needs new ideas. There are different modes of viewing the many pages and panels per page of a large display, from slow focused study to very rapid scans. We need creative interfaces to facilitate the different modes. We cannot fuss with very large displays, interacting with the micro-elements to get them right, because there is too much; instead there should be smart automation algorithms that get the large display right the first time. We must consider the physical screen space, its size and resolution, to make it work most effectively for the visual study. We need methods of display that result in pre-attentive visual formation of gestalts that show instantaneously the relevant patterns in the data. This necessitates, strangely, more displays, starting with broad brush looks to derivative displays whose redesigns show specific aspects of the broad brush more effectively. It also requires the study of visual perception.

Skewed Multivariate Distributions
November 12, 2007 -


Tonu Kollo -- Institute of Mathematical Statistics, University of Tartu

In the last ten years, remarkable development has occurred in the area of skewed multivariate distributions. The skew normal distribution was introduced in 1996 by Azzalini & Dalla Valle (Biometrika). Azzalini’s construction of the distribution was very fruitful and was later successfully applied to many other elliptical families of distributions. A random vector X is skew normally distributed with parameters α (a p-vector, the shape parameter) and Σ (a positive definite p×p matrix, the scale parameter) when the density function of X is the product of the density function of N(0,Σ) and the distribution function of the standard normal distribution, with the shape parameter appearing in the argument of the distribution function.
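For reference, the verbal definition above can be written out as a formula. This is a sketch in one common parametrization: the normalizing constant 2 and the linear argument α'x are assumptions taken from the standard Azzalini-type construction and are not stated in the abstract; φ_p(·; 0, Σ) denotes the N(0,Σ) density and Φ the standard normal distribution function.

```latex
f_X(x) \;=\; 2\,\phi_p(x;\,0,\Sigma)\,\Phi\!\left(\alpha^{\top} x\right),
\qquad x \in \mathbb{R}^p .
```

Under this form, setting α = 0 gives Φ(0) = 1/2, so the density reduces to that of N(0,Σ).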
In the talk, basic properties of the skew normal family will be discussed and other more often used families of skewed elliptical distributions will be examined (the multivariate skew t-distribution, for instance). With new families of distributions, new estimation and testing problems have arisen. Classical estimation methods often do not work: the maximum likelihood method can give wrong estimates, and the method of moments can be heavily biased.
Another type of skewed multivariate distribution is represented by the asymmetric Laplace distribution, which was carefully examined by T. Kozubowski in a series of papers at the end of the 1990s. In this case, we do not have an explicit expression for the density function, and estimation, testing and fitting problems have to be solved on the basis of the characteristic function. This distribution will also be considered in more detail.

Principal Component Analysis of Forward Interest Rates
November 05, 2007 -


Victor Goodman -- Indiana University

Forward interest rates are simultaneously measured using up-to-the-minute bond trading quotes. The time variation of these rates determines a high-dimensional covariance matrix that might be used to model bond yields within a country's government bond market.
Several PC analyses, with 1989-92 data in the U.K. market, 1887-94 data in the U.S. market, and 2001-05 data in the U.S. market, reveal a striking pattern involving the first three eigenvectors of each covariance matrix. In this talk I describe the pattern and make the (well-known) case for using a three-factor Gaussian model to describe bond trading.
It is difficult to implement models based on PC estimates for the first eigenvector. An initial attempt to produce a model in 1988 ended in failure since the model had a financial inconsistency. A more recent model behaves better; its covariance has the desired eigenvector and the model is arbitrage-free. Surprisingly, the new model appears when we condition prices not to collapse in the old model. This suggests that an issue of survivorship may arise even in no-default bond markets.

What is quantum probability theory, and how can it be used to analyze measurements in the social and behavioral sciences?
October 29, 2007 -


Jerome R. Busemeyer -- Indiana University

Social and behavioral scientists face some of the same measurement problems that forced physicists to abandon classical probability theory. Their measurements are often incompatible, and the first measurement may disturb a second measurement. Thus only partial information about a complex system can be obtained at any point in time. Combining partial information about a system into a coherent understanding of the entire system is the hallmark of quantum theory. Quantum theory provides a fundamentally different approach to logic, reasoning, and probabilistic inference. For example, quantum logic does not always follow the distributive axiom of Boolean logic; quantum probabilities do not always obey the Kolmogorov law of total probability; quantum reasoning does not always obey the principle of monotonic reasoning.
For this talk, I will present a tutorial on the basic assumptions of classical versus quantum probability theories. These basic assumptions will be examined, side by side, in a parallel and elementary manner. Classical theory will emerge as a possibly overly restrictive case of the more general quantum theory. The fundamental implications of these contrasting assumptions for measurement in the social and behavioral sciences will be examined.
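The claim that quantum probabilities need not obey the law of total probability can be illustrated with a minimal numerical sketch. The two-dimensional state and the measurement vectors below are hypothetical choices made only for illustration; they are not taken from the talk. Outcome probabilities are squared projections, and the state is assumed to collapse onto the observed outcome after the first measurement.

```python
import numpy as np

# Hypothetical two-dimensional example (all vectors are illustrative assumptions).
psi = np.array([1.0, 0.0])                   # initial state
b = np.array([0.0, 1.0])                     # outcome "B" of the second measurement
a_yes = np.array([1.0, 1.0]) / np.sqrt(2)    # outcome "A" of an incompatible first measurement
a_no = np.array([1.0, -1.0]) / np.sqrt(2)    # outcome "not A"

def prob(outcome, state):
    """Probability of an outcome: squared projection of the state onto it."""
    return float(np.dot(outcome, state) ** 2)

# Probability of B when B is measured directly on psi.
p_b_direct = prob(b, psi)

# Probability of B when A is measured first; the state collapses to
# a_yes or a_no, and B is then measured on the collapsed state.
p_b_via_a = prob(a_yes, psi) * prob(b, a_yes) + prob(a_no, psi) * prob(b, a_no)

print(p_b_direct, p_b_via_a)
```

In this sketch the direct probability of B is 0, while the total over the two paths through A is approximately 0.5, so the classical identity P(B) = P(A)P(B|A) + P(not A)P(B|not A) fails because the first measurement disturbs the second.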

Data-Driven Smooth Tests for the Martingale Difference Hypothesis
October 22, 2007 -


Juan Carlos Escanciano -- Indiana University

Abstract: A general method for testing the martingale difference hypothesis is proposed. The new tests are data-driven smooth tests based on the principal components of certain marked empirical processes that are asymptotically distribution-free, with critical values that are already tabulated. The smooth tests are shown to be optimal in a semiparametric sense discussed in the paper, and they are robust to conditional heteroscedasticity of unknown form. A simulation study shows that the data-driven smooth tests perform very well for a wide range of realistic alternatives and have more power than omnibus and other competing tests. Finally, two empirical examples highlight the merits of our approach.

Dimension Augmented Import Vector Machine (DAIVM): A new General Classifier System for Large p Small n problem, with Application in Bio-Informatics
October 01, 2007 -


Dr. Samiran Ghosh, IUPUI

Abstract: Support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have drawn much attention recently due to their robustness and generalization capability. All of these approaches construct a classifier from a training sample in a high dimensional space by using all available dimensions. SVM achieves huge data compression by selecting only a few observations lying on the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions is very large (large p), not all available dimensions necessarily carry equal information in the classification context. Selecting only the useful fraction of the available dimensions will result in huge data compression. In this paper we have come up with an algorithmic approach by means of which such an optimal set of dimensions can be selected. We have reversed and modified the solution proposed by Zhu and Hastie in the context of the Import Vector Machine (IVM), to select an optimal sub-model by using only a few observations. For the large p, small n domain (e.g., bioinformatics), our method compares different trans-dimensional models to come up with an optimal set of dimensions with which to build the final classifier. This not only reduces the computational burden but also makes the selection of biomarkers (each associated with a dimension) a much easier task.

Accuracy in Parameter Estimation for Standardized Effect Sizes
April 16, 2007 -


Ken Kelley, Indiana University

Abstract: In the behavioral, educational, and social sciences, there has been a major push to report effect sizes and their corresponding confidence intervals instead of or in addition to the results of null hypothesis significance tests. With the increased frequency of reporting confidence intervals, a serious problem has manifested: "embarrassingly large" confidence intervals (Cohen, 1994) and parameter estimates that may not accurately reflect their corresponding population values (regardless of whether or not the null hypothesis is rejected). Due to the arbitrary nature of many scales used in the behavioral, educational, and social sciences, the most widely reported effect sizes are standardized (e.g., the standardized mean, squared multiple correlation coefficient, coefficient of variation, etc.). After a discussion of confidence interval formation for standardized effect sizes, an approach to sample size planning that emphasizes accuracy in parameter estimation (AIPE) is discussed in the context of widely used standardized effect sizes; AIPE and power analysis results are also compared. One approach yields the necessary sample size so that the expected confidence interval width is sufficiently narrow. A modification allows a desired degree of assurance to be incorporated into the sample size planning procedure so that the probability of obtaining a confidence interval no wider than desired can be specified by the researcher (e.g., 99% assurance that the 95% confidence interval will be less than w units wide). It will be shown that the methods discussed can be easily implemented in the MBESS R package.


Local and Global Analytic Curve Estimation
April 09, 2007 -


Cidambi Srinivasan, University of Kentucky

Abstract: Several methods have been developed in Functional Data Analysis to estimate a mean response function, but most of these methods do not lend themselves to simultaneous estimation of the mean response and its derivatives. Being able to recover derivatives accurately is important in applications involving velocities and accelerations, for characterizing nanoparticles from scattering data, and for analyzing complex systems described by differential equations. This talk proposes a novel global estimator derived from a calculus of variations problem. The estimator is analytic and hence can be directly differentiated to estimate the derivatives of the mean response. In particular, the estimator and its derivatives converge uniformly to the mean response and (a finite but arbitrary number of) its derivatives on a compact interval. The theoretical properties, the finite sample refinements for practical implementation, and the empirical performance of the estimator will be discussed.

Statistical Failure Diagnosis in Software and Systems
March 27, 2007 -


Alice Zheng, Carnegie Mellon University

Abstract: As software and systems become increasingly complex, the task of debugging also becomes increasingly difficult. Manual diagnosis can require sifting through millions of lines of code and output logs. In addition, large systems contain many components, each complex on its own, and often interacting in unexpected ways. I present a case study illustrating how statistical machine learning algorithms, along with appropriate system instrumentation, can aid in failure diagnosis. I propose a statistical software debugging framework that collects information from past successes and failures via fine-grained instrumentation of the program and then analyzes this information to locate suspicious program predicates. I discuss the algorithmic challenges of the approach, and demonstrate a bi-clustering algorithm that is effective at simultaneously clustering failed runs and selecting useful predicates. Using this approach, it took a programmer 20 minutes to find a long-standing bug in a real-world software program which he had never seen before.

Combining Group-Based Trajectory Modeling and Propensity Score Matching in the Analysis of Non-experimental Longitudinal Data
March 26, 2007 -


Daniel Nagin, Carnegie Mellon

Abstract: A central theme of research on human development and psychopathology is whether a therapeutic intervention or a turning point event, such as a family break-up, alters the trajectory of the behavior under study. This talk describes an approach for using observational longitudinal data to make more confident causal inferences about the impact of such events on developmental trajectories. The method combines two distinct lines of research: work on the use of finite mixture modeling to analyze developmental trajectories and work on propensity score matching. The propensity scores are used to balance observed covariates, and the trajectory groups are used to control pretreatment measures of response. The trajectory groups also aid in identifying classes of subjects for which no good matches are available. The approach is demonstrated with an analysis of the impact of gang membership on violent delinquency based on data from a large longitudinal study conducted in Montréal.

Flury Lecture - Statistical Analysis of Bullet Lead Compositions as Forensic Evidence
March 08, 2007 - RH 100 4pm

Professor Karen Kafadar of the University of Colorado - Denver & Health Sciences Center

Since the 1960s, the FBI has performed Compositional Analysis of Bullet Lead (CABL), a forensic technique that compares the elemental composition of bullets found at a crime scene to that of bullets found in a suspect's possession. CABL has been used when no gun is recovered, or when bullets are too small or fragmented to compare striations on the casings with those on the gun barrel. The National Academy of Sciences formed a Committee charged with the assessment of CABL's scientific validity. The report, "Forensic Analysis: Weighing Bullet Lead Evidence" (National Research Council, 2004), included discussions on the effects of the manufacturing process on the validity of the comparisons, the precision and accuracy of the chemical measurement technique, and the statistical methodology used to compare two bullets and test for a "match".

This talk will focus on the statistical analysis: the FBI's methods of testing for a "match", the apparent false positive and false negative rates, the FBI's clustering algorithm ("chaining"), and the Committee's recommendations. Additional analyses of data later made available, and the use of forensic evidence in general, will also be discussed.

Some important issues in the development of space-time correlation models
February 12, 2007 -


Chunsheng Ma, Wichita State University; SAMSI

Abstract: The world is dynamic at many scales in space and time, and the space and time interaction is prevalent in almost every field in the behavioral, social, environmental, informational, and geophysical sciences. Whenever possible and available, a rational approach for modeling spatio-temporal data would start from a theory or mechanism that explains the underlying physical knowledge. In reality, however, no obvious mechanism may exist, and frequently it is such a theory that needs to be developed from observational or experimental study. For this purpose, statistical techniques are often very important tools, and deterministic and stochastic models to demonstrate spatio-temporal mechanisms are prominent among these. Two commonly used tools to describe the space-time interaction and dependence are the covariance function and variogram. In this talk we will briefly survey some recent advances on how to construct spatio-temporal variograms and covariance functions, and discuss several issues in the development of space-time covariance models, which include separability, stationarity, smoothness, long-range dependence, the Gaussian assumption, the range of space-time correlation, and aliasing and embedding problems.

Empirical Likelihood Estimation for Missing Response Data
February 05, 2007 -


Biao Zhang, University of Toledo

Abstract: Missing data frequently occurs in health and social science studies. It is well known that an analysis based only on complete data is generally biased unless the missing-data mechanism is completely at random. In this talk, we discuss an empirical likelihood method for handling missing response data when the missing-data mechanism is covariate-dependent. In the case of estimation of mean response, the empirical likelihood method makes effective use of auxiliary covariate information under a working regression model and a working propensity model. The empirical likelihood-based estimator of the mean response is doubly robust, i.e., it is asymptotically consistent if either the underlying regression model or the underlying propensity model is correctly specified. Moreover, the estimator is asymptotically efficient when both the regression and propensity models are correctly specified. As an application, we consider estimation of average causal treatment effects in observational studies by viewing the causal inference as a two-sample missing data problem. Some numerical results are also presented.
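The double robustness property described above can be illustrated with a small simulation. The sketch below does not implement the empirical likelihood estimator of the talk; it substitutes the simpler augmented inverse probability weighted (AIPW) estimator, which also combines a working regression model with a working propensity model. The data-generating model, the function names (`simulate`, `fit_logistic`, `aipw_mean`), and all parameter values are hypothetical.

```python
import numpy as np

def simulate(n, seed=0):
    """Simulated data: response y is missing with covariate-dependent probability."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)          # true mean response is 1.0
    p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))        # P(observed | x)
    r = (rng.random(n) < p_obs).astype(float)       # 1 = observed, 0 = missing
    y = np.where(r == 1, y, np.nan)                 # hide the missing responses
    return x, y, r

def fit_logistic(X, r, n_iter=25):
    """Working propensity model: logistic regression fitted by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (r - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def aipw_mean(x, y, r):
    """AIPW estimate of the mean response (a stand-in, not the talk's estimator)."""
    X = np.column_stack([np.ones_like(x), x])
    obs = r == 1
    # Working regression model: least squares on the complete cases.
    coef, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    m = X @ coef
    # Working propensity model.
    pi = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, r)))
    y_filled = np.where(obs, y, 0.0)                # missing y never enters (r = 0)
    return float(np.mean(r * y_filled / pi - (r - pi) / pi * m))

if __name__ == "__main__":
    x, y, r = simulate(5000, seed=0)
    print("complete-case mean:", y[r == 1].mean())
    print("AIPW estimate:     ", aipw_mean(x, y, r))
```

In this simulated setting the complete-case mean is biased upward because units with larger covariate values are more likely to be observed, while the AIPW estimate should land close to the true mean of 1.0.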


Spline-Backfitted Kernel Smoothing of Additive Models in Time Series
January 25, 2007 -


Lily Wang, Michigan State University

Abstract: Application of non- and semiparametric regression techniques to high dimensional time series data has been hampered by the lack of effective tools to address the "curse of dimensionality." Under rather weak conditions, we propose a spline-backfitted kernel estimator of the component functions for nonlinear additive time series data that is both computationally expedient, so that it is usable for analyzing very high dimensional time series, and theoretically reliable, so that inference can be made on the component functions with confidence. Simulation experiments have provided strong evidence that corroborates the asymptotic theory. Finally, the estimation procedure is illustrated with US unemployment rate data.

Functional Genomics of Quantitative Traits: Expression Level Polymorphisms of QTLs Affecting Disease Resistance Pathways in Arabidopsis (eQTL)
January 22, 2007 -


J. Rebecca Doerge, Purdue University

Abstract: There is increasing interest in understanding the molecular basis of complex traits. Initially, the genetic dissection of quantitative traits involved measurements of gross phenotypes. Most recently, the underlying mechanisms of inheritance have been studied through various approaches that are supported by modern technological and methodological advances, namely quantitative trait locus/loci (QTL) analysis and mutant analysis in genetics; genome sequencing and gene expression analysis in genomics; and protein structure analysis and protein assay in proteomics. Since each technology and approach focuses on specific pieces of the larger, poorly understood systems biology, the challenge is to integrate these different types of information to elucidate the genetic architecture of complex traits. To address one of these challenges we have combined QTL analysis with microarray analysis to characterize the genomic architecture that controls quantitative traits. Using Affymetrix technology and 211 individuals from a segregating Arabidopsis population, the transcript variation (i.e., expression level polymorphisms, ELPs) of 22,810 genes, in both control and treatment conditions, provides data for mapping expression QTL (eQTL). Results from our statistical analysis of the entire genome reveal both cis- and trans-eQTL under both control and treatment conditions. The statistical methodology developed for this type of analysis will be presented for a directed analysis of SA-inducible secretory genes controlled by NPR1.

Data Confidentiality: Where Statistics Meets Computer Science
IMU Walnut Room 3:00 - 4:00 PM

Alan Karr - Director of the National Institute of Statistical Sciences

Government agencies and businesses face a multitude of tensions between protecting the confidentiality of their data (for legal, quality and other reasons) and allowing legitimate uses of the data (for policy, research or other purposes). The technical problems involve the statistical, mathematical and computational sciences, as well as domain science, all immersed in difficult legal and societal issues.

In this talk, I will outline a decision-theoretic formulation of data confidentiality problems as tradeoffs between quantified measures of data disclosure risk and data utility. Then, I will focus on two problems lying at the intersection of statistics and computer science. The first is methods and systems for secure, principled statistical analysis of distributed data. The second is verification servers, which give users of publicly released data that have been altered to protect confidentiality information about the fidelity of their analyses as compared to analyses of the original data.