This series is open to all Indiana University faculty and students interested in statistical research.
Time and place: Selected Mondays from noon to 1pm in the Indiana Memorial Union.
Please join us for our next talk.
|
PAST COLLOQUIA
|
A variable weighting and selection procedure for K-means cluster analysis April 28, 2008 -
Douglas L. Steinley, University of Missouri-Columbia
A variance-to-range ratio variable weighting procedure is proposed. The method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the weighting technique. The performance of these procedures is compared to existing methods in the literature.
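As a rough illustration of what a variance-to-range ratio can look like, the sketch below scales each variable's variance by its squared range. The abstract does not state the formula, so the factor 12 (chosen so that a uniformly spread variable scores about 1), the function names, and the rule of dividing by the smallest ratio are all assumptions made for illustration, not necessarily the speaker's procedure.

```python
from statistics import pvariance


def variance_to_range_ratio(column):
    """Return 12 * variance / range**2 for one variable (assumed form)."""
    spread = max(column) - min(column)
    if spread == 0:
        return 0.0  # a constant variable carries no cluster information
    return 12.0 * pvariance(column) / spread ** 2


def relative_weights(columns):
    """Weight each variable by its ratio relative to the smallest ratio."""
    ratios = [variance_to_range_ratio(c) for c in columns]
    smallest = min(r for r in ratios if r > 0)
    return [r / smallest for r in ratios]


# Hypothetical data: the first variable has two tight clumps, the second is evenly spread.
clustered = [0.0, 0.0, 1.0, 1.0]
evenly_spread = [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0]
weights = relative_weights([clustered, evenly_spread])
```

A variable whose values pile up at the ends of its range gets a larger ratio than one whose values are spread evenly, so it would receive more weight in the distance used by K-means.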
|
Birthweight Distribution and Infant Mortality: Thinking Outside the Curve April 24, 2008 -
Professor Richard Charnigo - University of Kentucky
Greater epidemiologic understanding of the relationships among fetal-infant mortality and its prognostic factors, including birthweight, could have vast public health implications. A key step toward that understanding is a realistic and tractable framework for analyzing birthweight distributions and heterogeneity in fetal-infant mortality. We propose describing a birthweight distribution using a normal mixture model in which the number of components is determined from the data, then estimating birthweight-specific mortality curves within each component of the normal mixture. We emphasize both methodological issues (e.g., How should the number of components be determined?) and interpretive issues (e.g., What do the components represent?). Data from the National Center for Health Statistics Public-Use Perinatal Mortality Data Files are used to compare our analytic framework to existing frameworks as well as to assess the reproducibility across repeated sampling of results obtained through our framework. (This talk is based on work with Lorie Wayne Chesnut, Tony LoBianco, and Russell S. Kirby.)
|
On the Frequency Polygon Estimator for Random Fields April 07, 2008 -
Michel Carbon - Rennes University, France
The purpose of this talk is to investigate the frequency polygon as a density estimator for stationary random fields indexed by multidimensional lattice points in space. Optimal binwidths that asymptotically minimize the integrated mean square error (IMSE) are derived. Under weak conditions, frequency polygons achieve the optimal uniform rate of convergence. Rates of almost sure convergence are also given. Finally, asymptotic normality of the frequency polygon estimator is established.
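For readers unfamiliar with the estimator, here is a minimal sketch of a frequency polygon in the simplest setting: univariate data with bins anchored at zero. The talk concerns random fields on multidimensional lattices, so this is only the textbook one-dimensional version, and the function name and example data are hypothetical.

```python
import math
from collections import Counter


def frequency_polygon(data, binwidth):
    """Return a density estimate that linearly interpolates histogram heights
    between adjacent bin midpoints."""
    n = len(data)
    counts = Counter(math.floor(x / binwidth) for x in data)

    def height(k):
        # Histogram height on the bin [k * binwidth, (k + 1) * binwidth)
        return counts.get(k, 0) / (n * binwidth)

    def estimate(x):
        k = math.floor(x / binwidth - 0.5)  # bin whose midpoint lies at or left of x
        left_mid = (k + 0.5) * binwidth
        t = (x - left_mid) / binwidth       # position between the two midpoints
        return (1.0 - t) * height(k) + t * height(k + 1)

    return estimate


fp = frequency_polygon([0.1, 0.2, 0.3, 1.5], binwidth=1.0)
density_at_one = fp(1.0)
```

The binwidth plays the same role as in a histogram, which is why the choice of an optimal binwidth is central to the results described in the abstract.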
|
Predictive Regression Models for Nonstationary Economic and Financial Data March 20, 2008 -
Professor Zongwu Cai - University of North Carolina at Charlotte and Wang Yanan Institute for Studies in Economics, Xiamen University, China
Motivated by forecasting the inflation rate through nonstationary variables and efficient tests of stock return predictability as well as forecasts of the equity premium, this talk will focus on how to use nonparametric or semiparametric regression techniques to analyze nonstationary time series data. Development of a nonparametric approach to estimate the functionals will be discussed, as well as how the consistency and asymptotic normality of the proposed estimators are obtained. The asymptotic results have shown that the asymptotic bias is the same for all estimators of functionals, but that the convergence rates are totally different for stationary and nonstationary covariates. These findings appear to be new to the literature.
|
The Additive-Interactive Nonlinear Volatility Model, Its Estimation And Some Testing Issues March 03, 2008 -
Michael Levine - Purdue University
We consider a new separable nonparametric volatility model that allows for “interactions” in both the mean and the conditional variance (volatility) functions. It can be concisely described as an additive-interactive nonlinear ARCH model. We propose this model as a possible alternative to the generalized additive nonlinear ARCH (GANARCH) model of Kim and Linton (2004), with which it shares a common origin. Unlike the GANARCH model, it does not assume a known link function but instead includes second-order interaction terms in both the mean and variance functions. This yields a much more data-driven model than GANARCH, since we do not assume that anything is known about the data distribution. This is very beneficial since, in practice, the data distribution has to be selected on the basis of exploratory data analysis, which is very difficult for multivariate data. Thus, the proposed model is much more flexible than GANARCH.
Motivated by the local instrumental variable estimation method (LIVE), also introduced in Kim and Linton (2004), we propose instrumental variable-based estimators of the components of the mean and volatility functions. The estimators are shown to be consistent and asymptotically normal. Explicit expressions for asymptotic means and variances of these estimators are also obtained. Several simulation experiments are conducted that show a very good performance of our algorithm for moderate sample sizes. Finally, the method is applied to the real data set of currency exchange rates where it leads to some interesting conclusions.
Historically, multiple functional component testing in nonparametric models has been a fairly difficult problem. We introduce a novel F-type approach to testing the significance of the two-way interactive terms in the mean function based on the unbalanced design ANOVA with unequal variances. Simulation studies show that the method performs very well for sample sizes of about 5000, which are easily available in financial applications.
|
Comparison of Two Approaches for Handling Missing Covariates in Logistic Regression February 25, 2008 -
Professor Chao-Ying Joanne Peng - Indiana University
For the past 25 years, advances have been made in missing data methods. Most published work has focused on missing data in the dependent variable under various conditions. The present study sought to fill the void by comparing two approaches for handling missing data in categorical covariates in logistic regression. These two approaches were the EM method of weights and multiple imputation (MI).
Sample data were first drawn randomly from a population with known characteristics. Missing data on covariates were subsequently simulated under two conditions: missing completely at random and missing at random with different missing rates. A logistic regression model was fit to each sample using either the EM or the MI approach. The performance of these two approaches was assessed on four criteria: bias, efficiency, coverage, and rejection rate.
Results generally favored MI over EM. Practical issues such as implementation, inclusion of continuous covariates, and interactions between covariates were discussed.
|
Convergence Rates and Asymptotic Standard Errors for MCMC Algorithms for Bayesian Probit Regression February 14, 2008 -
Vivekananda Roy - University of Florida
We study Markov chain Monte Carlo algorithms for exploring the intractable posterior density that results when a probit regression likelihood is combined with a flat prior on the regression coefficient. We prove that the data augmentation algorithm of Albert and Chib (1993) and the PX-DA algorithm of Liu and Wu (1999) both converge at a geometric rate, which ensures the existence of central limit theorems (CLTs) for ergodic averages under a second moment condition. While these two algorithms are essentially equivalent in terms of computational complexity, we show that the PX-DA algorithm is theoretically more efficient in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than that under Albert and Chib's algorithm. A simple, consistent estimator of the asymptotic variance in the CLT is constructed using regeneration. As an illustration, we apply our results to the lupus data from van Dyk and Meng (2001). In this particular example, the estimated asymptotic relative efficiency of the PX-DA algorithm with respect to Albert and Chib's algorithm is about 65, which demonstrates that huge gains in efficiency are possible by using the PX-DA algorithm.
|
Applications of Empirical Likelihood to Quantile Estimation and Longitudinal Data February 04, 2008 -
Jien Chen - University of Georgia
As a non-parametric method, Empirical Likelihood (EL) has been attracting serious attention from researchers in statistics, econometrics, engineering and biostatistics. By defining the estimation equations in EL appropriately, we can extend EL to various data settings, especially those in which parametric likelihoods are absent. In this talk, I will provide two examples of such extensions: quantile estimation and longitudinal data analysis. Quantile estimation for discrete data analysis has not been well studied. For a given 0 < p < 1, the commonly used sample quantile may or may not be consistent for the pth quantile, depending on whether or not the underlying distribution has a plateau at the level of p. I propose an EL-based categorization procedure that not only helps determine the shape of the true distribution at level p, but also provides a way of formulating a new estimator that is consistent in any case. For non-Gaussian longitudinal data, generalized estimating equations (GEE) are a popular class of marginal models. While the GEE estimator is consistent and robust, it may suffer significant loss of efficiency if the working correlation structure is misspecified. I consider the use of EL to select working correlations for GEE models, for which parametric likelihoods are absent and quasi-likelihoods are difficult to construct.
|
A Multivariate Semiparametric Bayesian Spatial Modeling Framework for Hurricane Surface Wind Fields January 31, 2008 -
Brian Reich - North Carolina State University
Storm surge, the onshore rush of sea water caused by the high winds and low pressure associated with a hurricane, can compound the effects of inland flooding caused by rainfall, leading to loss of property and loss of life for residents of coastal areas. Numerical ocean models are essential for creating storm surge forecasts for coastal areas. These models are driven primarily by the surface wind forcings. Currently, the gridded wind fields used by ocean models are specified by deterministic formulas that are based on the central pressure and location of the storm center. While these equations incorporate important physical knowledge about the structure of hurricane surface wind fields, they cannot always capture the asymmetric and dynamic nature of a hurricane. A new Bayesian multivariate spatial statistical modeling framework is introduced combining data with physical knowledge about the wind fields to improve the estimation of the wind vectors. Many spatial models assume the data follow a Gaussian distribution. However, this may be overly restrictive for wind field data, which often display erratic behavior, such as sudden changes in time or space. In this paper we develop a semiparametric multivariate spatial model for these data. Our model builds on the stick-breaking prior, which is frequently used in Bayesian modeling to capture uncertainty in the parametric form of an outcome. The stick-breaking prior is extended to the spatial setting by assigning each location a different, unknown distribution, and smoothing the distributions in space with a series of kernel functions. This semiparametric spatial model is shown to improve prediction compared to usual Bayesian Kriging methods for the wind field of Hurricane Ivan.
|
Designing Penalty Functions for Grouped and Hierarchical Selection January 28, 2008 -
Guilherme Rocha - University of California, Berkeley
Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task. Quasi-norms on model parameters are frequently used as a penalty. Classical examples are AIC and BIC where the L0 quasi-norm (model dimension) is used as a penalty.
More recently, penalization by the L1-norm (lasso) has enjoyed a lot of attention. L1-penalized estimates are cheaper to compute (convex optimization) and lead to more stable model estimates than their L0 counterparts.
In this talk, I will present the Composite Absolute Penalties (CAP) family of penalties. CAP penalties allow given grouping and hierarchical relationships between the predictors to be expressed. They are built by defining groups of variables and combining the properties of norm penalties at the across group and within group levels. Grouped selection occurs for non-overlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns.
Under easily verifiable assumptions, CAP penalties are convex, an attractive property from a computational standpoint. Within this subfamily, unbiased estimates of the degrees of freedom (df) exist, so the regularization parameter can be selected without cross-validation.
Simulation results show that CAP improves on the predictive performance of the LASSO for cases with p>>n and mis-specified groupings.
This is joint work with Peng Zhao and Bin Yu.
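The construction described above (a norm within each group, then a norm across groups) can be sketched as follows. The abstract does not give the formula, so the specific form used here, the sum over groups of each group's norm raised to an overall power, is an assumption; the function name, argument names, and example groupings are hypothetical.

```python
import math


def cap_penalty(beta, groups, group_norms, overall_norm=1.0):
    """Assumed CAP-style penalty: sum_k (||beta_{G_k}||_{gamma_k}) ** overall_norm.

    groups      -- list of index lists; groups may overlap
    group_norms -- the within-group norm gamma_k for each group (math.inf allowed)
    """
    penalty = 0.0
    for members, gamma_k in zip(groups, group_norms):
        magnitudes = [abs(beta[j]) for j in members]
        if math.isinf(gamma_k):
            group_size = max(magnitudes)
        else:
            group_size = sum(m ** gamma_k for m in magnitudes) ** (1.0 / gamma_k)
        penalty += group_size ** overall_norm
    return penalty


beta = [3.0, -4.0, 0.0]
# Non-overlapping groups with an L2 norm inside each group (grouped selection).
grouped = cap_penalty(beta, groups=[[0, 1], [2]], group_norms=[2.0, 2.0])
# Overlapping groups with an L-infinity norm (one illustrative hierarchical pattern).
hierarchical = cap_penalty(beta, groups=[[0, 1], [1]], group_norms=[math.inf, math.inf])
# Singleton groups reduce to the L1 (lasso) penalty.
lasso_like = cap_penalty(beta, groups=[[0], [1], [2]], group_norms=[1.0, 1.0, 1.0])
```

With singleton groups the penalty collapses to the lasso, which is one way to see CAP as a generalization of L1 penalization.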
|
On Large Margin Hierarchical Classification January 14, 2008 -
Junhui Wang - Columbia University
Hierarchical classification is critical to knowledge and context management as well as knowledge exploration, as in gene function classification and discovery and document categorization. In hierarchical classification, an input is classified by a structured hierarchy. In such a situation, the central issue is how to effectively utilize inter-class relationships to improve on the generalization performance of flat classification, which ignores such dependency. In this talk, a novel large margin method based on constraints characterizing multi-path hierarchy is presented within the framework of regularization. In particular, I will discuss three aspects: (1) the idea and methodology development; (2) computational tools; (3) a statistical learning theory. Numerical examples will be provided to demonstrate the advantage of our proposed methodology over existing competitors. An application to gene function prediction and discovery will be discussed.
|
Higher Order Semiparametric Inference Based on the Profile Likelihood January 10, 2008 -
Guang Cheng - SAMSI
Semiparametric modeling is an excellent framework due to its flexibility to model some features parametrically without making assumptions on the other features. However, the infinite-dimensional nuisance parameter in the semiparametric models generally poses several challenges for making maximum likelihood inference for the parameter of interest at both theoretical and methodological levels. We will consider a series of profile likelihood based semiparametric inference procedures either based on numerical methods, i.e. K-step MLE, or through MCMC sampling, i.e. the Profile Sampler and the Penalized Profile Sampler. All the above profile likelihood based methods avoid evaluation of the infinite-dimensional operator and are easy to implement. Furthermore, we investigate their second order asymptotic behaviors, which are proven to be related to the convergence rate of the nuisance parameter and thus adjustable.
|
Visualization Databases for Lossless Analysis of Complex Data Sets November 26, 2007 -
William Cleveland -- Purdue University
Large, complex data sets are ubiquitous, the standard now rather than the exception. They present challenging problems of analysis because of their size and the complexity of their data structures and patterns. One approach is to compute summary statistics at the outset to reduce the complexity, but this expedient risks losing important information in the data. The goal should be lossless analysis: analyze the data at a level of detail and comprehensiveness that does not sacrifice information.
Achieving lossless analysis of complex data today is immensely challenging. New fundamental approaches and methods are needed for each of the different areas that come into play in the analysis of the data: databases, data processing, data structures, statistical models and methods, machine learning algorithms, data visualization, computational algorithms, software environments, and hardware environments. In fact, it has never been harder to achieve lossless analysis because complexity has increased faster than our innovations in these areas.
Nothing serves lossless analysis better than data visualization, the only practical way to absorb large amounts of information in detail. But for today's complex sets we must visualize far larger amounts than in the past. We must be ready to accept large displays each covering tens or even hundreds of screensful (pages). For a single data set it is reasonable to have hundreds of such displays. These displays become a new database produced from the data that is queried and studied. For a display of 500 pages, we might query and study all or just a few of the pages depending on the task.
Producing, querying, and studying a visualization database needs new ideas. There are different modes of viewing the many pages and panels per page of a large display, from slow focused study to very rapid scans. We need creative interfaces to facilitate the different modes. We cannot fuss with very large displays, interacting with the micro-elements to get them right, because there is too much; instead there should be smart automation algorithms that get the large display right the first time. We must consider the physical screen space, its size and resolution, to make it work most effectively for the visual study. We need methods of display that result in pre-attentive visual formation of gestalts that show instantaneously the relevant patterns in the data. This necessitates, strangely, more displays, starting with broad brush looks to derivative displays whose redesigns show specific aspects of the broad brush more effectively. It also requires the study of visual perception.
|
Skewed Multivariate Distributions November 12, 2007 -
Tonu Kollo -- Institute of Mathematical Statistics, University of Tartu
In the last ten years, remarkable development has occurred in the area of skewed multivariate distributions. The skew normal distribution was introduced in 1996 by Azzalini & Dalla Valle (Biometrika). Azzalini’s construction of the distribution was very fruitful and was later successfully applied to many other elliptical families of distributions. A random vector X is skew normally distributed with parameters α (a p-vector, the shape parameter) and Σ (a positive definite p×p matrix, the scale parameter) when the density function of X is the product of the density function of N(0,Σ) and the distribution function of the standard normal distribution, with the shape parameter appearing in the argument of the distribution function.
In the talk, basic properties of the skew normal family will be discussed, and other frequently used families of skewed elliptical distributions will be examined (the multivariate skew t-distribution, for instance). With new families of distributions, new estimation and testing problems have arisen. Classical estimation methods often do not work: the maximum likelihood method can give wrong estimates, and the method of moments can be heavily biased.
Another type of skewed multivariate distribution is represented by the asymmetric Laplace distribution, which was carefully examined by T. Kozubowski in a series of papers at the end of the 1990s. In this case, we do not have an explicit expression for the density function, and estimation, testing and fitting problems have to be solved on the basis of the characteristic function. This distribution will also be considered in more detail.
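The verbal definition of the skew normal density in the first paragraph of this abstract can be written out as below. The factor 2 is the normalizing constant of the standard Azzalini construction, which the abstract leaves implicit. The plain argument α'x is an assumption, since parameterizations differ in how the shape vector is scaled inside the distribution function.

```latex
% Skew normal density for a p-vector x, shape parameter \alpha, scale matrix \Sigma
f_X(x) = 2\,\phi_p(x;\Sigma)\,\Phi\!\left(\alpha^{\top}x\right), \qquad x \in \mathbb{R}^p,
\]
\[
\phi_p(x;\Sigma) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}
  \exp\!\left(-\tfrac{1}{2}\,x^{\top}\Sigma^{-1}x\right),
\qquad
\Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}}\,e^{-t^{2}/2}\,dt .
```

Setting α = 0 gives Φ(0) = 1/2, so the density reduces to that of N(0,Σ), which is why α is read as a shape (skewness) parameter.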
|
Principal Component Analysis of Forward Interest Rates November 05, 2007 -
Victor Goodman -- Indiana University
Forward interest rates are simultaneously measured using up-to-the-minute bond trading quotes. The time variation of these rates determines a high-dimensional covariance matrix that might be used to model bond yields within a country's government bond market.
Several PC analyses, with 1989-92 data in the U.K. market, 1887-94 data in the U.S. market, and 2001-05 data in the U.S. market, reveal a striking pattern involving the first three eigenvectors of each covariance matrix. In this talk I describe the pattern and make the (well-known) case for using a three-factor Gaussian model to describe bond trading.
It is difficult to implement models based on PC estimates for the first eigenvector. An initial attempt to produce a model in 1988 ended in failure since the model had a financial inconsistency. A more recent model behaves better; its covariance has the desired eigenvector and the model is arbitrage-free. Surprisingly, the new model appears when we condition prices not to collapse in the old model. This suggests that an issue of survivorship may arise even in no-default bond markets.
|
What is quantum probability theory, and how can it be used to analyze measurements in the social and behavioral sciences? October 29, 2007 -
Jerome R. Busemeyer -- Indiana University
Social and behavioral scientists face some of the same measurement problems that forced physicists to abandon classical probability theory. Their measurements are often incompatible, and the first measurement may disturb a second measurement. Thus only partial information about a complex system can be obtained at any point in time. Combining partial information about a system into a coherent understanding of the entire system is the hallmark of quantum theory. Quantum theory provides a fundamentally different approach to logic, reasoning, and probabilistic inference. For example, quantum logic does not always follow the distributive axiom of Boolean logic; quantum probabilities do not always obey the Kolmogorov law of total probability; quantum reasoning does not always obey the principle of monotonic reasoning.
For this talk, I will present a tutorial on the basic assumptions of classic versus quantum probability theories. These basic assumptions will be examined, side by side, in a parallel and elementary manner. Classic theory will emerge as a possibly overly restrictive case of the more general quantum theory. The fundamental implications of these contrasting assumptions for measurement in the social and behavioral sciences will be examined.
|
Data-Driven Smooth Tests for the Martingale Difference Hypothesis October 22, 2007 -
Juan Carlos Escanciano -- Indiana University
Abstract: A general method for testing the martingale difference hypothesis is proposed. The new tests are data-driven smooth tests based on the principal components of certain marked empirical processes that are asymptotically distribution-free, with critical values that are already tabulated. The smooth tests are shown to be optimal in a semiparametric sense discussed in the paper, and they are robust to conditional heteroscedasticity of unknown form. A simulation study shows that the data-driven smooth tests perform very well for a wide range of realistic alternatives and have more power than omnibus and other competing tests. Finally, two empirical examples highlight the merits of our approach.
|
Dimension Augmented Import Vector Machine (DAIVM): A new General Classifier System for Large p Small n problem, with Application in Bio-Informatics October 01, 2007 -
Dr. Samiran Ghosh, IUPUI
Abstract: Support vector machines (SVMs) and other reproducing kernel Hilbert space (RKHS) based classifier systems have recently drawn much attention because of their robustness and generalization capability. All of these approaches construct a classifier from a training sample in a high-dimensional space by using all available dimensions. SVM achieves huge data compression by selecting only the few observations lying on the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions is very large (large p), not all available dimensions necessarily carry equal information for classification. Selecting only the useful fraction of the available dimensions will result in huge data compression. In this paper we present an algorithmic approach by which such an optimal set of dimensions can be selected. We have reversed and modified the solution proposed by Zhu and Hastie in the context of the Import Vector Machine (IVM), to select an optimal submodel by using only a few observations. For the large p, small n domain (e.g., bioinformatics), our method compares different trans-dimensional models to arrive at an optimal set of dimensions with which to build the final classifier. This not only reduces the computational burden but also makes the selection of biomarkers (each associated with a dimension) a much easier task.
|
Accuracy in Parameter Estimation for Standardized Effect Sizes April 16, 2007 -
Ken Kelley, Indiana University
Abstract: In the behavioral, educational, and social sciences, there has been a major push to report effect sizes and their corresponding confidence intervals instead of or in addition to the results of null hypothesis significance tests. With the increased frequency of reporting confidence intervals, a serious problem has manifested: "embarrassingly large" confidence intervals (Cohen, 1994) and parameter estimates that may not accurately reflect their corresponding population values (regardless of whether or not the null hypothesis is rejected). Due to the arbitrary nature of many scales used in the behavioral, educational, and social sciences, the most widely reported effect sizes are standardized (e.g., the standardized mean, squared multiple correlation coefficient, and coefficient of variation). After a discussion of confidence interval formation for standardized effect sizes, an approach to sample size planning that emphasizes accuracy in parameter estimation (AIPE) is discussed in the context of widely used standardized effect sizes; AIPE and power analysis results are also compared. One approach yields the necessary sample size so that the expected confidence interval width is sufficiently narrow. A modification allows a desired degree of assurance to be incorporated into the sample size planning procedure so that the probability of obtaining a confidence interval no wider than desired can be specified by the researcher (e.g., 99% assurance that the 95% confidence interval will be less than w units wide). It will be shown that the methods discussed can be easily implemented in the MBESS R package.
|
Local and Global Analytic Curve Estimation April 09, 2007 -
Cidambi Srinivasan, University of Kentucky
Abstract: Several methods have been developed in Functional Data Analysis to estimate a mean response function, but most of these methods do not lend themselves to simultaneous estimation of the mean response and its derivatives. Being able to recover derivatives accurately is important in applications involving velocities and accelerations, for characterizing nanoparticles from scattering data, and for analyzing complex systems described by differential equations. This talk proposes a novel global estimator derived from a calculus of variations problem. The estimator is analytic and hence can be directly differentiated to estimate the derivatives of the mean response. In particular, the estimator and its derivatives converge uniformly to the mean response and (a finite but arbitrary number of) its derivatives on a compact interval. The theoretical properties, the finite sample refinements for practical implementation, and the empirical performance of the estimator will be discussed.
|
Statistical Failure Diagnosis in Software and Systems March 27, 2007 -
Alice Zheng, Carnegie Mellon University
Abstract: As software and systems become increasingly complex, the task of debugging also becomes increasingly difficult. Manual diagnosis can require sifting through millions of lines of code and output logs. In addition, large systems contain many components, each complex on its own, and often interacting in unexpected ways. I present a case study illustrating how statistical machine learning algorithms, along with appropriate system instrumentation, can aid in failure diagnosis. I propose a statistical software debugging framework that collects information from past successes and failures via fine-grained instrumentation of the program and then analyzes this information to locate suspicious program predicates. I discuss the algorithmic challenges of the approach, and demonstrate a bi-clustering algorithm that is effective at simultaneously clustering failed runs and selecting useful predicates. Using this approach, it took a programmer 20 minutes to find a long-standing bug in a real-world software program which he had never seen before.
|
Combining Group-Based Trajectory Modeling and Propensity Score Matching in the Analysis of Non-experimental Longitudinal Data March 26, 2007 -
Daniel Nagin, Carnegie Mellon
Abstract: A central theme of research on human development and psychopathology is whether a therapeutic intervention or a turning point event, such as a family break-up, alters the trajectory of the behavior under study. This talk describes an approach for using observational longitudinal data to make more confident causal inferences about the impact of such events on developmental trajectories. The method combines two distinct lines of research: Work on the use of finite mixture modeling to analyze developmental trajectories and work on propensity score matching. The propensity scores are used to balance observed covariates and the trajectory groups are used to control pretreatment measures of response. The trajectory groups also aid in identifying classes of subjects for which no good matches are available. The approach is demonstrated with an analysis of the impact of gang membership on violent delinquency based on data from a large longitudinal study conducted in Montréal.
|
Statistical Analysis of Bullet Lead Compositions as Forensic Evidence March 08, 2007 -
Karen Kafadar, University of Colorado-Denver & Health Sciences Center
Abstract: Since the 1960s, the FBI has performed Compositional Analysis of Bullet Lead (CABL), a forensic technique that compares the elemental composition of bullets found at a crime scene to that of bullets found in a suspect's possession. CABL has been used when no gun is recovered, or when bullets are too small or fragmented to compare striations on the casings with those on the gun barrel. The National Academy of Sciences formed a Committee charged with the assessment of CABL's scientific validity. The report, "Forensic Analysis: Weighing Bullet Lead Evidence" (National Research Council, 2004), included discussions on the effects of the manufacturing process on the validity of the comparisons, the precision and accuracy of the chemical measurement technique, and the statistical methodology used to compare two bullets and test for a "match". This talk will focus on the statistical analysis: the FBI's methods of testing for a "match", the apparent false positive and false negative rates, the FBI's clustering algorithm ("chaining"), and the Committee's recommendations. Additional analyses on data later made available, and the use of forensic evidence in general, will also be discussed.
|
Some important issues in the development of space-time correlation models February 12, 2007 -
Chunsheng Ma, Wichita State University; SAMSI
Abstract: The world is dynamic at many scales in space and time, and the space and time interaction is prevalent in almost every field in the behavioral, social, environmental, informational, and geophysical sciences. Whenever possible and available, a rational approach for modeling spatio-temporal data would start from a theory or mechanism that explains the underlying physical knowledge. In reality, however, no obvious mechanism may exist, and frequently it is such a theory that needs to be developed from observational or experimental study. For this purpose, statistical techniques are often very important tools, and deterministic and stochastic models to demonstrate spatio-temporal mechanisms are prominent among these. Two commonly used tools to describe the space-time interaction and dependence are the covariance function and variogram. In this talk we will briefly survey some recent advances on how to construct spatio-temporal variograms and covariance functions, and discuss several issues in the development of space-time covariance models, which include separability, stationarity, smoothness, long-range dependence, the Gaussian assumption, the range of space-time correlation, and aliasing and embedding problems.
|
Empirical Likelihood Estimation for Missing Response Data February 05, 2007 -
Biao Zhang, University of Toledo
Abstract: Missing data frequently occurs in health and social science studies. It is well known that an analysis based only on complete data is generally biased unless the missing-data mechanism is completely at random. In this talk, we discuss an empirical likelihood method for handling missing response data when the missing-data mechanism is covariate-dependent. In the case of estimation of mean response, the empirical likelihood method makes effective use of auxiliary covariate information under a working regression model and a working propensity model. The empirical likelihood-based estimator of the mean response is doubly robust, i.e., it is asymptotically consistent if either the underlying regression model or the underlying propensity model is correctly specified. Moreover, the estimator is asymptotically efficient when both the regression and propensity models are correctly specified. As an application, we consider estimation of average causal treatment effects in observational studies by viewing the causal inference as a two-sample missing data problem. Some numerical results are also presented.
|
Spline-Backfitted Kernel Smoothing of Additive Models in Time Series January 25, 2007 -
Lily Wang, Michigan State University
Abstract: Application of non- and semiparametric regression techniques to high-dimensional time series data has been hampered by the lack of effective tools to address the "curse of dimensionality." Under rather weak conditions, we propose a spline-backfitted kernel estimator of the component functions for nonlinear additive time series data that is both computationally expedient, so it is usable for analyzing very high-dimensional time series, and theoretically reliable, so inference can be made on the component functions with confidence. Simulation experiments provide strong evidence that corroborates the asymptotic theory. Finally, the estimation procedure is illustrated with US unemployment rate data.
|
Functional Genomics of Quantitative Traits: Expression Level Polymorphisms of QTLs Affecting Disease Resistance Pathways in Arabidopsis (eQTL) January 22, 2007 -
J. Rebecca Doerge, Purdue University
Abstract: There is increasing interest in understanding the molecular basis of complex traits. Initially, the genetic dissection of quantitative traits involved measurements of gross phenotypes. Most recently, the underlying mechanisms of inheritance have been studied through various approaches that are supported by modern technological and methodological advances, namely quantitative trait locus/loci (QTL) analysis and mutant analysis in genetics; genome sequencing and gene expression analysis in genomics; and protein structure analysis and protein assay in proteomics. Since each technology and approach focuses on specific pieces of the larger, poorly understood systems biology, the challenge is to integrate these different types of information to elucidate the genetic architecture of complex traits. To address one of these challenges, we have combined QTL analysis with microarray analysis to characterize the genomic architecture that controls quantitative traits. Using Affymetrix technology and 211 individuals from a segregating Arabidopsis population, the transcript variation (i.e., expression level polymorphisms, ELPs) of 22,810 genes, in both control and treatment conditions, provides data for mapping expression QTL (eQTL). Results from our statistical analysis of the entire genome reveal both cis- and trans-eQTL under both control and treatment conditions. The statistical methodology developed for this type of analysis will be presented for a directed analysis of SA-inducible secretory genes controlled by NPR1.
|
|