PAST EVENTS:
|
On Sampling Design Issues When Dealing with Zeros Date: March 26, 2012 - IMU Maple Room The problem of sampling from a population in which not all the units contribute to the calculation of an estimated population total was discussed in section 2.13 of Cochran’s Sampling Techniques and in King’s 1966 paper, “Sampling Design Issues When Dealing With Zeros.” The model studied there was one in which each item in the population had a fixed probability of being included in the estimate of the population total. We revisit that model under the particular assumption that the units in the population follow a gamma distribution, and under some other probability mechanisms governing which items are included in the estimate. In particular, we study the effect of using stratified sampling on the estimated population total.
The consideration of this problem was motivated by a study of medical claims in which a subset were fraudulent. The US Department of Health and Human Services Office of Audit Services has issued sampling guidelines for such studies and RAT-STATS, a statistical software package to implement the sampling. We indicate in this paper the implications of our model for these sampling guidelines.
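As a rough illustration of the setting described above, the following Python sketch simulates a population in which each unit contributes a gamma-distributed amount with a fixed probability and zero otherwise, then estimates the population total from simple random samples. It is not the model or code from the talk: the function names, the inclusion probability 0.2, and the gamma shape and scale are all assumptions chosen for illustration.

```python
import random


def simulate_population(size, p_nonzero, shape, scale, rng):
    """Each unit is nonzero with probability p_nonzero (assumed value);
    nonzero amounts are gamma(shape, scale) draws (assumed parameters)."""
    return [
        rng.gammavariate(shape, scale) if rng.random() < p_nonzero else 0.0
        for _ in range(size)
    ]


def expansion_estimate(population, sample_size, rng):
    """Estimate the population total as N times the mean of a simple
    random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return len(population) * sum(sample) / sample_size


rng = random.Random(2012)
population = simulate_population(10_000, 0.2, 2.0, 50.0, rng)
true_total = sum(population)

# Repeat the sampling to see how the estimator behaves on average.
estimates = [expansion_estimate(population, 1_000, rng) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)

print(f"true total:       {true_total:.0f}")
print(f"average estimate: {mean_estimate:.0f}")
```

Because most sampled units are zero, individual estimates vary considerably even though their average is close to the true total; this is the kind of variability that a stratified design would aim to reduce.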
|
|
Department Colloquium: Strategies for Streaming Exploratory Data Analysis Date: March 19, 2012 - IMU Maple Room
|
|
Probabilistic Tools for Bayesian Inference in Directed Markov Random Fields Date: February 03, 2012 - TBA @ 2pm
|
|
STATISTICAL ESTIMATION IN MODERN ASTRONOMY: REPEATED INVERSE PROBLEMS AND FUNCTIONAL INTERPOLATION Date: January 26, 2012 - Kelley School CG 1008 10:00am In this talk, I will present new methods for tackling two related problems that arise in the data-processing pipelines of modern astronomy. The first is a method for combining sequences of low-quality observations into an accurate estimate of the underlying signal of interest. This method has many advantages over existing ones, such as automatic tuning parameter selection and a strong theoretical backing. The second method I will present is a new technique for comparing astronomical images when one of the images is of much lower quality than the other. This is an important step in the automated real-time discovery of interesting transient phenomena. Both methods are being developed for use in the Large Synoptic Survey Telescope (LSST).
Bio: Darren Homrighausen graduated cum laude from the University of Colorado at Denver in 2006, earning a B.S. in Mathematical Economics. He is currently a PhD candidate in the Department of Statistics at Carnegie Mellon University. His research interests include inverse problems in signal processing applications and the application of statistical methodology to astronomy. Mr. Homrighausen is a candidate for a faculty position.
|
|
GENERALIZED FIDUCIAL INFERENCE FOR LINEAR MIXED MODELS Date: January 23, 2012 - Kelley School of Business - CG 1014 3:00pm Under the GFI paradigm, inference is performed by considering the generalized fiducial distribution on the parameter space, which has flexibility similar to a posterior distribution in Bayesian methods. GFI can be thought of as a transfer of probability from the model space to the parameter space, and a generalized fiducial distribution is defined for the unknown parameters of the model. In this talk, I will discuss how the generalized fiducial framework can be applied to some linear mixed model settings. Similar to Bayesian methodology, GFI is a computationally-based mode of inference, and we develop sequential Monte Carlo algorithms to obtain samples from the generalized fiducial distribution of the unknown parameters. The focus will be on normal linear mixed models, but logistic regression with mixed effects will also be addressed. In the normal linear mixed model setting, the proposed method is found to be competitive with or better than competing methods when evaluated based on frequentist criteria of empirical coverage and average length of confidence intervals.
Bio: Jessi Cisewski earned a B.S. in Mathematics from the University of Notre Dame in 2005. She is currently a Ph.D. candidate in the Department of Statistics & Operations Research at the University of North Carolina at Chapel Hill. Her research interests include fiducial inference, linear mixed models, and the use of statistics in astronomy. Ms. Cisewski is a candidate for a faculty position.
|
|
COMMUNITY DETECTION AND EXTRACTION IN NETWORKS Date: January 19, 2012 - IMU Maple Room 10:00am Bio: Yunpeng Zhao earned a B.S. in Statistics from the University of Science & Technology of China in 2007 and is currently a Ph.D. candidate in the Department of Statistics at the University of Michigan. His research interests include network analysis, the spectral theory of random matrices, and machine learning. Mr. Zhao is a candidate for a faculty position.
|
|
Colloquium Speaker - Zhengyuan Zhu
Date: November 28, 2011 - IMU Maple Room 3 PM - 4 PM Spatial sampling design problems have been studied by statisticians for
many different application areas such as agriculture, soil
science, and environmental science. Though many of the methodologies
in spatial sampling design can be used to help design the sampling
plan of wireless sensor networks (WSN), a WSN has characteristics,
such as energy and communication constraints, that are not
present in a traditional sampling network and that pose new
challenges to statisticians. In this talk we will give an overview of
spatial sampling design and discuss its relationship to the sampling
design for WSN. An example of spatial sampling design for regional
trend estimation and some preliminary results on the optimal sampling
design of a WSN for parameter estimation under energy and
communication constraints will be presented.
|
|
Yoav Benjamini ~ Hierarchical Testing of families of Hypotheses Date: October 24, 2011 - CG 1034 ~ Kelley School of Business, 3pm - 4pm As the size of large testing problems encountered in genomic research keeps increasing, more of these problems have further structure where the set of hypotheses can be further partitioned into families of hypotheses, and the true state of the tested signals tends to be more similar within these subsets than across the subsets. Moreover, interest may lie with a discovery of a family with some signal in it, on top of the discovery of a signal in each of the many hypotheses on its own. The challenges in the analysis of such multiple testing problems will be discussed. We then present the concept of control on the average over the selected families of the desired error rate, be it the familywise error rate, the False Discovery Rate, or their generalizations. We discuss the various considerations involved using the genomic part of a Norwegian epidemiological study of breast cancer, and a study involving genomics and brain imaging.
|
|
Stat Day 2011
Date: April 25, 2011 - IMU Dogwood Room The Department of Statistics at Indiana University, Bloomington, will host the third annual "Stat Day" on April 25, 2011. The first Stat Day was hosted by Purdue University in 2009. In 2010, it was held at IUPUI. This year's Stat Day will be an informal gathering of faculty and students from the Department of Statistics and colleagues from universities throughout the state of Indiana. Stat Day was created to bring together statisticians with varied research interests as a way to facilitate the sharing of research ideas and the forming of collaborations. Participants share their research through presentations and group discussions. Presentations begin at 10:00am. All talks will be held in the Dogwood Room in the Indiana Memorial Union. The final talk of the day, given by Dr. Karen Kafadar, will chronicle her work on the FBI investigation of the 2001 Anthrax mailings.
|
|
Speaker - Dr. Karen Kafadar
Date: March 30, 2011 - Ballantine Hall room 321 3 PM - 4 PM On February 15, a Committee of the National Academy of Sciences released its report on the scientific approaches used in the investigation into the origins of the anthrax found in letters mailed to New York City and Washington D.C. in October 2001. Findings in the report included: (1) the available scientific evidence alone was insufficient to reach a definitive conclusion; (2) the letters included small amounts of silicon but no evidence that it was added as a dispersant for added weaponization; (3) spores in the letters and in RMR-1029, a flask found at the U.S. Army Medical Research Institute of Infectious Diseases (USAMRIID), share a number of genetic similarities, which could arise in several ways; (4) RMR-1029 was not the immediate "parent material" for spores used in the letters. This talk will discuss the data made available to the Committee that were used in the statistical analyses which led to these findings, focusing primarily on finding (3).
The press release and full report can be found at http://www.nationalacademies.org/onpinews/newsitem.aspx?RecordID=13098
|
|
Colloquium Speaker - Ba Chu
Date: March 28, 2011 - IMU Maple Room 3 PM - 4 PM This paper studies the weighted convergence of partial-sum processes for strictly stationary time series, effectively extending the existing results proved for sequences of i.i.d. random variables and moving averages, by making use of a functional Hungarian construction for sums of independent random variables. The present result is then employed to construct a new nonparametric [k-NN-based] test for change-points in volatility in time series models. The weighted convergence of this test statistic is established and shown to be free from the estimation effect arising from nonparametric estimation of unknown conditional expectation and volatility functions.
|
|
Flury Lecturer - Stephen Stigler
Date: March 10, 2011 - Swain East 140 True multivariate statistical analysis was born at 10:00 on the morning of Thursday, September 10, 1885. Its origin and adoption over the next half-century required the rejection of a mode of mathematical thought that had dominated science for two thousand years. The theory stands as an achievement no less than those of Darwin in evolution and Einstein in relativity. Previous historical accounts have misconstrued the fundamental but subtle nature of the change that took place. All of this will be cleared up and the implications (ranging from the philosophy of inference to predicting the results of the NCAA Tournament) briefly discussed, with a minimum of formal mathematics.
|
|
Colloquium Speaker - Gabriel Huerta
Date: January 24, 2011 - IMU Walnut Room 3 PM - 4 PM We introduce some novel approaches for extreme value analysis that rely on Bayesian dynamic linear modeling and intrinsic Gauss-Markov Random Fields. In particular, we characterize extreme precipitation from a regional climate model via a hierarchical structure based on the Generalized Extreme Value distribution (GEV) that assigns a latent spatial process to its location and scale parameters. In addition, the statistical modeling includes an annual shift in the location parameter that may be able to predict trends over time. Furthermore, in the context of dynamic models, we consider time-varying autoregressions (TVAR) and focus on issues of time-domain decompositions which allow for inferences on the underlying structure of non-stationary time series. In particular, we emphasize TVAR models that deal with model order uncertainty and show their relevance to the analysis of electroencephalographic (EEG) traces.
|
|
Colloquium Speaker - Xiaogang Su
Date: January 10, 2011 - IMU Persimmon Room 3 PM - 4 PM Assessment of treatment effects in observational studies is a multifaceted problem that not only involves heterogeneous mechanisms of how the cause is exposed to subjects, known as propensity, but also differential causal effects across sub-populations. We introduce a concept, termed facilitating score, to account for both the confounding and interacting impacts of covariates on the treatment effect. Several methods for estimating the facilitating scores are discussed. In particular, we put forward a nonparametric causal inference tree (CIT) method which provides a piecewise constant approximation to the facilitating score. CIT splits data in such a way that both the propensity and the causal effect become more homogeneous within the resultant strata and hence causal effects can be conveniently assessed with crude estimates. We also develop modified interaction trees so that differential causal effects can be sought out in a more reliable way. We evaluate the performance of the proposed methods through a simulation study and illustrate their use with the National Supported Work (NSW) data in Dehejia and Wahba (JASA, 1999) where the objective is to assess the impact of a labor training program, the NSW demonstration, on post-intervention earnings.
|
|
Colloquium Speaker - Arup Bose
Date: November 15, 2010 - IMU Walnut Room 3 PM - 4 PM We present a unified approach to limiting spectral distribution (LSD) of patterned matrices via the moment method. We explain the LSD of common matrices and provide insight into the nature of different LSD and their interrelations. The method is flexible enough to be applicable to matrices with appropriate dependent entries, banded matrices, and matrices of the form XX'.
In particular, the semicircular LSD may arise as the limit of non-Wigner matrices. We also show how the notion of half independence arises in random matrices and how the symmetrised Rayleigh limit is obtained. This leads to the notion of half convolution of arbitrary probability measures and the study of their properties.
|
|
Colloquium Speaker - Jun Dai
Date: November 08, 2010 - IMU Maple Room 3 PM - 4 PM Epidemiologic studies involve statistical methods and techniques in both study design and data analysis. A critical issue in epidemiologic studies is the control of confounding, where statistics can play an important role. My recent twin study on the Mediterranean diet and cardiovascular disease demonstrates the use of a new statistical method for analyzing data collected from a twin study, where genetic and common environmental confounding are additionally controlled for. Because dietary habits are acquired while growing up, they are likely to be associated with other familial factors, including genetic factors and other environmental conditions shared by members of the same family, such as unmeasured socioeconomic and lifestyle factors. A twin study design is ideal for controlling confounding from shared genes and common environment, since two members of a twin pair are naturally matched for genes (identical twins share 100% of their genes while fraternal twins share, on average, 50%) and common environment. Using a twin design, my research explores the underlying mechanism of the cardioprotective properties of the Mediterranean diet, and evaluates whether reported associations between dietary factors and cardiovascular disease can be confirmed after controlling for genetic and environmental factors.
|
|
Colloquium Speaker - Jose Figueroa-Lopez
Date: October 11, 2010 - IMU Maple Room 3 PM - 4 PM The Geometric Levy Model (GLM) is one of the most tractable alternatives to the Geometric Brownian Motion intrinsic in the seminal Black-Scholes model. By essentially requiring one additional parameter, certain GLMs are able to fit daily returns of financial data extremely well, as numerous empirical studies have shown. In spite of their importance and popularity nowadays, few works have considered intraday data. In this talk, we will analyze the ability of two popular GLMs (the Variance Gamma and Normal Inverse Gaussian) to fit the statistical features of intraday data at different sampling frequencies. We are also interested in studying the limitations and virtues of the two most favored parametric estimation methods, the Method of Moments Estimators (MME) and the Maximum Likelihood Estimators (MLE), when dealing with high-frequency data under a GLM. Using Monte Carlo simulations, we found that neither high-frequency sampling nor MLE significantly reduces the estimation error of the volatility parameter. On the other hand, one can significantly reduce the estimation error of the parameter controlling the kurtosis of the model by using MLE or intraday data. In the empirical implementation of our models, we found that the estimator of the volatility parameter is quite stable at different frequencies, in contrast to the kurtosis estimator, which is more sensitive to market microstructure. By characterizing the effect of a microstructure noise component in the estimation results, we propose a heuristic method to determine suitable sampling frequencies for which the Levy Model provides a good fit and the estimation results are approximately optimal.
This is joint work with Steven Lancette, Kiseop Lee, and Yanhui Mi.
|
|
Colloquium Speaker - Ming Tan
Date: October 04, 2010 - IMU Maple Room 3 PM - 4 PM Many biomedical research problems boil down to constrained parameter models. Our work was originally motivated by analyzing tumor xenograft experiments in cancer drug development, where tumor volumes are measured repeatedly and the size of the control tumor increases over time. On the other hand, in analyzing high dimensional genomic data, a plethora of regularized statistical learning models, which are based on constrained optimization, have been developed. We first introduce a non-iterative sampling procedure for calculating Bayesian posteriors, which eliminates problems of convergence associated with a Markov chain Monte Carlo approach. We then present feature selection models based on regularized ROC functions such as the F-measure. A colon cancer study will be used to illustrate the methods.
|
|
Colloquium Speaker - Bo Li
Date: April 26, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM Understanding the dynamics of climate change in its full richness requires the knowledge of long temperature time series. Although long-term, widely distributed temperature observations are not available, there are other forms of data, known as climate proxies, that can have a statistical relationship with temperatures and have been used to infer temperatures in the past before direct measurements. We propose a Bayesian hierarchical model to reconstruct past temperatures that integrates information from different sources, such as proxies with different temporal resolution and forcings acting as the external drivers of large scale temperature evolution. Additionally, this method allows us to quantify the uncertainty of the reconstruction in a rigorous manner. The reconstruction method is assessed using a global climate model as the true climate system, with synthetic proxy data derived from the simulation. The target is to reconstruct Northern Hemisphere temperature from proxies that mimic the sampling and errors from tree ring measurements, pollen indices and borehole temperatures. The forcing series used as covariates are solar irradiance, volcanic aerosols and greenhouse gas concentrations. The Bayesian model was successful in integrating these different sources of information in creating a coherent reconstruction. Within the context of this numerical testbed, a statistical process model that includes the external forcings can improve the quality of a hemispheric reconstruction when long time scale proxy information is not available.
|
|
Colloquium Speaker - Bob Bell
Date: April 12, 2010 - Kelley School of Business CG1040 3:00 PM - 4:00 PM In October 2006, the DVD rental company Netflix released more than 100 million user ratings of movies for a competition to predict users’ ratings based on prior ratings. One allure to data analysts around the world was a $1,000,000 prize for a team achieving a ten percent reduction in root mean squared prediction error relative to Netflix’s current algorithm. The size of the data (over 17,000 movies and 480,000 users) and the nature of human-movie interactions produced many modeling challenges. After describing some of the techniques in use and advances spurred by the competition, I will offer lessons and raise some questions about building massive prediction models, the role of statistics versus computer science in such endeavors, and prizes as a way to advance science. This is joint work with Chris Volinsky and Yehuda Koren, current and former colleagues at AT&T Labs-Research.
|
|
Colloquium Speaker - Adam Rothman
Date: April 05, 2010 - IMU Maple Room 3:00 PM - 4:00 PM This talk will present some methods and asymptotic theories that we have developed for sparse estimation of the covariance matrix and the inverse covariance (concentration) matrix in high-dimensional settings. An estimate of the covariance matrix or its inverse is needed for classification by discriminant analysis, Gaussian graphical model inference, and principal components analysis. We highlight two methods that are invariant to the ordering of the variables, and for both, we obtain explicit convergence rates in matrix norms that show the trade-off between the sparsity of the true model, dimension, and the sample size. These sparse covariance estimators are compared to other estimators on simulated data and on data examples from gene microarray experiments. If time permits, we will discuss covariance estimators that exploit variable ordering.
|
|
Colloquium Speaker - Uschi Mueller
Date: March 29, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM My talk will focus on linear and nonlinear regression, with a response variable that is allowed to be “missing at random”. My only structural assumptions on the distribution of the variables are that the errors have mean zero and are independent of the covariates. The independence assumption is important: it enables us to construct easy-to-implement estimators for expectations of the joint distribution, and estimators for the response density that use all the data. This is in contrast to the usual local smoothing techniques and therefore permits a faster rate of convergence. The idea is to write the quantities of interest as integrals, which can be estimated by empirical versions, with a weighted residual based kernel estimator plugged in for the error density. For an appropriate class of regression functions and a suitably chosen bandwidth the proposed estimators are consistent and converge with the optimal parametric rate n^{1/2}. Moreover, the estimators are proved to be efficient (in the sense of Hájek and Le Cam) if an efficient estimator for the regression parameter is used.
|
|
Colloquium Speaker - Yoon Lee
Date: March 22, 2010 - IMU Walnut Room 3:00 PM - 4:00PM Classification is an important statistical problem with a wide range of applications. A variety of statistical tools have been developed for learning a classification rule from data. Understanding of their relative merits and comparisons would help users to choose a proper method in practice. This talk focuses on comparison of model-based classification methods in statistics with algorithmic methods in machine learning in terms of the error rate. Extending Efron’s comparison of logistic regression with linear discriminant analysis (LDA) under the normal setting, we contrast the support vector machine and boosting with LDA and logistic regression and study their relative efficiency. The limiting behavior of the classification boundary given by each method determines the efficiency. In addition to the theoretical study, we carry out numerical experiments for more comprehensive comparison of the methods under different settings than the normal setting.
|
|
Colloquium Speaker - Marlos Viana
Date: March 08, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM Studies that elicit experimental data requiring visual perception of symmetry often depend (implicitly) on the encoding of planar space orientation in terms of up-down and left-right, which, of course, is arbitrary. In this seminar I will apply the methods of symmetry studies to formulate the statistical summaries of data that resolve that implicit arbitrariness and invite the discussion of their potential interpretations.
|
|
Colloquium Speaker - Chen Yu
Date: March 01, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM Data-driven knowledge discovery is becoming a new trend in various scientific fields. In this talk, I will introduce a novel framework to study one intriguing topic in cognitive and behavioral studies -- multimodal communication in human-human and human-robot interaction. We’ve proposed and developed an overall solution in this new data-driven paradigm, from data capture, to data coding and validation, and to data analysis and visualization. In data collection, we have developed a multimodal sensing system to collect fine-grained video, audio and body movement data. Next, how to automatically and effectively discover new knowledge from rich multimedia data poses a challenge, as most state-of-the-art data mining techniques can only search and extract pre-defined patterns from complex heterogeneous data. I will present two research lines to address this challenge. First, we propose a visual data mining approach that allows us to use data mining as a first pass, and then forms a closed loop of visual analysis of current results followed by more data mining work inspired by visualization, the results of which can be in turn visualized and lead to the next round of visual exploration and analysis. In this way, new insights and hypotheses gleaned from the raw data and the current level of analysis can contribute to further analysis. This idea is implemented by a visualization system through which we can explore and query multi-stream time series derived from raw multimedia data. The second research effort is to view those time series as generated by nonlinear dynamical systems. Without knowing explicit dynamical equations, we attempt to capture the temporal dynamics encoded in the time series by quantifying information flows between multi-stream time series based on symbol dynamics and information-theoretic measures.
We suggest that this data-driven paradigm will not only lead to new discoveries in understanding multimodal communication but also more generally serve as a successful case study to demonstrate the promise of data-intensive discovery which can be applied in various research topics in cognitive and behavioral studies.
|
|
Colloquium Speaker - Joon Park
Date: February 22, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM We derive the asymptotics of the maximum likelihood estimators for diffusion models. The models considered in the paper are very general, including both stationary and nonstationary diffusions. For such a broad class of diffusion models, we establish the consistency and find the limit distributions of the exact maximum likelihood estimator, and also the quasi and approximate maximum likelihood estimators based on various versions of approximated transition densities. Our asymptotics are two dimensional, allowing the sampling interval to decrease as well as the time span of sample to increase. The two dimensional asymptotics provide a unifying framework for the development of statistical theories for the stationary and nonstationary diffusion model. More importantly, they yield the asymptotic expansions that are very useful to analyze the exact, quasi and approximate maximum likelihood estimators of the diffusion models, if the samples are collected at high frequency intervals over modest lengths of sampling horizons as in the case of many practical applications.
|
|
Colloquium Speaker - Kim Huynh
Date: February 08, 2010 - IMU Persimmon Room 4:00 PM - 5:00 PM This paper investigates the evolution of firm distributions for entrant manufacturing firms in Canada using functional principal components analysis. This method is nonparametric, describes the dynamic of marginal densities and illustrates the efficacy of functional principal components to analyze firm distributions. We modify the Kneip and Utikal (2001) method to allow for the inclusion of qualitative information in the form of discrete variables such as industry and region. The results indicate that there are substantial differences in the dynamics of firm size, labour productivity, and leverage distributions between two cohorts. We also utilize a bootstrap test for the null hypothesis that the distributions are time-invariant. The null hypothesis is rejected for all variables and both cohorts when ignoring the qualitative information. However when accounting for industry and regional effects, acceptance occurs for size and leverage, while rejection occurs for labour productivity in both cohorts. These results show the importance of including qualitative information when applying functional principal component analysis to account for potential heterogeneity.
|
|
Colloquium Speaker - Alessandro Vespignani
Date: January 25, 2010 - IMU Walnut Room 3:00 PM - 4:00 PM The crucial issue when planning for adequate public health interventions to mitigate the spread and impact of infectious diseases is risk evaluation and forecast. This amounts to the anticipation of where, when and how strongly an epidemic will strike. In the last decade, advances in computing paradigms and in data acquisition and analysis have allowed the generation of sophisticated simulations on supercomputer infrastructures to statistically anticipate the spreading pattern of a pandemic. For the first time we are in the position of generating real-time forecasts of epidemic spreading. In the present lecture I will briefly review the history and facts of the H1N1 pandemic, the major roadblocks the community has faced in its containment and mitigation, and how statistics and computing have allowed the development of novel predictive technologies that help us to battle epidemics.
|
|
Colloquium - Speaker Yoosoon Chang Date: November 30, 2009 - IMU Persimmon Room 3 PM - 4 PM This paper develops a new framework and tools to reexamine Fama-French regressions. For Fama-French portfolios, we consider a continuous-time factor model with a specific error component structure implied by the underlying asset pricing theory. The model is then analyzed as a continuous-time multivariate regression with a general martingale differential error. Our framework is broad enough to accommodate some of the important common features of the errors in this type of regression. In particular, we allow for time-varying or stochastic volatilities that are persistent and have strong leverage effects. It is well known that such nonstandard features would make the standard inferential procedure invalid. We overcome this difficulty by using samples collected at random intervals, which are set by a clock running inversely to the market volatility, instead of samples collected at fixed intervals such as monthly or yearly. Under our sampling scheme, Fama-French regressions may simply be regarded as the classical regressions having normal errors with variance given by the averaged quadratic variation of the martingale differential error. Various tests, which have been used to evaluate Fama-French factors, are extended and evaluated in the paper.
|
|
Colloquium Speaker - Zaichao Du Date: November 16, 2009 - IMU Persimmon Room 3 PM - 4 PM In this paper, we propose a modified Box-Pierce test for conditional goodness-of-fit. Our method is based on the fact that, under the correct specification of the conditional distribution, the generalized errors obtained after the probability integral transformation are iid U[0,1]. Our test explicitly takes into account the parameter estimation effect; as a result, it has a convenient standard chi-square limit distribution. Our test is applicable to a wide class of models, including but not limited to ARMA-GARCH models, the Hansen (1994) skewed t model and autoregressive conditional duration models. A simulation study shows that our test has satisfactory size and power performance. An empirical application to the Hang Seng Index data highlights the merits of the proposed test.
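The probability integral transformation mentioned in this abstract can be illustrated with a small Python sketch. This is not the modified Box-Pierce test from the talk; it only shows, on a simulated Gaussian AR(1) series (an assumed example model with assumed parameters), that the transformed values behave like iid U[0,1] draws when the conditional distribution is correctly specified, and show serial dependence when it is not. All function names are hypothetical.

```python
import math
import random


def normal_cdf(x, mean=0.0, sd=1.0):
    """Normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def simulate_ar1(n, phi, sigma, seed):
    """Simulate y_t = phi * y_{t-1} + e_t with e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n):
        series.append(phi * series[-1] + rng.gauss(0.0, sigma))
    return series


def pit(observations, cond_means, cond_sds):
    """Evaluate each observation under its conditional normal CDF."""
    return [normal_cdf(y, m, s)
            for y, m, s in zip(observations, cond_means, cond_sds)]


def lag1_autocorrelation(values):
    """Sample autocorrelation at lag 1."""
    n = len(values)
    mean = sum(values) / n
    num = sum((values[i] - mean) * (values[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


series = simulate_ar1(5000, phi=0.6, sigma=1.0, seed=1)
observed, lagged = series[1:], series[:-1]

# Correct specification: conditional mean 0.6 * y_{t-1}, conditional sd 1.
u_correct = pit(observed, [0.6 * y for y in lagged], [1.0] * len(observed))
# Misspecification: the dependence on y_{t-1} is ignored.
u_wrong = pit(observed, [0.0] * len(observed), [1.0] * len(observed))

mean_correct = sum(u_correct) / len(u_correct)
acf_correct = lag1_autocorrelation(u_correct)
acf_wrong = lag1_autocorrelation(u_wrong)

print(f"correct model:      mean={mean_correct:.3f}, lag-1 acf={acf_correct:.3f}")
print(f"misspecified model: lag-1 acf={acf_wrong:.3f}")
```

Under the correct model the mean is close to 0.5 and the lag-1 autocorrelation is close to zero, while the misspecified model leaves clear autocorrelation in the transformed values, which is the kind of departure a Box-Pierce-type statistic is designed to detect.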
|
|
Colloquium - Speaker Tao Shi Date: October 26, 2009 - IMU State Room East 3pm - 4pm In this talk, we focus on obtaining clustering information in a distribution when iid data are given. First, we develop theoretical results for understanding and using clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function (with a sufficiently fast tail decay). We study which eigenvectors should be used and when the clustering information for the distribution can be recovered from the data. Second, we use heuristics from these analyses to design the Data Spectroscopic clustering (DaSpec) algorithm. Our findings not only extend and go beyond the intuitions underlying existing spectral techniques (e.g. spectral clustering and Kernel Principal Components Analysis), but also provide insights about their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the promise of our proposed data spectroscopy clustering algorithm relative to k-means and one spectral method. In particular, DaSpec seems to be able to handle unbalanced groups and recover clusters of different shapes better than competing methods. This is joint work with Prof. Mikhail Belkin (Ohio State University) and Prof. Bin Yu (University of California, Berkeley).
|
|
Colloquium Speaker - Shankar Bhamidi Date: October 12, 2009 - IMU Persimmon Room 3 PM - 4 PM The last few years have seen an explosion in the amount of data on many real world networks. This has resulted in an interdisciplinary effort in formulating models to understand the data. On the theory side, we shall look at how powerful techniques in modern probability theory can be used in this context via the following three problems:
1. Reconstruction of routing trees: In a number of problems that arise from trying to discover the underlying structure of the Internet, it is often impossible to take direct measurements at the routers. We shall describe progress in trying to reconstruct the "Multicast" tree exactly using only "end-to-end" measurements. Surprisingly, using deep facts from Phylogenetics, we show that this can be done using very few samples.
2. MCMC simulation of exponential random graphs: Exponential random graphs are one of the most used models in social network theory. The basic idea is the following: In social networks we see more triangles, cliques, etc. than we would expect in a random graph, basically because if A is a friend of B and A is a friend of C then it is quite likely that B and C are friends. One way to model such a phenomenon is to attach, for every graph G, a Hamiltonian given by, say,
H(G) = β#E(G) + γ#T(G)
where E(G) and T(G) are the number of edges and triangles respectively and then looking at the Gibbs distribution induced by this Hamiltonian. Simulating from these models is of paramount interest.
Using the modern-day theory of Markov chains we find, in the ferromagnetic setup, exactly when one can simulate from this model efficiently and when it would take exponentially long to simulate from this model.
3. Spectral distribution of adjacency matrices: How good is the spectral distribution of the adjacency matrix of a network in estimating key features of the network? Given two samples of networks can we always tell them apart just by looking at their spectral distribution? These sorts of questions lead us into analyzing the asymptotics of the spectral distribution of a number of random tree models.
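As a rough illustration of the Hamiltonian in item 2, the sketch below counts the edges and triangles of a small undirected graph and performs one Metropolis update targeting a Gibbs distribution taken to be proportional to exp(H(G)). The function names, the edge-set representation, the sign convention, and the single-edge-flip proposal are all assumptions made for illustration; the abstract does not specify the chain analyzed in the talk.

```python
import itertools
import math
import random


def hamiltonian(n, edges, beta, gamma):
    """H(G) = beta * #E(G) + gamma * #T(G) for an undirected graph on nodes 0..n-1.

    `edges` is a set of frozensets {i, j}; this representation is an assumption.
    """
    num_edges = len(edges)
    num_triangles = sum(
        1
        for a, b, c in itertools.combinations(range(n), 3)
        if frozenset((a, b)) in edges
        and frozenset((b, c)) in edges
        and frozenset((a, c)) in edges
    )
    return beta * num_edges + gamma * num_triangles


def metropolis_step(n, edges, beta, gamma, rng=random):
    """Propose toggling one uniformly chosen pair; accept with the Metropolis ratio."""
    i, j = rng.sample(range(n), 2)
    proposal = set(edges)
    proposal.symmetric_difference_update({frozenset((i, j))})
    delta = hamiltonian(n, proposal, beta, gamma) - hamiltonian(n, edges, beta, gamma)
    if delta >= 0 or rng.random() < math.exp(delta):
        return proposal
    return set(edges)


# A single triangle has 3 edges and 1 triangle.
triangle = {frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))}
h_triangle = hamiltonian(3, triangle, beta=0.5, gamma=2.0)
```

Recomputing the full Hamiltonian at every step is wasteful but keeps the sketch short; a practical sampler would update the edge and triangle counts incrementally.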
|
|
Colloquium Speaker - Daniel Crichton Date: September 28, 2009 - IMU Persimmon Room 3 PM - 4 PM Modern science research is requiring collaboration amongst geographically distributed scientists and their data assets. Because of this, a new era of scientific discovery exists in which science data is shared and validated across research institutions. More than ever, computing infrastructures that span these multiple institutions must work together to support collaborative research. At NASA’s Jet Propulsion Laboratory, we have been working in multiple disciplines to support the movement towards highly distributed scientific data systems that span, end-to-end, the data pipeline from instruments all the way to analysis. JPL developed a software technology called the Object Oriented Data Technology (OODT) framework that provides building blocks for construction of distributed data systems, addressing the challenges of the distributed, data-intensive domain. OODT has been successfully infused into planetary science, earth science and cancer research programs. This talk will explore the commonalities of developing data system infrastructures in these disciplines as well as our experiences, results and plans for data systems in the future.
|
|
Colloquium Speaker Alan Karr Date: April 13, 2009 - IMU Walnut Room - 3 PM -4 PM Government agencies and businesses face a multitude of tensions between protecting confidentiality of their data (for legal, quality and other reasons) and allowing legitimate uses of the data (for policy, research or other purposes). The technical problems involve the statistical, mathematical and computational sciences, as well as domain science, all immersed in difficult legal and societal issues.
In this talk, I will outline a decision-theoretic formulation of data confidentiality problems as tradeoffs between quantified measures of data disclosure risk and data utility. Then, I will focus on two problems lying at the intersection of statistics and computer science. The first is methods and systems for secure, principled statistical analysis of distributed data. The second is verification servers, which give users of publicly released data that have been altered to protect confidentiality information about the fidelity of their analyses as compared to analyses of the original data.
|
|
Colloquium Speaker David Marchette Date: April 06, 2009 - IMU Walnut Room - 3 PM -4 PM Implicit translation is the association of documents in different languages which are on the same topic, in the absence of a translation dictionary. The multilingual Wikipediae will be used as a test-bed to explore some techniques based on the embeddings of the Wikipedia graphs. I will discuss several standard graph embeddings, and a novel random graph embedding. These are applied to a subset of the English and French Wikipediae, showing that using the graph information alone is sufficient to obtain good results. The incorporation of the language information via word-count histograms or related bag-of-words approaches, in conjunction with the graph information, will be discussed briefly.
|
|
Flury Lecture - Guest Speaker David Scott Date: March 03, 2009 - Swain East 140 4:00 PM - 5:00 PM Modern science relies on ever more complex models to understand data. Presenting the confidence of model predictions is a grand challenge. Faced with potentially hundreds or thousands of parameters, scientists often perform sensitivity analyses in order to assess the robustness of model predictions.
Such one-at-a-time calculations are useful but limited. Visualization techniques can provide a fuller picture, but the availability of immersive technologies is still expensive and not commonplace. We examine some simple data and discuss the presentation of uncertainty. Avenues for research are described.
|
|
Colloquium Speaker Haimeng Zhang Date: February 23, 2009 - IMU Persimmon Room 3 PM - 4 PM The Cox hazard regression model is a popular method used in epidemiological research to quantify the effects of prognostic factors on survival for a cohort of individuals followed over time. In practice, it is often difficult or expensive to collect complete data when dealing with large cohorts. Therefore, a number of practical sampling designs have been put forward. However, it is not always clear that the estimators in those designs use the given sampled data in the most efficient manner. In this presentation, I will discuss the asymptotic efficiency of the estimators from two popular sampling designs: case-cohort and nested case-control sampling. In addition, by comparing the theoretical lower bound with the limiting distribution of the estimators, I will indicate in what instances the estimators achieve the lower bound, and what situations make for large efficiency losses.
|
|
Colloquium Speaker Jim Koehler Date: February 09, 2009 - IMU Persimmon Room 3:00 PM - 4:00 PM I'll provide a glimpse into the hidden life of statisticians inside of Google. I'll start by providing a general overview of Google's advertising system and the variety of roles for statisticians. Then I'll describe some specific business problems and how statistical methods contribute to their solutions. Finally, I'll introduce Google's efforts to partner with universities through the Google Online Marketing Challenge (student competition) and the Google University Research Awards.
|
|
Colloquium Guest Speaker George Mohler Date: December 01, 2008 - IMU Maple Room 3:00 - 4:00 PM Self-exciting spatial point process methods are well established in fields such as seismology, where the occurrence of an event increases the likelihood of another event nearby in space and time (in the case of earthquakes, aftershocks often follow a large event). This self-exciting behavior gives rise to particular types of data clustering and, surprisingly, such clustering is also observed in crime data (as it turns out, burglars will often return to the same house, or a house nearby, shortly after a prior offense and commit another burglary).
In this talk we show how self-exciting point processes can be used for the purposes of crime pattern modeling, simulation, and forecasting. We will first discuss the application of standard point process models, which involves background intensity estimation, maximum likelihood estimation of parameters, and model evaluation using tests for clustering. Next we will show how behavioral dynamics, present in recent agent-based models of crime, can be incorporated into the point process framework. For this purpose we use state-dependent stochastic differential equations, which can be viewed as generalizations of kernel-based models. We conclude by discussing several practical applications.
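To make the self-exciting idea concrete, here is a minimal sketch of a purely temporal conditional intensity with an exponential triggering kernel, λ(t) = μ + Σ α·exp(−ω(t − t_i)), where the sum runs over past events t_i < t. The kernel form, the parameter names `mu`, `alpha`, and `omega`, and the event times are assumptions chosen for illustration; the talk concerns space-time models whose exact form the abstract does not give.

```python
import math


def conditional_intensity(t, event_times, mu, alpha, omega):
    """Background rate mu plus an exponentially decaying boost from each past event."""
    return mu + sum(
        alpha * math.exp(-omega * (t - t_i)) for t_i in event_times if t_i < t
    )


past_events = [1.0, 1.5]  # hypothetical event times, in days
rate_soon_after = conditional_intensity(2.0, past_events, mu=0.1, alpha=0.5, omega=1.0)
rate_much_later = conditional_intensity(30.0, past_events, mu=0.1, alpha=0.5, omega=1.0)
```

Shortly after the two events the rate is well above the background level, and it decays back toward `mu` as time passes, which is the clustering behavior described above.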
|
|
Colloquium - Speaker Guilherme Rocha Date: November 17, 2008 - IMU Maple Room 3:00 - 4:00 PM A 51-node Wireless Sensor Network has been installed on the Golden Gate Bridge for Structural Health Monitoring. There is a mismatch between the rate at which the data are collected at each node (~4 Kbps, kilo-bytes/second) and the rate at which they can be transmitted (~0.5 Kbps). The ultimate goal is to develop a data reduction scheme so the WSN can perform real time monitoring of the dynamic properties of the bridge. For the temperature data (~0.80 Kbps/sensor), lossless run length coding achieves a significant reduction in the data rate (~0.04 Kbps/ sensor). For the acceleration data, our strategy is to construct a restricted parametric model for the bridge and continuously adjust it as data become available. The restrictions applied to the model reflect both physical considerations and communication constraints. We report the results of such strategies on simulated data sets. Joint work with David Culler, James Demmel, Gregory Fenves, Sukum Kim, Shamim Pakzad, and Bin Yu.
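The lossless run-length coding mentioned for the temperature data can be illustrated with a generic sketch. The function names and the sample readings are hypothetical, and the abstract does not describe the exact coder used on the bridge.

```python
def run_length_encode(samples):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in samples:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Invert run_length_encode exactly, so no information is lost."""
    return [value for value, count in runs for _ in range(count)]


readings = [15, 15, 15, 15, 16, 16, 15, 15, 15]  # hypothetical slowly varying readings
encoded = run_length_encode(readings)
decoded = run_length_decode(encoded)
```

The scheme pays off only when the signal changes slowly relative to the sampling rate, which is plausible for temperature but not for acceleration data.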
|
|
Colloquium Guest Speaker Chuan Goh Date: November 03, 2008 - IMU Maple Room 12:00 - 1:00 PM This paper proposes a test for the correct specification of a dynamic time-series model that is taken to be stationary about a deterministic linear trend function with no more than a finite number of discontinuities in the vector of trend coefficients. The test avoids the consideration of explicit alternatives to the null of trend stability. The proposal also does not involve the detailed modelling of the data-generating process of the stochastic component, which is simply assumed to satisfy a certain strong invariance principle for weakly dependent processes. As such, the resulting inference procedure is effectively an omnibus specification test for segmented linear trend stationarity. The test is of Wald-type, and is based on an asymptotically linear estimator of the vector of total-variation norms of the trend parameters whose influence function coincides with the efficient influence function.
Simulations illustrate the utility of this procedure to detect discrete breaks or continuous variation in the trend parameter as well as alternatives where the trend coefficients change randomly each period. This paper also includes an application examining the adequacy of a linear trend-stationary specification with infrequent trend breaks for the historical evolution of U.S. real output.
|
|
Colloquium - Speaker Karen Kafadar Date: October 20, 2008 - IMU Maple Room 3:00 - 4:00 PM Microarray technology has made available large data sets that can provide information on gene expression when cells are subjected to various treatments. Before proceeding with a formal statistical analysis, many biological and procedural aspects should be considered. These aspects may guide the analysis and subsequent statistical inference. Several of these issues are discussed in connection with the analysis of oligonucleotide and cDNA microarray experiments. The particular focus in this article is on effects caused by the cDNA slide manufacturing process, appropriate transformations of the data, and on adjustments for background. A prescription for the analysis of microarray data is proposed and demonstrated using data from a cDNA experiment comparing the genetic expressions in two mouse cell lines; a candidate set of genes is identified for further study. The prescription may be modified for oligonucleotide microarray data.
|
|
Colloquium - Speaker Chunfeng Huang Date: October 06, 2008 - IMU Walnut Room - 12 PM -1 PM In the study of isotropic intrinsically stationary spatial processes, a new nonparametric variogram estimator is proposed through its spectral representation. The spectrum estimation is formulated in terms of solving a regularized inverse problem. A numerical implementation is presented through quadratic programming. We demonstrate our method in a simulation study and a dataset of temperature changes over America.
|
|
Colloquium - Speaker Steen Andersson Date: September 15, 2008 - IMU Persimmon Room 12 PM - 1 PM Classical Wishart distributions on the open convex cones of positive definite matrices and their fundamental features are extended to generalized Riesz and Wishart distributions associated with decomposable undirected graphs using the basic theory of exponential families. The families of these distributions are parameterized by their expectations/natural parameter and multivariate shape parameter and have a non-trivial overlap with the generalized Wishart distributions defined in Andersson and Wojnar (2004a,b). This work also extends the Wishart distributions of type I in Letac and Massam (2007) and, more importantly, presents an alternative point of view on the latter paper.
|
|
Fall 2008 Courses Date: May 20, 2008 - A schedule of classes is now available for the Fall 2008 semester.
|
|
Colloquium Guest Speaker Douglas Steinley Date: April 28, 2008 - IMU - Persimmon Room 12pm - 1pm A variance-to-range ratio variable weighting procedure is proposed. The method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the weighting technique. The performance of these procedures is compared to existing methods in the literature.
|
|
Colloquium Guest Speaker Richard Charnigo Date: April 24, 2008 - IMU - Walnut Room 4 pm - 5 pm Greater epidemiologic understanding of the relationships among fetal-infant mortality and its prognostic factors, including birthweight, could have vast public health implications. A key step toward that understanding is a realistic and tractable framework for analyzing birthweight distributions and heterogeneity in fetal-infant mortality. We propose describing a birthweight distribution using a normal mixture model in which the number of components is determined from the data, then estimating birthweight-specific mortality curves within each component of the normal mixture. We emphasize both methodological issues (e.g., How should the number of components be determined?) and interpretive issues (e.g., What do the components represent?). Data from the National Center for Health Statistics Public-Use Perinatal Mortality Data Files are used to compare our analytic framework to existing frameworks as well as to assess the reproducibility across repeated sampling of results obtained through our framework. (This talk is based on work with Lorie Wayne Chesnut, Tony LoBianco, and Russell S. Kirby.)
|
|
Colloquium Guest Speaker Michael Carbon Date: April 07, 2008 - IMU - Walnut Room 12 pm - 1 pm The purpose of this talk is to investigate the Frequency Polygon as a density estimator for stationary random fields indexed by multidimensional lattice points in space. Optimal binwidths which asymptotically minimize integrated mean square errors (IMSE) are derived. Under weak conditions, frequency polygons achieve the optimal uniform rate of convergence. Rates of the a.s. convergence are given too. Finally, asymptotic normality of the frequency polygon estimator is established.
|
|
Colloquium Guest Speaker Zongwu Cai Date: March 20, 2008 - TBA - 4:00 pm - 5:00 pm Motivated by forecasting the inflation rate through nonstationary variables and efficient tests of stock return predictability as well as forecasts of the equity premium, this talk will focus on how to use nonparametric or semiparametric regression techniques to analyze nonstationary time series data. Development of a nonparametric approach to estimate the functionals will be discussed, as well as how the consistency and asymptotic normality of the proposed estimators are obtained. The asymptotic results have shown that the asymptotic bias is the same for all estimators of functionals, but that the convergence rates are totally different for stationary and nonstationary covariates. These findings seem innovative in the literature.
|
|
Colloquium Guest Speaker Michael Levine Date: March 03, 2008 - IMU Walnut Room 12pm-1pm We consider a new separable nonparametric volatility model that allows for “interactions” in both the mean and the conditional variance (volatility) functions. It can be concisely described as an additive-interactive nonlinear ARCH model. We propose this model as a possible alternative to the generalized additive nonlinear ARCH (GANARCH) model of Kim and Linton (2004), with which it shares a common origin. Unlike the GANARCH model, it does not assume a known link function but instead includes second-order interaction terms in both the mean and variance functions. This makes the model much more data-driven than the GANARCH model of Kim and Linton (2004), since we do not assume that anything is known about the data distribution. This is very beneficial since, in practice, the data distribution has to be selected based on exploratory data analysis, which is very difficult for multivariate data. Thus, the proposed model is much more flexible than GANARCH.
Motivated by the local instrumental variable estimation method (LIVE), also introduced in Kim and Linton (2004), we propose instrumental variable-based estimators of the components of the mean and volatility functions. The estimators are shown to be consistent and asymptotically normal. Explicit expressions for asymptotic means and variances of these estimators are also obtained. Several simulation experiments are conducted that show a very good performance of our algorithm for moderate sample sizes. Finally, the method is applied to the real data set of currency exchange rates where it leads to some interesting conclusions.
Historically, multiple functional component testing in nonparametric models has been a fairly difficult problem. We introduce a novel F-type approach to testing the significance of the two-way interactive terms in the mean function based on the unbalanced design ANOVA with unequal variances. Simulation studies show that the method performs very well for sample sizes of about 5000, which are easily available in financial applications.
|
|
Colloquium Guest Speaker Joanne Peng Date: February 25, 2008 - IMU Walnut Room 12pm-1pm For the past 25 years, advances have been made in missing data methods. Most published work has focused on missing data in the dependent variable under various conditions. The present study sought to fill the void by comparing two approaches for handling missing data in categorical covariates in logistic regression. These two approaches were the EM method of weights (EM) and multiple imputation (MI).
Sample data were first drawn randomly from a population with known characteristics. Missing data on covariates were subsequently simulated under two conditions: missing completely at random and missing at random with different missing rates. A logistic regression model was fit to each sample using either the EM or the MI approach. The performance of these two approaches was assessed on four criteria: bias, efficiency, coverage, and rejection rate.
Results generally favored MI over EM. Practical issues such as implementation, inclusion of continuous covariates, and interactions between covariates were discussed.
|
|
Colloquium Guest Speaker Vivekananda Roy Date: February 14, 2008 - IMU Walnut Room 4 pm - 5 pm We study Markov chain Monte Carlo algorithms for exploring the intractable posterior density that results when a probit regression likelihood is combined with a flat prior on the regression coefficient. We prove that the data augmentation algorithm of Albert and Chib (1993) and the PX-DA algorithm of Liu and Wu (1999) both converge at a geometric rate, which ensures the existence of central limit theorems (CLTs) for ergodic averages under a second moment condition. While these two algorithms are essentially equivalent in terms of computational complexity, we show that the PX-DA algorithm is theoretically more efficient in the sense that the asymptotic variance in the CLT under the PX-DA algorithm is no larger than that under Albert and Chib's algorithm. A simple, consistent estimator of the asymptotic variance in the CLT is constructed using regeneration. As an illustration, we apply our results to the lupus data from van Dyk and Meng (2001). In this particular example, the estimated asymptotic relative efficiency of the PX-DA algorithm with respect to Albert and Chib's algorithm is about 65, which demonstrates that huge gains in efficiency are possible by using the PX-DA algorithm.
|
|
Colloquium Guest Speaker Jien Chen Date: February 04, 2008 - IMU Walnut Room 12pm-1pm As a non-parametric method, Empirical Likelihood (EL) has been attracting serious attention from researchers in statistics, econometrics, engineering and biostatistics. By defining the estimation equations in EL appropriately, we can extend EL to various data settings, especially those in which parametric likelihoods are absent. In this talk, I will provide two examples of such extensions: quantile estimation and longitudinal data analysis. Quantile estimation for discrete data analysis has not been well studied. For a given 0 < p < 1, the commonly used sample quantile may or may not be consistent for the pth quantile, depending on whether or not the underlying distribution has a plateau at the level of p. I propose an EL-based categorization procedure that not only helps determine the shape of the true distribution at level p, but also provides a way of formulating a new estimator that is consistent in any case. For non-Gaussian longitudinal data, generalized estimating equations (GEE) are a popular class of marginal models. While the GEE estimator is consistent and robust, it may suffer significant loss of efficiency if the working correlation structure is misspecified. I consider the use of EL to select working correlations for GEE models, for which parametric likelihoods are absent and quasi-likelihoods are difficult to construct.
|
|
Colloquium Guest Speaker Brian Reich Date: January 31, 2008 - IMU Walnut Room 4 pm - 5 pm Storm surge, the onshore rush of sea water caused by the high winds and low pressure associated with a hurricane, can compound the effects of inland flooding caused by rainfall, leading to loss of property and loss of life for residents of coastal areas. Numerical ocean models are essential for creating storm surge forecasts for coastal areas. These models are driven primarily by the surface wind forcings. Currently, the gridded wind fields used by ocean models are specified by deterministic formulas that are based on the central pressure and location of the storm center. While these equations incorporate important physical knowledge about the structure of hurricane surface wind fields, they cannot always capture the asymmetric and dynamic nature of a hurricane. A new Bayesian multivariate spatial statistical modeling framework is introduced combining data with physical knowledge about the wind fields to improve the estimation of the wind vectors. Many spatial models assume the data follow a Gaussian distribution. However, this may be overly-restrictive for wind fields data which often display erratic behavior, such as sudden changes in time or space. In this paper we develop a semiparametric multivariate spatial model for these data. Our model builds on the stick-breaking prior, which is frequently used in Bayesian modeling to capture uncertainty in the parametric form of an outcome. The stick-breaking prior is extended to the spatial setting by assigning each location a different, unknown distribution, and smoothing the distributions in space with a series of kernel functions. This semiparametric spatial model is shown to improve prediction compared to usual Bayesian Kriging methods for the wind field of Hurricane Ivan.
|
|
Colloquium Guest Speaker Guilherme Rocha Date: January 28, 2008 - IMU Maple Room - 12pm - 1pm Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task. Quasi-norms on model parameters are frequently used as a penalty. Classical examples are AIC and BIC where the L0 quasi-norm (model dimension) is used as a penalty.
More recently, penalization by the L1-norm (lasso) has enjoyed a lot of attention. L1-penalized estimates are cheaper to compute (convex optimization) and lead to more stable model estimates than their L0 counterparts.
In this talk, I will present the Composite Absolute Penalties (CAP) family of penalties. CAP penalties allow given grouping and hierarchical relationships between the predictors to be expressed. They are built by defining groups of variables and combining the properties of norm penalties at the across group and within group levels. Grouped selection occurs for non-overlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns.
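One way to read "combining the properties of norm penalties at the across group and within group levels" is as a sum, over groups, of a within-group norm raised to an across-group power. The sketch below implements that reading; the function name, argument names, and example coefficients are hypothetical, and the abstract itself does not state the formula, so treat the exact form as an assumption.

```python
def composite_penalty(coefs, groups, within_norms, across_power):
    """Sum over groups of (L_gamma_k norm of the group's coefficients) ** across_power."""
    total = 0.0
    for indices, gamma_k in zip(groups, within_norms):
        if gamma_k == float("inf"):
            # The L-infinity norm is the largest absolute coefficient in the group.
            group_norm = max(abs(coefs[i]) for i in indices)
        else:
            group_norm = sum(abs(coefs[i]) ** gamma_k for i in indices) ** (1.0 / gamma_k)
        total += group_norm ** across_power
    return total


beta = [3.0, -4.0, 0.5]
groups = [[0, 1], [2]]  # hypothetical non-overlapping grouping
penalty = composite_penalty(beta, groups, within_norms=[2, 2], across_power=1)
```

With L2 norms inside each group and power 1 across groups, as here, the expression reduces to an unweighted group-lasso-style penalty; overlapping entries in `groups` would be the route to the hierarchical selection mentioned above.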
Under easily verifiable assumptions, CAP penalties are convex: an attractive property from a computational stand-point. Within this subfamily, unbiased estimates of the degrees of freedom (df) exist so the regularization parameter is selected without cross-validation.
Simulation results show that CAP improves on the predictive performance of the LASSO for cases with p>>n and mis-specified groupings.
This is joint work with Peng Zhao and Bin Yu.
|
|
Colloquium Guest Speaker Junhui Wang Date: January 14, 2008 - IMU Maple Room - 12pm - 1pm Hierarchical classification is critical to knowledge and context management as well as knowledge exploration, as in gene function classification and discovery and document categorization. In hierarchical classification, an input is classified by a structured hierarchy. In such a situation, the central issue is how to effectively utilize inter-class relationships to improve on the generalization performance of flat classification, which ignores such dependency. In this talk, a novel large margin method based on constraints characterizing multi-path hierarchy is presented within the framework of regularization. In particular, I will discuss three aspects: (1) the idea and methodology development; (2) computational tools; (3) a statistical learning theory. Numerical examples will be provided to demonstrate the advantage of our proposed methodology against other existing competitors. An application to gene function prediction and discovery will be discussed.
|
|
Colloquium Guest Speaker Guang Cheng Date: January 10, 2008 - IMU Walnut Room 4 pm - 5 pm Semiparametric modeling is an excellent framework due to its flexibility to model some features parametrically without making assumptions on the other features. However, the infinite-dimensional nuisance parameter in the semiparametric models generally poses several challenges for making maximum likelihood inference for the parameter of interest at both theoretical and methodological levels. We will consider a series of profile likelihood based semiparametric inference procedures either based on numerical methods, i.e. K-step MLE, or through MCMC sampling, i.e. the Profile Sampler and the Penalized Profile Sampler. All the above profile likelihood based methods avoid evaluation of the infinite-dimensional operator and are easy to implement. Furthermore, we investigate their second order asymptotic behaviors, which are proven to be related to the convergence rate of the nuisance parameter and thus adjustable.
|
|
Guest Colloquium Speaker William Cleveland Date: November 26, 2007 - IMU Walnut Room 12pm-1pm Large, complex data sets are ubiquitous, the standard now rather than the exception. They present challenging problems of analysis because of their size and the complexity of their data structures and patterns. One approach is to compute summary statistics at the outset to reduce the complexity, but this expedient risks losing important information in the data. The goal should be lossless analysis: analyze the data at a level of detail and comprehensiveness that does not sacrifice information.
Achieving lossless analysis of complex data today is immensely challenging. New fundamental approaches and methods are needed for each of the different areas that come into play in the analysis of the data --- databases, data processing, data structures, statistical models and methods, machine learning algorithms, data visualization, computational algorithms, software environments, and hardware environments. In fact, it has never been harder to achieve lossless analysis because complexity has increased faster than our innovations in these areas.
Nothing serves lossless analysis better than data visualization, the only practical way to absorb large amounts of information in detail. But for today's complex sets we must visualize far larger amounts than in the past. We must be ready to accept large displays each covering tens or even hundreds of screensful (pages). For a single data set it is reasonable to have hundreds of such displays. These displays become a new database produced from the data that is queried and studied. For a display of 500 pages, we might query and study all or just a few of the pages depending on the task.
Producing, querying, and studying a visualization database needs new ideas. There are different modes of viewing the many pages and panels per page of a large display, from slow focused study to very rapid scans. We need creative interfaces to facilitate the different modes. We cannot fuss with very large displays, interacting with the micro-elements to get them right, because there is too much; instead there should be smart automation algorithms that get the large display right the first time. We must consider the physical screen space, its size and resolution, to make it work most effectively for the visual study. We need methods of display that result in pre-attentive visual formation of gestalts that show instantaneously the relevant patterns in the data. This necessitates, strangely, more displays, starting with broad brush looks to derivative displays whose redesigns show specific aspects of the broad brush more effectively. It also requires the study of visual perception.
|
|
Colloquium Guest Speaker Tonu Kollo Date: November 12, 2007 - IMU Walnut Room 12pm-1pm In the last ten years, remarkable development has occurred in the area of skewed multivariate distributions. The multivariate skew normal distribution was introduced in 1996 by Azzalini & Dalla Valle (Biometrika). Azzalini’s construction of the distribution proved very fruitful and was later successfully applied to many other elliptical families of distributions. A random vector X is skew normally distributed with shape parameter α (a p-vector) and scale parameter Σ (a positive definite p×p matrix) when the density function of X is twice the product of the density function of N(0,Σ) and the distribution function of the standard normal distribution, with the shape parameter appearing in the argument of the distribution function.
In the talk, basic properties of the skew normal family will be discussed and other commonly used families of skewed elliptical distributions will be examined (the multivariate skew t-distribution, for instance). With new families of distributions, new estimation and testing problems have arisen. Classical estimation methods often do not work: the maximum likelihood method can give wrong estimates, and the method of moments can be heavily biased.
Another type of skewed multivariate distribution is represented by the asymmetric Laplace distribution, which was carefully examined by T. Kozubowski in a series of papers at the end of the 1990s. In this case, we do not have an explicit expression for the density function, and estimation, testing and fitting problems have to be solved on the basis of the characteristic function. This distribution will also be considered in more detail.
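The verbal definition of the skew normal density given in this abstract can be written compactly. As a sketch, assuming the common parameterization in which the shape vector enters through the linear form α'x (the abstract does not spell out the exact argument), the density is

```latex
f_X(x) = 2\,\phi_p(x;\Sigma)\,\Phi\!\left(\alpha^{\top} x\right), \qquad x \in \mathbb{R}^p ,
```

where φ_p(·;Σ) denotes the N(0,Σ) density and Φ the standard normal distribution function. The factor 2 normalizes the product, and setting α = 0 recovers the N(0,Σ) density.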
|
|
Colloquium Guest Speaker Victor Goodman Date: November 05, 2007 - IMU Walnut Room 12pm-1pm Forward interest rates are simultaneously measured using up-to-the-minute bond trading quotes. The time variation of these rates determines a high-dimensional covariance matrix that might be used to model bond yields within a country's government bond market. Several principal component (PC) analyses, with 1989-92 data in the U.K. market, 1887-94 data in the U.S. market, and 2001-05 data in the U.S. market, reveal a striking pattern involving the first three eigenvectors of each covariance matrix. In this talk I describe the pattern and make the (well-known) case for using a three-factor Gaussian model to describe bond trading.
It is difficult to implement models based on PC estimates for the first eigenvector. An initial attempt to produce a model in 1988 ended in failure since the model had a financial inconsistency. A more recent model behaves better; its covariance has the desired eigenvector and the model is arbitrage-free. Surprisingly, the new model appears when we condition prices not to collapse in the old model. This suggests that an issue of survivorship may arise even in no-default bond markets.
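As a rough illustration of the kind of principal component analysis mentioned in this abstract, the following sketch builds a covariance matrix from rate changes and extracts its leading eigenvector by power iteration. Everything here is assumed for illustration: the data are simulated (one common shock plus noise), not market quotes, and the function names are hypothetical. It does not attempt to reproduce the pattern discussed in the talk.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observation rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j]
    return [[c / (n - 1) for c in row] for row in cov]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' M v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v


# Synthetic (assumed) rate changes: a common shock shared by all maturities plus noise.
random.seed(0)
n_maturities = 8
changes = []
for _ in range(500):
    common = random.gauss(0.0, 1.0)
    changes.append([common + 0.1 * random.gauss(0.0, 1.0) for _ in range(n_maturities)])

cov = covariance_matrix(changes)
eigenvalue, eigenvector = leading_eigenpair(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(n_maturities))
print(f"share of variance on the first component: {explained:.3f}")
```

Further eigenvectors could be obtained by deflating the covariance matrix and repeating the iteration; a numerical library would normally be used instead of hand-written loops.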
|
|
Colloquium Guest Speaker Jerome Busemeyer Date: October 29, 2007 - IMU Walnut Room 12pm-1pm Social and behavioral scientists face some of the same measurement problems that forced physicists to abandon classical probability theory. Their measurements are often incompatible, and the first measurement may disturb a second measurement. Thus only partial information about a complex system can be obtained at any point in time. Combining partial information about a system into a coherent understanding of the entire system is the hallmark of quantum theory. Quantum theory provides a fundamentally different approach to logic, reasoning, and probabilistic inference. For example, quantum logic does not always follow the distributive axiom of Boolean logic; quantum probabilities do not always obey the Kolmogorov law of total probability; quantum reasoning does not always obey the principle of monotonic reasoning.
For this talk, I will present a tutorial of the basic assumptions of classic versus quantum probability theories. These basic assumptions will be examined, side by side, in a parallel and elementary manner. Classic theory will emerge as a possibly overly restrictive case of the more general quantum theory. The fundamental implications of these contrasting assumptions for measurement in the social and behavioral sciences will be examined.
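The claim that quantum probabilities need not obey the law of total probability can be checked on a toy case. The sketch below is an assumption-laden illustration, not material from the talk: it uses a hypothetical two-dimensional real state and two hypothetical yes/no measurements, A and B, whose outcomes are unit vectors at chosen angles. It compares the probability of a B outcome measured directly with the probability obtained by measuring A first and summing over its outcomes.

```python
import math


def unit(angle):
    """Unit vector in the plane at the given angle (radians)."""
    return (math.cos(angle), math.sin(angle))


def squared_overlap(u, v):
    """Squared inner product of two real unit vectors (Born-rule probability)."""
    return (u[0] * v[0] + u[1] * v[1]) ** 2


# Hypothetical state and measurement outcomes, chosen only for illustration.
state = unit(math.pi / 8)
a_outcomes = [unit(0.0), unit(math.pi / 2)]  # the two outcomes of measurement A
b_yes = unit(math.pi / 4)                    # one outcome of measurement B

# Probability of the B outcome when B is measured directly.
direct = squared_overlap(state, b_yes)

# Probability of the same B outcome when A is measured first:
# sum over A outcomes of P(A = a) * P(B = yes | A = a).
via_a = sum(
    squared_overlap(state, a) * squared_overlap(a, b_yes) for a in a_outcomes
)

# A nonzero difference is the interference term absent from classical probability.
interference = direct - via_a
print(f"direct: {direct:.4f}, via A: {via_a:.4f}, difference: {interference:.4f}")
```

In a classical (Kolmogorov) model the two quantities would coincide; here they differ because the A and B measurements are incompatible, so measuring A first disturbs the state.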
|
|
Colloquium Guest Speaker Juan Carlos Escanciano Date: October 22, 2007 - IMU Redbud Room SPECIAL TIME: 11:30AM - 12:30PM A general method for testing the martingale difference hypothesis is proposed. The new tests are data-driven smooth tests based on the principal components of certain marked empirical processes that are asymptotically distribution-free, with critical values that are already tabulated. The smooth tests are shown to be optimal in a semiparametric sense discussed in the paper, and they are robust to conditional heteroscedasticity of unknown form. A simulation study shows that the data-driven smooth tests perform very well for a wide range of realistic alternatives and have more power than omnibus and other competing tests. Finally, two empirical examples highlight the merits of our approach.
|
|
Spring 2008 Courses Date: October 05, 2007 - Click here to view list of Spring 2008 Courses.
|
|
|