Statistical Modeling and Inference Group

1.     Genomic Signal Processing

2.      Statistical Modeling and Inference

3.      Signal and Data Compression

4.     Information Theoretic Methods


·         Prof. Ioan Tabus (tabus


·         Prof. Jaakko Astola (jta(at*) )

·         Prof. Jorma Rissanen (Jrrissanen(at*)

·         PhD. Bogdan Dumitrescu (bogdand(at*)

·         PhD. Ciprian Doru Giurcaneanu (cipriand(at*)

·         PhD. Andrea Hategan (andrea(at*)

·         MSc. Florin Ghido (florin.ghido(at*)

·         MSc. Antti Liski (antti.liski(at*)

·         MSc. Jenni Hukkanen (jenni.hukkanen(at*)

·         MSc. Seyed Alireza Razavi (alireza.razavi(at*)

·         MSc. Septimia Sarbu (septimia.sarbu(at*)

·         MSc. Alexandru Onose (alexandru.onose(at*)

·         MSc. Yinghua Yang (yinghua.yang(at*)


Former members:

·         PhD. Razvan Iordache(razvan.iordache(at*)

·         PhD. Adriana Vasilache(adriana.vasilache(at*)

·         PhD. Riitta Niemistö (riitta.niemisto(at*)

·         PhD. Gergely Korodi (gkorodi(at*)

·         PhD. Cristian Mircean (cristian.mircean(at*)


* replace with "@" - for spam reasons



Genome Signal Processing

·         uncovering gene expression regulatory mechanisms  from gene expression time series

·         prediction, diagnosis, and disease classification based on gene expression data

·         models of DNA sequences for compression and inference

Statistical Modeling for Signal and Data Compression

·         Lossless audio compression

·         Data compression algorithms based on tree models

·         Compression of biological sequences

The research group is in charge of the “Information Theoretic Methods” part of the Center of Excellence of Academy of Finland “Signal Processing Algorithm Groups”

Information Theoretic Methods in Signal Processing (Activity Report, October 2009)

The information theoretic methods in signal processing are the major routes towards systematic and principled selection of the structure of various models by properly balancing the ability of fitting data and the complexity of the models.

Our investigations were directed both to developing new methodologies in the family of information theoretic methods and to finding new application areas where the newly derived methods could reveal their potential. The applications of data compression to full genome compressions achieved the best existing results for the human genome. A clearly tangible use of data compression in bioinformatics is to represent DNA sequences in a compact form, which is optimized for mass storage or data transmission. Additionally, the uncovering of the most effective statistical models has a profound value in understanding the patterns and interrelationships present in the biological sequence. Another highlight of the results achieved is the lossless audio compression research, where we obtain the best existing combination of compression performance and encoding/decoding speed. The field of statistical inference based on information theory expands very quickly to all areas of science and engineering, with some of the trends captured by the two workshops co-organized by our group in 2008 and 2009 [826,832] and also with the notable contributions presented in the Shannon lecture presented by Jorma Rissanen [23].

1.1. Signal and Data Compression Based on Statistical Modelling
1.1.1. New Compression Methods for Sources with Memory

We derived exact and implementable solutions to the problem of universal coding using the normalized maximum likelihood model for the class of Markov sources. The case of the first order models was analyzed in [498] while the derivations for general order models and for additionally accounting for typical implementation constraints were presented in a book chapter [51] and in [690]. These exact evaluations can be used for example in inference about the underlying structure of the sequence, and for segmenting the haplotype data [573].

In 2008-2009 we utilized universal models with memory for the segmentation of the duplicated genes in genomes [691] showing that we can better infer the duplicated stretches of DNA if we model the approximate matching as a process with memory, with further results presented in [826].

1.1.2. Compression with Random Access to the Data Stream

One typical application is DNA compression, where besides a good compression performance one wants to provide random access to the data stream. We have shown ways to ensure random access for functional elements in DNA sequences, without decompressing the whole file in [372,497].

1.1.3. Short String Compression Using Optimal Tree Machines

We provided a general methodology for building optimal tree machines for sources with memory by using a pruning strategy for the complete tree machine derived from a training set. We then combined the optimal resulting model with our previous Tunstall coding for sources with memory [371].

In 2008-2009 we have applied the optimal tree machine construction as a basic tool for inference for language identification of individual words in a multilingual automatic speech recognition system, which was presented in a conference [734].
1.1.4. Lossless Signal Compression of Audio, ECG and Images

We provided new solutions to several fundamental problems in lossless compression: optimal interaction between the predictor and the quantizer (in forward, or frame based, coding)[352,692], solutions for exploiting the correlation of the stereo channels [351,470], compensating the nonlinear effects (saturation) present in most of the audio material [469,471], optimal segmentation into frames by an MDL criterion [437].

In 2008-2009 we extended the results obtained for the interleaved optimization-quantization presented for order-recursive algorithms in [692], to the least square estimation of direct form AR parameterization, showing at ICASSP’2008 a spectacular improvement of the lossless compression factor and of the encoding-decoding speed [604], when compared to the existing standard MPEG-ALS. Further points of refinement of the lossless audio schemes were pursued for bit allocation of the linear prediction coefficients in [603] and for carefully benchmarking the compression and speed performance of our lossless audio compression algorithms in [602].

1.1.5. Lossy signal compression

The ideal rate-distortion curves can be achieved by unbounded random dictionaries, as already known from Shannon’s rate-distortion theory and attempts to more practical constructs, without employing dictionaries are still very complex.

In 2008-2009 we provided a constructive method to achieve good rate-distortion performance without utilizing large dictionaries and we analyzed the performance for the case of memoryless models, in a book chapter [56].

We proposed a new low bit rate quantization scheme by using Golay dictionaries at DCC’2009 [784] and in the patent application [845]. The bottleneck of complexity being the search on each separate shell of the Golay codebooks, we derived two types of fast search algorithms submitted as patent applications, one utilizing the cyclic shifting property of the Goaly code vectors [847] and the other utilizing the properties of the hexacode for organizing the search of the nearest Golay neighbour on a given shell in the most economical way [846]. Further simplifications of the overall greedy optimization procedure led in [787,788] to low bit rate quantization schemes outperforming the traditionally RE8 lattice schemes  at SNR performance and having a lower computational cost.

1.2. Statistical Modelling and Inference Based on Information Theoretic Methods

A very interesting connection between Kolmogorov complexity and hypothesis testing was proposed and analyzed. This topic has been investigated with B. Ryabko from the Siberian State University of Telecommunication and  Computer Science, Novosibirsk, Russia, in a Theoretical Computer Science article [109] and in Journal of Statistical Planning and Inference article[110].

In 2008-2009 we proposed a number of new contributions starting from the newest approach to composite hypothesis testing proposed by Rissanen, which relies on the concept of optimally distinguishable distributions (ODD). In [671], we derived the ODD detector for the classical linear model. The work was further extended in a paper published in IEEE Transactions on Signal Processing (see [249]), where we provided answers to the following problems that have not been previously investigated in the literature: (i) the relationship between ODD and the widely used Generalized Likelihood Ratio Test; (ii) the connection between ODD and the Information Theoretic Criteria (ITC) applied in model selection.

For most of the models, the computation of the stochastic complexity is difficult, and various approximations have been proposed. We considered two approximations based on the Fisher information, and we demonstrated their performance in solving the following problems: order selection for AR and ARMA models [26], estimation of the number of sine-waves in various types of noise [43,472,473], and order selection for Markov models [474].

Since the ITC have been derived under the assumption that the measured signals are stationary, it is not straightforward to employ them in combination with the forgetting factor least-squares algorithms (FFLSA). In [268,606], we modified the Predictive Densities Criterion and the Sequentially Normalized Maximum Likelihood criterion such that to be compatible with the FFLSA. Additionally, we provided rigorous proofs concerning the asymptotic approximations of other four modified ITC.

Based on the MDL principle, the dependence between time series was recast to reflect the predictability of each of the two time series from the other, and the method provided a measure of the synchronous oscillatory brain activity for comparing the EEG response of epileptic and control children [254,414].

1.2.1. MDL Based Statistical Inference for Genomics

Important estimation problems can be formulated in the new field of genomics and proteomics, with the unusual demand of estimating hundreds of parameters. We derived maximum likelihood (ML) estimators and corresponding MDL model order selection methods, for a new technology in proteomics, and applied it to several biological problems. This lead to a IEEE Transactions in Signal Processing article [115], two other journal papers [61,81] and a conference paper [358]. This is joint work with the group of W. Zhang at MD Anderson Cancer Center, USA.

We extended the gene shaving algorithm for gene clustering and applied it to new genomic problems in [25]. This is joint work with G. Fuller from MD Anderson Cancer Center, USA

NML models for discrete regression

Discrete regression models appear in many classes of problems in genomics: classification, prediction, regulatory network modelling. A unified derivation of the normalized maximum likelihood (NML) model for different classes of discrete regression was introduced in a book chapter [32] and further refined in [563]. The same type of methodology was successfully applied also to the logit regression models in the book chapter [33].

In 2008-2009 we introduced new versions of MDL for linear models [46], for the case with data constraints [53] and for knot selection in spline models [635].

1.2.2. Extension of MDL to Modelling With Distortion

The Kolmogorov structure function, can be seen to be similar to using MDL with distortion, and both these concepts can be utilized to modelling regulatory network of genes. The initial idea was proposed in a conference paper [460].

In 2008-2009 we proposed a method for inference of gene regulatory networks based on a universal model and Kolmogorov structure function, published in EURASIP Journal of Bioinformatics and Systems Biology [184].

1.2.3. Normalized Maximum Likelihood (NML) Models for Analysis and Compression of Sequences from Small Alphabets 

We analyzed and compared different compression algorithms employed for DNA sequences, showing the superiority of our normalized maximum likelihood model in a IEEE Signal Processing Magazine article [137].

1.2.4. Efficient NML Algorithms for Analysis and Compression of DNA Sequences

We proposed a specialized compression program for DNA sequences, able to compress the full human genome with the best compression results known to date, about 1.7 bits per base for most human chromosomes (except the Y chromosome, which requires only about 1.2 bits per base) [498].

We derived and extensively tested the best compression algorithm for annotated DNA sequences, which is based on a grammar analyzer for the textual part of the annotation, while for the sequence part, it uses grammar substitutions and NML based compression. The practical implementation of the algorithm achieves the best results known for these DNA files, compressing to about 1 bit each character of the original file. The results were published in IEEE/ACM Transactions on Computational Biology and Bioinformatics [136].

1.2.5. Analysis and Compression Based on Global Sequence Alignments for Proteome Sequences

Most proteins files carry along with the amino acid sequence the information about the secondary structure of the proteins. Jointly encoding of these two strings leads to an overall better compression [479], but also gives rise to the possibility to use the compression results for predicting the secondary structure of the proteins [42].   

1.2.6. Optimization with positive polynomials

We applied the general theory of positive polynomials to the case of multivariate trigonometric polynomials and connected it to convex optimization. The most significant theoretical contribution is related to polynomials that are positive on semialgebraic sets, with application to 2-D FIR filter design [68] in the minimax sense; the distinctive feature of the method is the expression of frequency mask constraints as linear matrix inequalities and the formulation of the whole optimization problem in the semidefinite programming (SDP) framework, where it can be solved with freely available libraries; fast algorithms for this family of problems were proposed in [162].

In 2008-2009, several other applications of this theory were pursued, resulting in papers published in IEEE Trans. on Signal Processing, IEEE Signal Processing Letters and other journals. Stability tests for multidimensional systems, investigated previously in [70], have now been enhanced to the Fornasini-Marchesini model [187], using an SDP problem that is practically equivalent to testing if a matrix polynomial is nonsingular on a frequency domain. Among the applications related to compression, we have optimized dual-tree orthogonal wavelets using the positive trigonometric polynomials approach in [189] for the critically sampled case and in [265] for higher density wavelets. The theoretic foundation of the design algorithms consists of a parameterization of the convex domain of free parameters describing the generating filters (the free parameters result after imposing perfect reconstruction and regularity constraints); the parameterization helps in structuring the search procedure used in the optimization. Another publication [188] relates the design of dual-tree biorthogonal wavelets to fractional delay approximation, approached via a Bounded Real Lemma for trigonometric polynomials, expressed as a linear matrix inequality.

Other applications that use the theory of positive polynomials are in the design of nonuniform oversampled filter banks [186] that are composed of sections of uniform banks. An extension of the theoretical results has been made in [264], where positive hybrid polynomials are studied; these polynomials depend on both real and complex variables and an example of application is the design of variable FIR filters implemented with a Farrow structure.