View on GitHub

MADAM Seminar

Methods for Analysis of Data - Algorithms and Modeling



Seminar info



Thursdays (selected), 11:05-12:30


room 211 @ Faculty of Mathematics and Information Science, WUT, ul. Koszykowa 75, Warsaw



Date Author Title
20.09.2018 Anna Cena Adaptive hierarchical clustering algorithms based on data aggregation methods
04.10.2018 Agnieszka I. Geras Power law and its application
29.11.2018 Barbara Żogała-Siudem Variable selection algorithms for linear models based on multidimensional index
28.02.2019 Grzegorz Siudem Interpolation with an arbitrary precision



Interpolation with an arbitrary precision

Grzegorz Siudem [WUT]
The problem of the interpolation (e.g. fitting to some experimental data) is a well-known case in the numerical analysis. There are plenty of the typical solutions, however, most of them fail when one expects the results with arbitrarily high precision. In this talk, we present the long story of such failures which happily ends with a surprisingly elegant solution. Joint work with G. Świątek.

Variable selection algorithms for linear models based on multidimensional index

Barbara Żogała-Siudem [SRI PAS]

Power law and its application

Agnieszka I. Geras [WUT]
The normal distribution is commonly identified in natural sciences. However, it may not be the "dominant" distribution. We observe many phenomena which are extreme valued, heavy-tailed or described by power-law type functions. In this talk we will present statistical framework for discerning and quantifying power-law behaviour in empirical data, shed some light on the origins of power-law (the famous "the rich get richer rule") and take a look at interesting applications in the latest research concerning human online behaviour.
Thursday 16:15, room 431, Math building, WUT.

Adaptive hierarchical clustering algorithms based on data aggregation methods

Anna Cena [SRI PAS]
Cluster analysis aims at determining an input data set's partition in such a way that the observations within each group are as similar (with respect to a given criterion) as possible to each other, while diversifying those from different groups. In this presentation, we introduce a new hierarchical agglomerative method which acts on a minimum spanning tree and is based on the partial information on the data structure obtained by applying the Genie algorithm. The initial, partial grouping is determined in an adaptive manner by computing the intersection of the partitions generated by the Genie method with a wide range of the inequity measure's threshold. Moreover, each element of the nested family of partitions is generated according to the information criterion or by minimizing a certain nearest neighbor-based dissimilarity with specific constraints.
Also, results of the investigation concerning the linkage schemes proposed by Yager in 2000 and then re-introduced by Nasibov and Kandemir-Cavas in 2011, which are generated by the weighted ordered averages -- the OWA operators, generalizing the single, complete, and average linkages will be discussed.

Schedule - 2017/2018

Date Author Title
24.11.2017 Grzegorz Siudem How accidental scientific success is?
05.01.2018 Jan Lasek Measuring the efficacy of league formats
26.01.2018 Marek Gagolewski Stochastic properties of discrete Sugeno integrals
14.03.2018 Antoni Ruciński Analysis of the Warsaw rail transport network
21.03.2018 Agnieszka I. Geras Should we introduce a ‘dislike’ button for papers?
28.03.2018 Raúl Pérez-Fernández Aggregation through the poset glass
08.06.2018 Hossein Yazdani Bounded Fuzzy Possibilistic Method of Critical Objects Processing
21.06.2018 Maciej J. Mrowiński Cartesian Genetic Programming with memory
20.06.2018 Maciej Bartoszuk A source code similarity assessment system
27.06.2018 Mateusz Wiliński Detectability of Macroscopic Structures in Directed Network

Abstracts - 2017/2018


Detectability of Macroscopic Structures in Directed Networks: a Stochastic Block Model Approach

Mateusz Wiliński [Scuola Normale Superiore di Pisa]
Disentangling network macroscopic structures is one of the funding problems in complexity science and also an important subjects in other fields like mathematics, physics or computer science. It is also interesting from the point of view of machine learning and data mining. One of the most basic models of communities in networks are the stochastic block models. It was recently shown in that in this case the detectability of real communities only from the network topology is limited. Even though the results were shown only for planted partition, where there are only two parameters, the conclusions are universal.
We examined a more general case of directed stochastic block model. More interestingly, we shown that by introducing a dissymmetry of direction, we are able to increase the range of the detectable phase. Importantly, this qualitative change holds for an entire class of hardly detectable models, where both the average in- and out-degree are the same across all groups. During my presentation I will show this unintuitive result both by means of numerical simulations and with an elegant analytic approximation.

Thursday 16:15

Cartesian Genetic Programming with memory

Maciej J. Mrowiński [WUT]
Cartesian Genetic Programming (CGP) is an evolutionary programming algorithm whose purpose is to evolve computer programs using concepts inspired by natural selection. It's range of applications is wide and includes problems like optimisation or image processing. Programs in CGP are encoded as graphs. The structure of an encoded CGP program is very similar to a multilayer perceptron network with different activation functions assigned to nodes and with equal weights of all connections. A CGP program is evolved using the so called 4+1 algorithm, which tries to maximise a user-provided fitness function by creating, via mutation, new generations of programs.
Recurrent Cartesian Genetic Programming (RCGP) is a variant of CGP which allows cycles in the graphs representing programs, which implicitly introduces memory into CGP. The addition of memory broadens the range of applications of CGP and makes it a more viable tool for problems like time series forecasting. In our work, we propose a modification of CGP which results in an explicit inclusion of memory. We achieve this by directly including a shift register in each node of the CGP graph and constantly providing these registers with values processed by nodes. Thanks to this approach (which we call SRMCGP Shift Register Memory CGP), users gain fine-grained control over the memory in the program, which is not possible in RCGP, and avoid forward, recurrent connections. In order to study the memory capabilities of RCGP and SRMCGP, we performed numerical simulations of programs whose purpose was to memorise and repeat the input signal after a given number of time steps. Our results suggest that SRMCGP is much more efficient than RCGP [UTF-8?]– usable solutions/programs can be acquired faster through SRMCGP. Additionally, SRMCGP results in a smaller number of active nodes which makes SRMCGP programs less costly (it terms of computation time) to decode. SRMCGP is also more likely to actually create usable solutions/programs.

A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods

Maciej Bartoszuk [WUT]
This presentation deals with the problem of detecting similar source codes (also known as clones) in functional languages, especially in R. Code clones' detection is useful in e.g. programming tutoring. This work introduces a new approach to evaluate performance of proposed methods. What is more, the current state-of-the-art code clone detection approaches based on Program Dependence Graphs (PDG) rely on some exponential-time subroutines. In this work, the performance of a polynomial-time algorithm based on the Weisfeier--Lehman method for finding similar graphs is examined. Since all the algorithms for comparing source codes focus on different code similarity aspects, the current work also proposes new similarity aggregation method (based on B-splines curves and surfaces) of the results generated by different approaches. The new models not only posses an intuitive interpretation, but also were experimentally shown to yield better results. In this work for the first time non symmetric measures were proposed, quantifying the degree to which one function is contained within a second one and vice versa were proposed.
Friday 12:15, SRI PAS room 200

Bounded Fuzzy Possibilistic Method of Critical Objects Processing in Machine Learning

Hossein Yazdani [Wroclaw University of Science and Technology]
Unsatisfactory accuracy of conventional methods used in machine learning is mostly caused by omitting the influence of important factors in learning procedures. Type of data objects, membership assignments, and distance or similarity functions are among others such parameters. Improper accuracy of prototype-based (centroid-based) methods and causes of misclassifications have been fully studied. Data objects are considered as fundamental keys in learning methods and knowing the exact type of objects leads to provide a suitable environment for learning algorithms. A new type of object called critical object has been introduced that plays the important role in each dataset. These objects are considered as the causes of misclassification in learning methods. Objects' movements have been also analyzed and then a new method to handle and track the objects' movements has been proposed. In other words, the main goal was to introduce a new method with special processing of critical objects. The new proposed method called Bounded Fuzzy Possibilistic Method (BFPM) addresses several issues that previous clustering/classification methods have not considered. In fuzzy clustering, the object's membership values should sum to 1. Hence, any data object may obtain full membership in at most one cluster. Possibilistic clustering methods remove this restriction. However, the BFPM method differs from previous fuzzy and possibilistic clustering approaches by allowing the membership function to take larger values with no restriction. Furthermore, in the BFPM method, a data object can obtain full membership in multiple clusters or even in all clusters. A new type of feature called as dominant has been introduced that is considered as another cause of misclassifications. And then new similarity functions called Weighted Feature Distance (WFD) and Prioritized Weighted Feature Distance (PWFD) have been proposed to cover diversity in vector and feature spaces, as well as handling the impact of dominant features.
In experimental verifications, the most well-known benchmark datasets available in the Internet that have been also used in tests described in many scientific publications are chosen. Fuzzy C-Means function and algorithms as well as advanced modified prototype-based methods have been compared with the proposed method. The new method was compared with conventional supervised and unsupervised learning methods in terms of accuracy. Promising results achieved by the experiments prove that the BFPM method ensures better accuracy than conventional learning methods due to taking into account the critical objects and dominant features.

Aggregation through the poset glass

Raúl Pérez-Fernández [Ghent University]
The aggregation of several objects into a single one is a common study subject in mathematics. Unfortunately, whereas practitioners often need to deal with the aggregation of many different types of objects (rankings, graphs, strings, etc.), the current theory of aggregation is mostly developed for dealing with the aggregation of values in a poset. In this presentation, we will reflect on the limitations of this poset-based theory of aggregation and "jump through the poset glass". On the other side, we will not find Wonderland, but, instead, we will find more questions than answers. Indeed, a new theory of aggregation is being born, and we will need to work together on this reboot for years to come.

Should we introduce a ‘dislike’ button for academic papers?

Agnieszka I. Geras [WUT]
Citations scores and the h-index are basic tools used for measuring the quality of scientific work. Nonetheless, while evaluating academic achievements one rarely takes into consideration for what reason the paper was mentioned by another author - whether in order to highlight the connection between their work or to bring to the reader’s attention any mistakes or flaws. In my talk I will shed some insight into the problem of "negative" citations analyzing data from the Stack Exchange and using the proposed agent-based model. Joint work with M. Gągolewski and G. Siudem.

Analysis of the Warsaw rail transport network

Antoni Ruciński [WUT]
Have you ever wondered how the transport system really works and what are the trams traffic rules? What is the impact of a single tram stop to the entire network? What would happen if the most significant stop is cancelled? No? Don't worry! I guess that thousand of people travelling by public transport everyday have not wondered neither.
Online and timetable data has made it possible to answer to the questions and indicate the most meaningful features of the network. On one hand various centrality measures have been calculated to locate the most influential nodes. On the other, percentage share of individual tram lines or low-floor trams has given information about the traffic organisation.
What is the trams' ability to adhere to a timetable? Which lines are the most delayed and why? Don't know? Come and listen to my speech!

Stochastic properties of the Hirsch index and other discrete Sugeno integrals

Marek Gagolewski [WUT & SRI PAS]
Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. Many of its natural generalizations can be expressed as simple functions of some discrete Sugeno integrals. In this talk we shall review some less-known results concerning various stochastic properties of the discrete Sugeno integral with respect to a symmetric normalized capacity, i.e., weighted lattice polynomial functions of real-valued random variables - both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure) cases. For instance, we will be interested in investigating their exact and asymptotic distributions. Based on these, we can, among others, show that the h-index is a consistent estimator of some natural probability distribution's location characteristic. Moreover, we can derive a statistical test to verify whether the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors published 40 papers) is actually significant.

Measuring the efficacy of league formats in ranking football teams

Jan Lasek []
Choosing between different tournament designs based on their accuracy in ranking teams is an important topic in football since many domestic championships underwent changes in the recent years. In particular, the transformations of Ekstraklasa -- the top-tier football competition in Poland -- is a topic receiving much attention from the organizing body of the competition, participating football clubs as well as supporters. In this presentation we will discuss the problem of measuring the accuracy of different league formats in ranking teams. We will present various models for rating teams that will be next used to simulate a number of tournaments to evaluate their efficacy, for example, by measuring the probability of the best team win. Finally, we will discuss several other aspects of league formats including the influence of the number of points allocated for a win on the final league standings.

How accidental scientific success is?

Grzegorz Siudem [ WUT ]
Since the classic work of de Sola Price the rich get richer rule is well known as a most important mechanism governing the citation network dynamic. (Un-)Fortunatelly it is not sufficient to explain every aspect of the bibliometric data. Using the proposed agent-based model for the bibliometric networks we will shed some light on the problem and try to answer the important question from the title. Joint work with A. Cena, M. Gagolewski and B. Żogała-Siudem.