- data aggregation and fusion
- complex networks and agent-based models
- machine learning algorithms
- interdisciplinary modeling (economics, social sciences, sports, etc.)
|08.10.2019||Grzegorz Siudem||How to measure Science?|
|18.10.2019||–||Recent literature review|
|26.11.2019||Anna Cena||The true dimension of scientific impact|
|24.01.2020||Grzegorz Siudem||Scientific problems in the universities’ rankings|
Scientific problems in the universities’ rankings
Grzegorz Siudem [WUT]
The true dimension of scientific impact
The growing popularity of bibliometric indexes (whose most famous example is the h-index by J.E. Hirsch) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich-get-richer rule (preferential attachment) while others are assigned totally at random (after all, a paper needs a bibliography), we have crafted a model that accurately summarizes citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.
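The mixing of preferential and random citations described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the model from the talk: the function name `simulate_citations`, the parameter `rho` (share of preferential citations), the fixed number of citations per step and the `+ 1` offset in the preferential weights are all hypothetical choices made for illustration.

```python
import random


def simulate_citations(n_papers, cites_per_step, rho, seed=0):
    """Grow a citation record for one hypothetical author.

    At each step one new paper is published and `cites_per_step` citations
    are handed out among the papers published so far. With probability `rho`
    a citation is assigned preferentially (weight = current citations + 1,
    an assumed offset so that uncited papers can still be picked); otherwise
    it goes to a paper chosen uniformly at random.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_papers):
        counts.append(0)  # publish a new, uncited paper
        for _ in range(cites_per_step):
            if rng.random() < rho:
                weights = [c + 1 for c in counts]
                target = rng.choices(range(len(counts)), weights=weights)[0]
            else:
                target = rng.randrange(len(counts))
            counts[target] += 1
    return sorted(counts, reverse=True)


record = simulate_citations(n_papers=50, cites_per_step=10, rho=0.7)
print(len(record), sum(record), record[:5])
```

In this sketch `n_papers` loosely plays the role of productivity, the total number of citations the role of total impact, and `rho` controls how strongly earlier success attracts further citations.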
How to measure Science?
Grzegorz Siudem [WUT]
Science of Science is a new interdisciplinary field that combines traditional bibliometrics and the philosophy and sociology of science with complexity science and data science. In this talk, I will summarize recent achievements in the field and list the (still) open questions. Beyond these purely academic considerations, we also consider the practical consequences of such research (e.g. for planning a scientific career or ranking universities).
Schedule - 2018/2019
|20.09.2018||Anna Cena||Adaptive hierarchical clustering algorithms based on data aggregation methods|
|04.10.2018||Agnieszka I. Geras||Power law and its application|
|29.11.2018||Barbara Żogała-Siudem||Variable selection algorithms for linear models based on multidimensional index|
|28.02.2019||Grzegorz Siudem||Interpolation with an arbitrary precision|
|23.05.2019||Grzegorz Siudem||Methodology of University Rankings S01E01 - QS|
|30.05.2019||Anna Cena||Methodology of University Rankings S01E02 - Times Higher Education|
|30.05.2019||Barbara Żogała-Siudem||Methodology of University Rankings S01E03 - ShanghaiRanking|
|06.06.2019||Jan Lasek||New data-driven rating systems for association football|
|13.06.2019||Jarosław Klamut||Activity clustering in continuous-time random walk formalism|
|27.06.2019||Mikołaj Biesaga & Szymon Talaga||Discovery of socio-semantic networks based on discourse analysis on large corpora of documents|
Discovery of socio-semantic networks based on discourse analysis on large corpora of documents
Mikołaj Biesaga & Szymon Talaga [UW]
When reading newspapers from different publishers and watching different news channels, people intuitively perceive how different actors are described by different discourse sources. Can we do this in an automated and systematic fashion? Recent advances in natural language processing (NLP) techniques, together with the increase in easily accessible computational power, make it possible to create new analytic methods for studying socio-semantic systems. In particular, entity recognition methods and advanced part-of-speech tagging have turned out to be crucial for automatic text processing. They make it possible not only to discover and classify the main actors but also to understand the semantics ascribed to them by content producers.
We propose a novel approach to automated discourse analysis on large corpora of documents which combines four methods: entity recognition, topic modelling, sentiment analysis and analysis of syntactic dependencies. This approach makes it possible to identify the main actors (entity recognition) and to analyze differences in how they are described (sentiment analysis and syntactic analysis) by different content sources. The differences are described along two main dimensions: semantics and sentiment (emotional valence). This technique makes it possible to discover complex networks of relations between actors and their discursive representations as well as content-generating sources.
We present the results of applying the proposed approach to a corpus of texts published in 2017 and 2018 by leading European online magazines such as POLITICO Europe and Euronews English. The scope of the articles was narrowed to issues related to the European Union (EU), Europe itself and world affairs as viewed from the European perspective. The main aim was to determine the crucial actors (i.e. politicians, institutions) present in the discourse and to discover how discourse-generating actors (i.e. POLITICO Europe, Euronews English) perceive and depict important public figures and institutions over the time period. The analysis is based on English content exclusively, but we will also discuss how this approach might be extended to other languages.
Thursday, 15:15, room 102, Math building, WUT.
Activity clustering in continuous-time random walk formalism
Jarosław Klamut [UW]
Over 50 years ago, two physicists, Montroll and Weiss, introduced a stochastic process named the Continuous-Time Random Walk (CTRW) in the physical context of dispersive transport and diffusion. The trajectory of such a process consists of elementary events: 'spatial' jumps, each preceded by a waiting time. Since its introduction, CTRW has found innumerable applications in different fields [1], including high-frequency finance [2], where jumps are considered as price increments and waiting times represent inter-trade times. Our latest results [3] suggest that dependencies between inter-trade times are the key element in explaining activity clustering in financial time series. We introduce a new CTRW model with long-term memory in waiting times, which is able to successfully describe the power-law decaying time autocorrelation of the absolute values of price changes. We test our model on empirical data from the Polish stock market.
[1] Kutner, R., Masoliver, J. (2017), The continuous time random walk, still trendy: fifty-year history, state of art and outlook, Eur. Phys. J. B, 90(3), 50.
[2] Scalas, E. (2006), Five years of continuous-time random walks in econophysics, In The complex networks of economic interactions (pp. 3-16), Springer, Berlin, Heidelberg.
[3] Klamut, J. & Gubiec, T. (2019), Directed continuous-time random walk with memory, Eur. Phys. J. B 92:69.
Thursday, 18:10, room 103, Math building, WUT.
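The basic CTRW construction from the abstract above (a jump preceded by a waiting time) can be sketched as follows. This is a memoryless baseline for illustration only: it assumes independent exponential waiting times and Gaussian jumps, so it does not contain the long-term memory in waiting times that the talk's model introduces. The function name `simulate_ctrw` and its parameters are hypothetical.

```python
import random


def simulate_ctrw(n_events, mean_wait=1.0, jump_scale=1.0, seed=0):
    """Simulate a simple CTRW trajectory with i.i.d. waits and jumps.

    Assumptions (not taken from the talk): waiting times are exponential
    with mean `mean_wait`, jumps are Gaussian with standard deviation
    `jump_scale`, and all of them are independent.
    """
    rng = random.Random(seed)
    t, x = 0.0, 0.0
    times, positions = [t], [x]
    for _ in range(n_events):
        t += rng.expovariate(1.0 / mean_wait)  # waiting time before the jump
        x += rng.gauss(0.0, jump_scale)        # 'spatial' jump (price increment)
        times.append(t)
        positions.append(x)
    return times, positions


times, positions = simulate_ctrw(n_events=1000)
print(times[-1], positions[-1])
```

In the financial reading of the abstract, `times` would correspond to trade times and `positions` to the cumulated price changes.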
New data-driven rating systems for association football
Rating systems in sports have a number of important applications. These include constructing prediction models, providing team seedings for tournaments and qualifying rounds, or creating interesting match-ups. In this presentation, we discuss the methods for building accurate team ratings with a particular focus on association football. We present several well-founded baseline approaches and how they can be optimised to yield even better results in terms of match outcome prediction accuracy. Further, we also present a bottom-up approach which is based on deriving team ratings via individual player ratings from the EA Sports FIFA video game. Next, we highlight the theory underlying the prominent Elo model. This serves as an inspiration for developing new, accurate as well as interpretable rating systems. We propose several such schemes in which the ratings are updated after consecutive matches using transparent update rules. As a further development of the bottom-up approach toward accurately measuring player skills, we propose a new model for player movements. The model is estimated using positional data that describe exact player positions during a match at a high frequency. In turn, it can be used to devise player and, in the next step, team ratings. Finally, we discuss how team rating models can be used to evaluate different league formats in a simulation study. This is an important issue in tournament design as domestic league formats vary significantly from country to country and can change from year to year. This study may help decision makers in sports to choose the optimal design that produces the most accurate team rankings.
Thursday 15:00, room 103, Math building, WUT.
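As background for the Elo model mentioned in the abstract above, the textbook Elo update can be written in a few lines. This is a minimal sketch of the classic rule only, not of the new rating schemes proposed in the talk; the value `k=20.0`, the absence of a home-advantage term and the function names are assumptions made for illustration.

```python
def elo_expected(r_home, r_away):
    """Expected score of the home team under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_away - r_home) / 400.0))


def elo_update(r_home, r_away, score_home, k=20.0):
    """Return updated (home, away) ratings after one match.

    `score_home` is 1.0 for a home win, 0.5 for a draw, 0.0 for a loss.
    The step size `k` is an assumed value, not one taken from the talk.
    """
    delta = k * (score_home - elo_expected(r_home, r_away))
    return r_home + delta, r_away - delta


print(elo_update(1500.0, 1500.0, 1.0))  # → (1510.0, 1490.0)
```

The update is transparent in the sense used in the abstract: each rating moves in proportion to the difference between the actual and the expected result, and the sum of the two ratings is preserved.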
Methodology of University Rankings S01E03 - ShanghaiRanking
Barbara Żogała-Siudem [SRI PAS]
Since 2003 (when the ShanghaiRanking was first published) academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, ShanghaiRanking and US News. Joint work with A. Cena, M.J. Mrowiński and G. Siudem under the InternationalVisibility Project.
Thursday 14:00, room 103, Math building, WUT.
Methodology of University Rankings S01E02 - Times Higher Education
Since 2003 (when the ShanghaiRanking was first published) academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, ShanghaiRanking and US News. Joint work with M.J. Mrowiński, G. Siudem and B. Żogała-Siudem under the InternationalVisibility Project.
Thursday 14:00, room 101, Math building, WUT.
Methodology of University Rankings S01E01 - QS
Since 2003 (when the ShanghaiRanking was first published) academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, ShanghaiRanking and US News. Joint work with A. Cena, M.J. Mrowiński and B. Żogała-Siudem under the InternationalVisibility Project.
Interpolation with an arbitrary precision
Interpolation (e.g. fitting to experimental data) is a well-known problem in numerical analysis. There are plenty of standard solutions; however, most of them fail when results are required with arbitrarily high precision. In this talk, we present the long story of such failures, which happily ends with a surprisingly elegant solution. Joint work with G. Świątek.
Variable selection algorithms for linear models based on multidimensional index
Barbara Żogała-Siudem [SRI PAS]
Power law and its application
Agnieszka I. Geras [WUT]
The normal distribution is commonly identified in natural sciences. However, it may not be the "dominant" distribution. We observe many phenomena which are extreme-valued, heavy-tailed or described by power-law type functions. In this talk we will present a statistical framework for discerning and quantifying power-law behaviour in empirical data, shed some light on the origins of power laws (the famous "rich get richer" rule) and take a look at interesting applications in the latest research concerning human online behaviour.
Thursday 16:15, room 431, Math building, WUT.
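One standard ingredient of quantifying power-law behaviour, as discussed in the abstract above, is the maximum-likelihood estimate of the exponent of a continuous power law p(x) ∝ x^(-alpha) for x ≥ x_min. The sketch below is not necessarily the framework presented in the talk: it assumes that `x_min` is known in advance, and the function names `sample_power_law` and `fit_alpha` are hypothetical.

```python
import math
import random


def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n samples from a continuous power law via inverse transform."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def fit_alpha(data, x_min):
    """Maximum-likelihood exponent: 1 + n / sum(ln(x_i / x_min)).

    Only observations with x >= x_min are used; choosing x_min itself is
    outside the scope of this sketch.
    """
    tail = [x for x in data if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


data = sample_power_law(n=20000, alpha=2.5)
print(fit_alpha(data, x_min=1.0))
```

Fitting a straight line to a log-log histogram is a common alternative, but the likelihood-based estimate avoids the dependence on binning.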
Adaptive hierarchical clustering algorithms based on data aggregation methods
Cluster analysis aims at determining an input data set's partition in such a way that the observations within each group are as similar (with respect to a given criterion) as possible to each other, while diversifying those from different groups. In this presentation, we introduce a new hierarchical agglomerative method which acts on a minimum spanning tree and is based on the partial information on the data structure obtained by applying the Genie algorithm. The initial, partial grouping is determined in an adaptive manner by computing the intersection of the partitions generated by the Genie method with a wide range of the inequity measure's threshold. Moreover, each element of the nested family of partitions is generated according to the information criterion or by minimizing a certain nearest neighbor-based dissimilarity with specific constraints.
We will also discuss the results of an investigation into the linkage schemes proposed by Yager in 2000 and re-introduced by Nasibov and Kandemir-Cavas in 2011. These schemes are generated by ordered weighted averages (the OWA operators) and generalize the single, complete, and average linkages.
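How an OWA operator generalizes the single, complete, and average linkages can be seen in a small sketch. The assumptions here are for illustration only: points are one-dimensional, the pairwise distances are sorted in ascending order before weighting, and the names `owa`, `owa_linkage` and the three weight generators are hypothetical.

```python
from itertools import product


def owa(values, weights):
    """Ordered weighted average: weights are applied to the sorted values."""
    ordered = sorted(values)  # ascending order (a convention of this sketch)
    if len(ordered) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, ordered))


def owa_linkage(cluster_a, cluster_b, make_weights):
    """Distance between two clusters of 1-D points as an OWA of pairwise distances."""
    dists = [abs(a - b) for a, b in product(cluster_a, cluster_b)]
    return owa(dists, make_weights(len(dists)))


def single_weights(n):
    return [1.0] + [0.0] * (n - 1)   # all weight on the smallest distance


def complete_weights(n):
    return [0.0] * (n - 1) + [1.0]   # all weight on the largest distance


def average_weights(n):
    return [1.0 / n] * n             # equal weights


a, b = [0.0, 1.0], [3.0, 5.0]
print(owa_linkage(a, b, single_weights))    # → 2.0
print(owa_linkage(a, b, complete_weights))  # → 5.0
print(owa_linkage(a, b, average_weights))   # → 3.5
```

Any other weight vector summing to 1 yields a linkage between these extremes.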
Schedule - 2017/2018
|24.11.2017||Grzegorz Siudem||How accidental scientific success is?|
|05.01.2018||Jan Lasek||Measuring the efficacy of league formats|
|26.01.2018||Marek Gągolewski||Stochastic properties of discrete Sugeno integrals|
|14.03.2018||Antoni Ruciński||Analysis of the Warsaw rail transport network|
|21.03.2018||Agnieszka I. Geras||Should we introduce a ‘dislike’ button for papers?|
|28.03.2018||Raúl Pérez-Fernández||Aggregation through the poset glass|
|08.06.2018||Hossein Yazdani||Bounded Fuzzy Possibilistic Method of Critical Objects Processing|
|21.06.2018||Maciej J. Mrowiński||Cartesian Genetic Programming with memory|
|20.06.2018||Maciej Bartoszuk||A source code similarity assessment system|
|27.06.2018||Mateusz Wiliński||Detectability of Macroscopic Structures in Directed Networks|
Abstracts - 2017/2018
Detectability of Macroscopic Structures in Directed Networks: a Stochastic Block Model Approach
Disentangling macroscopic network structures is one of the fundamental problems in complexity science and also an important subject in other fields such as mathematics, physics or computer science. It is also interesting from the point of view of machine learning and data mining. One of the most basic models of communities in networks is the stochastic block model. It was recently shown that in this case the detectability of real communities from the network topology alone is limited. Even though the results were shown only for the planted partition, where there are only two parameters, the conclusions are universal.
We examined the more general case of a directed stochastic block model. More interestingly, we show that by introducing an asymmetry of direction, we are able to increase the range of the detectable phase. Importantly, this qualitative change holds for an entire class of hardly detectable models, where both the average in- and out-degree are the same across all groups. During my presentation I will show this unintuitive result both by means of numerical simulations and with an elegant analytic approximation.
Cartesian Genetic Programming with memory
Maciej J. Mrowiński [WUT]
Cartesian Genetic Programming (CGP) is an evolutionary programming algorithm whose purpose is to evolve computer programs using concepts inspired by natural selection. Its range of applications is wide and includes problems like optimisation or image processing. Programs in CGP are encoded as graphs. The structure of an encoded CGP program is very similar to a multilayer perceptron network with different activation functions assigned to nodes and with equal weights of all connections. A CGP program is evolved using the so-called 4+1 algorithm, which tries to maximise a user-provided fitness function by creating, via mutation, new generations of programs.
Recurrent Cartesian Genetic Programming (RCGP) is a variant of CGP which allows cycles in the graphs representing programs, which implicitly introduces memory into CGP. The addition of memory broadens the range of applications of CGP and makes it a more viable tool for problems like time series forecasting. In our work, we propose a modification of CGP which results in an explicit inclusion of memory. We achieve this by directly including a shift register in each node of the CGP graph and constantly providing these registers with values processed by nodes. Thanks to this approach (which we call SRMCGP, Shift Register Memory CGP), users gain fine-grained control over the memory in the program, which is not possible in RCGP, and avoid forward, recurrent connections. In order to study the memory capabilities of RCGP and SRMCGP, we performed numerical simulations of programs whose purpose was to memorise and repeat the input signal after a given number of time steps. Our results suggest that SRMCGP is much more efficient than RCGP: usable solutions/programs can be acquired faster through SRMCGP. Additionally, SRMCGP results in a smaller number of active nodes, which makes SRMCGP programs less costly (in terms of computation time) to decode. SRMCGP is also more likely to actually create usable solutions/programs.
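The benchmark task mentioned above, repeating the input signal after a given number of time steps, is exactly what a plain shift register does. The sketch below shows only such a register in isolation; it does not implement CGP nodes, mutation or the SRMCGP encoding, and the class name `ShiftRegister` and its interface are hypothetical.

```python
from collections import deque


class ShiftRegister:
    """Fixed-length register that returns the value stored `length` steps ago."""

    def __init__(self, length, fill=0.0):
        # `fill` is the assumed initial content of every cell.
        self.cells = deque([fill] * length, maxlen=length)

    def push(self, value):
        """Store a new value and return the oldest one, which is shifted out."""
        oldest = self.cells[0]
        self.cells.append(value)  # maxlen drops the oldest cell automatically
        return oldest


register = ShiftRegister(length=3)
outputs = [register.push(x) for x in [1, 2, 3, 4, 5, 6]]
print(outputs)  # → [0.0, 0.0, 0.0, 1, 2, 3]
```

With a register of length 3 the output reproduces the input delayed by three time steps, which corresponds to the memorise-and-repeat behaviour used to compare RCGP and SRMCGP.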
A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods
This presentation deals with the problem of detecting similar source codes (also known as clones) in functional languages, especially in R. Code clone detection is useful, e.g., in programming tutoring. This work introduces a new approach to evaluating the performance of the proposed methods. What is more, the current state-of-the-art code clone detection approaches based on Program Dependence Graphs (PDG) rely on some exponential-time subroutines. In this work, the performance of a polynomial-time algorithm based on the Weisfeiler-Lehman method for finding similar graphs is examined. Since the algorithms for comparing source codes focus on different aspects of code similarity, the current work also proposes a new method (based on B-spline curves and surfaces) for aggregating the similarity results generated by different approaches. The new models not only possess an intuitive interpretation, but were also experimentally shown to yield better results. In this work, non-symmetric measures were proposed for the first time, quantifying the degree to which one function is contained within a second one and vice versa.
Friday 12:15, SRI PAS room 200
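The idea behind the Weisfeiler-Lehman method mentioned in the abstract above can be sketched as iterative label refinement on small undirected graphs. This is an illustration under stated assumptions, not the system from the talk: graphs are given as plain adjacency dictionaries rather than Program Dependence Graphs, initial labels are node degrees, and the similarity score (a multiset Jaccard index of label histograms) as well as the names `wl_histogram` and `wl_similarity` are hypothetical choices.

```python
from collections import Counter


def wl_histogram(adjacency, n_iter=2):
    """Count node labels over `n_iter` rounds of Weisfeiler-Lehman refinement.

    In each round a node's new label combines its old label with the sorted
    labels of its neighbours, so labels do not depend on node names.
    """
    labels = {v: str(len(nbrs)) for v, nbrs in adjacency.items()}
    histogram = Counter(labels.values())
    for _ in range(n_iter):
        labels = {
            v: labels[v] + "(" + ",".join(sorted(labels[u] for u in adjacency[v])) + ")"
            for v in adjacency
        }
        histogram.update(labels.values())
    return histogram


def wl_similarity(adj_a, adj_b, n_iter=2):
    """Multiset Jaccard similarity of the two label histograms (assumed score)."""
    hist_a = wl_histogram(adj_a, n_iter)
    hist_b = wl_histogram(adj_b, n_iter)
    shared = sum((hist_a & hist_b).values())
    total = sum((hist_a | hist_b).values())
    return shared / total if total else 1.0


path = {0: [1], 1: [0, 2], 2: [1]}
renamed_path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(wl_similarity(path, renamed_path))  # → 1.0
print(wl_similarity(path, triangle))
```

Each refinement round only sorts and concatenates neighbour labels, which is what keeps the procedure polynomial in the size of the graph.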
Bounded Fuzzy Possibilistic Method of Critical Objects Processing in Machine Learning
The unsatisfactory accuracy of conventional machine learning methods is mostly caused by omitting the influence of important factors in learning procedures. These factors include the type of data objects, membership assignments, and distance or similarity functions. The insufficient accuracy of prototype-based (centroid-based) methods and the causes of misclassification have been studied in detail. Data objects are fundamental to learning methods, and knowing the exact type of each object helps to provide a suitable environment for learning algorithms. A new type of object, called a critical object, has been introduced; such objects play an important role in each dataset and are considered to be causes of misclassification in learning methods. Objects' movements have also been analyzed, and a new method to handle and track these movements has been proposed. In other words, the main goal was to introduce a new method with special processing of critical objects. The proposed method, called the Bounded Fuzzy Possibilistic Method (BFPM), addresses several issues that previous clustering/classification methods have not considered. In fuzzy clustering, an object's membership values should sum to 1. Hence, any data object may obtain full membership in at most one cluster. Possibilistic clustering methods remove this restriction. However, BFPM differs from previous fuzzy and possibilistic clustering approaches by allowing the membership function to take larger values with no restriction. Furthermore, in BFPM, a data object can obtain full membership in multiple clusters or even in all clusters. A new type of feature, called dominant, has been introduced and is considered another cause of misclassification.
In addition, new similarity functions called Weighted Feature Distance (WFD) and Prioritized Weighted Feature Distance (PWFD) have been proposed to cover diversity in vector and feature spaces and to handle the impact of dominant features.
For experimental verification, the most well-known benchmark datasets available on the Internet, which have also been used in tests described in many scientific publications, were chosen. The Fuzzy C-Means function and algorithms, as well as advanced modified prototype-based methods, were compared with the proposed method. The new method was also compared with conventional supervised and unsupervised learning methods in terms of accuracy. The promising experimental results show that BFPM achieves better accuracy than conventional learning methods because it takes critical objects and dominant features into account.
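The sum-to-one restriction of fuzzy clustering referred to above can be made concrete with the standard Fuzzy C-Means membership formula. The sketch covers only this conventional baseline; the BFPM membership rule is not reproduced here because the abstract does not state it. The function name `fcm_memberships` and the default fuzzifier `m=2.0` are assumptions.

```python
def fcm_memberships(distances, m=2.0):
    """Fuzzy C-Means memberships of one object, given distances to each centroid.

    Uses u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)); the memberships always
    sum to 1, so full membership is possible in at most one cluster.
    """
    zero = [d == 0.0 for d in distances]
    if any(zero):
        # The object coincides with one or more centroids: split the full
        # membership equally among them (a common convention, assumed here).
        return [1.0 / sum(zero) if z else 0.0 for z in zero]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in distances)
            for d_i in distances]


memberships = fcm_memberships([1.0, 3.0])
print(memberships, sum(memberships))
```

An object equally distant from two centroids receives membership 0.5 in each under this rule, which is the kind of case where the abstract's method allows full membership in several clusters instead.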
Aggregation through the poset glass
The aggregation of several objects into a single one is a common study subject in mathematics. Unfortunately, whereas practitioners often need to deal with the aggregation of many different types of objects (rankings, graphs, strings, etc.), the current theory of aggregation is mostly developed for dealing with the aggregation of values in a poset. In this presentation, we will reflect on the limitations of this poset-based theory of aggregation and "jump through the poset glass". On the other side, we will not find Wonderland, but, instead, we will find more questions than answers. Indeed, a new theory of aggregation is being born, and we will need to work together on this reboot for years to come.
Should we introduce a ‘dislike’ button for academic papers?
Agnieszka I. Geras [WUT]
Citation scores and the h-index are basic tools used for measuring the quality of scientific work. Nonetheless, when evaluating academic achievements, one rarely takes into consideration why a paper was mentioned by another author: whether to highlight the connection between their works or to bring mistakes or flaws to the reader's attention. In my talk I will offer some insight into the problem of "negative" citations by analyzing data from Stack Exchange and using the proposed agent-based model. Joint work with M. Gągolewski and G. Siudem.
Analysis of the Warsaw rail transport network
Have you ever wondered how the transport system really works and what the tram traffic rules are? What is the impact of a single tram stop on the entire network? What would happen if the most significant stop were closed? No? Don't worry! I guess that the thousands of people travelling by public transport every day have not wondered either.
Online and timetable data have made it possible to answer these questions and to indicate the most meaningful features of the network. On the one hand, various centrality measures have been calculated to locate the most influential nodes. On the other hand, the percentage share of individual tram lines or low-floor trams has provided information about the organisation of traffic.
How well do trams adhere to a timetable? Which lines are the most delayed and why? Don't know? Come and listen to my talk!
Stochastic properties of the Hirsch index and other discrete Sugeno integrals
Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. Many of its natural generalizations can be expressed as simple functions of some discrete Sugeno integrals. In this talk we shall review some less-known results concerning various stochastic properties of the discrete Sugeno integral with respect to a symmetric normalized capacity, i.e., weighted lattice polynomial functions of real-valued random variables - both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure) cases. For instance, we will be interested in investigating their exact and asymptotic distributions. Based on these, we can, among others, show that the h-index is a consistent estimator of some natural probability distribution's location characteristic. Moreover, we can derive a statistical test to verify whether the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors published 40 papers) is actually significant.
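For reference, the h-index discussed above can be computed directly from its definition, and also in the max-min form that reflects its connection to discrete Sugeno integrals. This is a minimal sketch; the function names `h_index` and `h_index_maxmin` are hypothetical, and the stochastic results from the talk are not reproduced.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h_index_maxmin(citations):
    """Same value written as max over i of min(i, i-th largest citation count)."""
    ranked = sorted(citations, reverse=True)
    return max((min(i, c) for i, c in enumerate(ranked, start=1)), default=0)


print(h_index([10, 8, 5, 4, 3]))         # → 4
print(h_index_maxmin([10, 8, 5, 4, 3]))  # → 4
```

The second form makes explicit that the index is built only from maxima and minima of the ordered citation counts and the paper ranks.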
Measuring the efficacy of league formats in ranking football teams
Choosing between different tournament designs based on their accuracy in ranking teams is an important topic in football, since many domestic championships have undergone changes in recent years. In particular, the transformation of Ekstraklasa, the top-tier football competition in Poland, is a topic receiving much attention from the organizing body of the competition, participating football clubs and supporters. In this presentation we will discuss the problem of measuring the accuracy of different league formats in ranking teams. We will present various models for rating teams, which will then be used to simulate a number of tournaments in order to evaluate their efficacy, for example by measuring the probability that the best team wins. Finally, we will discuss several other aspects of league formats, including the influence of the number of points allocated for a win on the final league standings.
How accidental scientific success is?
Grzegorz Siudem [WUT]
Since the classic work of de Solla Price, the rich-get-richer rule has been well known as one of the most important mechanisms governing citation network dynamics. (Un)fortunately, it is not sufficient to explain every aspect of bibliometric data. Using the proposed agent-based model of bibliometric networks, we will shed some light on the problem and try to answer the important question from the title. Joint work with A. Cena, M. Gągolewski and B. Żogała-Siudem.