## Seminar Info

#### Hosts

- Prof. Marek Gagolewski (Warsaw University of Technology, Poland)
- Dr Grzegorz Siudem (Warsaw University of Technology, Poland)

#### Time

Tuesdays, 10:15 am – 12:00 pm CET/CEST (Warsaw time)

#### Location

Faculty of Mathematics and Information Science, Warsaw University of Technology, room 318, or online via MS Teams.

#### Topics

- data fusion, aggregation, and clustering
- complex networks and agent-based models
- machine learning algorithms
- computational and applied statistics
- statistical software
- interdisciplinary modelling (economics, informetrics, bibliometrics, science of science, social sciences, sports, etc.)

### Schedule

### Abstracts

###### 28.05.2024

*TBA*


##### Łukasz Brzozowski ^{[Warsaw University of Technology] }

TBA

###### 14.05.2024

*Information-based clustering: introduction*


##### Anna Cena ^{[Warsaw University of Technology] }

Information-based clustering offers an interesting approach to data grouping. However, there are many open questions regarding the specifics, from theoretical foundations and unclear notation to the usefulness of such techniques. In this talk, we will address some of these issues, raise many more, and hopefully better understand what it is we are dealing with.

###### 07.05.2024

*Biomass Flows in Food Webs: Rank-Size Analysis and DGBD Modeling*


##### Przemysław Nowak ^{[Warsaw University of Technology] }

This study analyzes food webs, focusing on the dependencies between biomass quantities and their flows in the context of rank-size distributions. The research aims to examine how these dependencies arise in various food webs and to develop a simulation model that accurately reflects these relationships.

The study employed the Discrete Generalized Beta Distribution (DGBD), which showed good agreement with all analyzed food webs. The results indicate that sorting the data, rather than averaging it, is a crucial element in the analysis of rank-size distributions. This is particularly important in the context of real-world data, where we often have only a single series available, which can only be sorted and not averaged.

The proposed simulation model of food webs replicates the properties of the DGBD, thus offering a tool that connects the theoretical approach with the practical aspects of real data analysis. The work bridges theoretical mathematical modeling and practical empirical data analysis, and highlights the importance of sorting in the context of rank-size distributions in food webs.

###### 23.04.2024

*Objective functions in community detection*


##### Łukasz Brzozowski ^{[Warsaw University of Technology] }

In this seminar we will look at various objective functions whose optimization is used in community detection in graphs. In particular, we will talk about the prospects of generalizations of those objective functions, as well as their behaviour in some border cases.

###### 16.04.2024

*Normalised Clustering Accuracy: An Asymmetric External Cluster Validity Measure*


##### Marek Gagolewski ^{[WUT & SRI PAS] }

There is no, nor will there ever be, single best clustering algorithm, but we would still like to be able to distinguish between methods which work well on certain task types and those that systematically under-perform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable, because the clusterings they promote can sometimes be meaningless.

External measures, on the other hand, compare the algorithms’ outputs to the fixed ground truth groupings that are provided by experts. In this talk, we will argue that the commonly-used classical partition similarity scores, such as the normalised mutual information, Fowlkes–Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly nor are they easily interpretable. As a consequence, it can be difficult to evaluate clustering algorithms on diverse benchmark datasets. To remedy these issues, we'll propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).

###### 20.12.2023

Wednesday 10:15, room 211, Math building, WUT.

#### Analyzing Graphs with Edge Attributes from Continuous Distributions

##### Łukasz Brzozowski ^{[Warsaw University of Technology] }

In real-world graph datasets, edge attributes are frequently represented with non-integer values, such as in measurements of node similarity, distance, or energy. However, research on the statistical properties of random graphs, where edge attributes are modelled as variables from continuous distributions, remains limited. Our study poses quite a general question: given two continuous distributions, $F_D$ for degree distribution (sum of incident edges) and $F_W$ for edge weight, can a graph exhibiting these distributions be obtained? We will show that given the independence assumptions, some analytic results are readily available, while other cases can be tackled algorithmically.

###### 30.01.2023

#### Modelling ecosystems - food webs, multilayer networks, hypergraphs.

##### Mateusz Iskrzyński ^{[Polish Academy of Sciences] }

Species extinctions are compromising ecosystem functioning and services around the globe. The effects of species loss propagate over food webs (trophic networks). Food webs are directed graphs that represent feeding relationships, encoding matter flows as links between groups of species in an ecosystem. Previous food web research has been limited by reliance on purely theoretical models, few food webs, and only unweighted networks.

In this seminar, I will present our research on food web structure and vulnerability based on the world's largest database of 245 weighted empirical food webs. I will show our various visualisation methods [1] and demonstrate the importance of weights using mass cycling as an example [2].

[1] Pawluczuk, Ł. & Iskrzyński, M. (2022) Food web visualisation: Heat map, interactive graph and animated flow network. Methods in Ecology and Evolution, https://doi.org/10.1111/2041-210X.13839, https://github.com/ibs-pan/foodwebviz

[2] Iskrzyński, M., Janssen, F., Picciolo, F., Fath, B., Ruzzenenti, F. (2021) Cycling and reciprocity in weighted food webs and economic networks. J Ind Ecol. https://doi.org/10.1111/jiec.13217

###### 05.12.2022

#### Conditional aggregation operators and their applications

##### Michał Boczek ^{[Lodz University of Technology] }

In this talk, we will introduce the concept of conditional aggregation operator and discuss its basic properties. We will use it to generalize survival functions and list some applications, including scientometrics. We will then present some Choquet-like operators based on (i) generalizations of the survival function and (ii) conditional aggregation operators.

###### 07.11.2022

#### Data collection methods for blockchain wallets and transaction data. Wallet classification, detection of mixers and blockchain fraud.

##### Kinga Pilch and Dominik Kolasa ^{[ Warsaw University of Technology ] }

Collecting data using scrapers from blockchain related websites. Use of additional data sources - APIs. Labeling data using sentiment analysis. Blockchain wallet classification using neural networks. Analysis of wallet data, detection of mixers and crime suspected addresses.

###### 26.05.2022

*q-voter models with quenched and annealed disorders on networks*


##### Arkadiusz Jędrzejewski ^{[ Wrocław University of Science and Technology ] }

Using two models of opinion dynamics, the q-voter model with independence and the q-voter model with anticonformity, I will discuss how the change of disorder from annealed to quenched affects phase transitions on different networks. The results indicate that such a change of disorder eliminates all discontinuous phase transitions and broadens ordered phases. The comparative analysis on networks reveals differences between the models that are not displayed in well-mixed populations, at the mean-field level. This demonstrates how important the network structure is in answering the question about the differences between dynamics. Numerical and analytical methods based on the mean-field and pair approximations are used to analyze the problem.

###### 28.04.2022

*Technology of co-author network scientometric analysis using the terms extracted from abstract databases*


##### Iryna Balagura ^{[ National Academy of Sciences of Ukraine ] }

This research investigates the application of co-author and co-word networks to the scientometric analysis of scientific databases. The use of computational linguistics methods for the scientometric study of an abstract database is proposed. Innovative methods for estimating the level of cooperation between scientists were developed. Methods based on an improved centrality index and the Borda count were used to detect scientific groups and to build co-word networks. The proposed methods make it possible to estimate the “weight” coefficient of authors with equal co-author ties. It is shown that constructing networks of terms from the scientific papers of the most active and cooperative scientists makes it possible to determine associative relations as well as entry relations for developing ontology research directions. It is also shown that clusters in co-authorship networks can be considered a basis for identifying scientific schools.

An algorithm for determining expert groups by formal characteristics is proposed. It solves the problem of selecting expert groups according to their qualifications and their possession of the special knowledge required for scientific and technical expertise. Expert groups can be identified by finding the clusters in co-authorship networks and building the corresponding term networks.

###### 07.04.2022

*Keywords: What Do They Know? Do They Know Things? Let’s Find Out!*


##### Anna Cena ^{[ WUT ] }

The talk will address the issue of keywords in scientific literature. We will discuss the most interesting methods of analyzing them (e.g. co-occurrence methods, clustering and/or social network analysis) and present the preliminary results obtained for a subset of the Elsevier database.

###### 31.01.2022

*Scientific success from the perspective of the strength of weak ties*


##### Maciej J. Mrowiński ^{[ WUT ] }

Granovetter’s theory of weak ties in social networks can be expressed as two separate hypotheses. The first hypothesis states that strong ties, which are responsible for the majority of interactions in a network, are located in highly connected clusters of nodes. According to the second hypothesis, these clusters are connected by weak ties, and nodes with access to such ties have an advantage over those with only strong connections. In our work, we test both parts of Granovetter’s theory using the co-authorship network of articles recreated from the DBLP dataset. We show that the very definition of connection strength and the asymmetry of social ties play a crucial role in the theory.

###### 24.01.2022

*The Formula for (bibliometric) success*


##### Grzegorz Siudem ^{[ WUT ] }

During the talk, we will be looking for compact, analytical formulas for calculating canonical bibliometric indexes. We will start with the well-known Lotka informetrics and generalise our considerations to more complex bibliometric models.

###### 17.01.2022

*An overview of honeycomb-based graph aggregation functions*


##### Grzegorz Moś ^{[ University of Silesia in Katowice ] }

Structures that are based on honeycomb graphs appear in many scientific fields. Many chemical molecules contain such a structure. Some polymers consist of a hexagonal part in their repeating subunits or are made with such hexagons [1], [2]. Some regular benzenoid strips and their polynomial representation were presented [3], [4], [5]. There are many appearances of honeycombs in nature. They are widely used in material science and physics because of their properties [6]. This prevalence makes it necessary to handle vast amounts of data, which means that databases have to grow ever larger. Thus, optimization and aggregation methods are essential to store and process such data.

An aggregation is a mapping that returns exactly one object from a given sequence of objects [7]. An important part of data aggregation is comparing the objects. This is well studied for numbers, where there are many aggregation functions, for example, averages, the minimum, and the maximum. Strings can be aggregated with a merge function or with a length function [8]. Hence, it is possible to aggregate any object. The main problem is to define axioms with meaningful grounds. The idea has to have a logical justification in the field under consideration, for example, the average in statistics or the minimum and maximum in fuzzy logic. Some axioms are expected to be satisfied by any aggregation function. The returned element should not be simpler than the simplest considered object. On the other hand, it should not be more complex than the most complex considered object. Moreover, associativity, commutativity, and monotonicity should be satisfied.

The main point of this presentation is to specify the grounds of honeycomb-based graph aggregation and to explain such an approach. It is justified by considering the simplest cases. The most straightforward structures considered are graphs based on a hexagonal grid in which every vertex belongs to at most two edges and exactly two vertices belong to exactly one edge each. Moreover, it is assumed that the graph is connected; otherwise, it would be enough to treat a disconnected graph as a set of connected graphs. The next key step is to introduce an approach for dealing with vertices of degree 3. Furthermore, the first examples of aggregation functions are presented. The greatest common subset and the least set containing the given sets are adapted to graphs for this purpose. This is preceded by a review of existing results from the literature that relate to aggregation theory. The conclusions and future work are formulated at the end.

[1] Coleman, M. M. (2019). Fundamentals of Polymer Science: An introductory text. Routledge.

[2] McCrum, N. G., Buckley, C. P., & Bucknall, C. B. (1997). Principles of polymer engineering. Oxford University Press, USA.

[3] Langner, J., Witek, H. A., & Mos, G. (2018). Zhang-zhang polynomials of multiple zigzag chains. MATCH Commun. Math. Comput. Chem, 80(1), 245-265.

[4] Witek, H. A., Langner, J., Moś, G., & Chou, C. P. (2017). Zhang–Zhang polynomials of regular 5–tier benzenoid strips. MATCH Commun. Math. Comput. Chem, 78(2), 487-504.

[5] Witek, H. A., Moś, G., & Chou, C. P. (2015). Zhang-Zhang polynomials of regular 3-and 4-tier benzenoid strips. MATCH Commun. Math. Comput. Chem, 73(2), 427-442.

[6] Wang, Z. (2019). Recent advances in novel metallic honeycomb structure. Composites Part B: Engineering, 166, 731-741.

[7] Grabisch, M., Marichal, J. L., Mesiar, R., & Pap, E. (2009). Aggregation functions (No. 127). Cambridge University Press.

[8] Gągolewski, M. (2015). Data fusion: theory, methods, and applications. Institute of Computer Science, Polish Academy of Sciences.

###### 10.01.2022

*Citation vectors and impact measures - a clustering approach*

##### Anna Cena ^{[ WUT ] }

Technical report of the research

The seminar addresses potential research in the area of scientometrics based on clustering methods and algorithms. The results obtained in terms of citation vector clustering as well as parameter analysis of selected citation models will be discussed.

###### 22.11.2021

*Modelling of citation vectors and Hirsch index*


##### Aleksandra Buczek ^{[ WUT ] }

A citation network is one of numerous examples of complex networks. Recent years have witnessed a growing interest of the scientific community in various methods of estimating and measuring the performance of researchers’ outputs. The scientific impact of a researcher can now be determined by multiple bibliometric indicators. Moreover, many models of citation networks have been developed over the years.

One of these models is the IC model, proposed in 2013 by Georgia Ionescu and Bastien Chopard. It reconstructs a citation vector based only on the numbers of publications and citations of a given author. In every time step, a constant number of citations is distributed among papers. There are two different groups of citations, self and external, both distributed according to the preferential attachment rule.

In my Engineer’s and Master’s Theses I worked on some modifications of the IC model, which were meant to improve its efficiency. I tested new ideas for numbers of citations distributed in every time step and new methods of citation distribution. In both cases modifications were based on the analysis of real bibliometric data. In my presentation I would like to talk about the IC model, proposed modifications and obtained results.
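For readers unfamiliar with the Hirsch index named in the title, the following is a minimal Python sketch of its standard definition: the largest h such that h papers have at least h citations each. The function name `h_index` and the sample citation vector are illustrative assumptions and are not taken from the talk.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break
    return h


# Hypothetical citation vector, for illustration only.
example = [10, 8, 5, 4, 3, 0]
print(h_index(example))
```

Sorting first means the index can be read off in a single pass over the ranked citation vector.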

###### 18.10.2021

*Luck, Reason, and the Price–Pareto type-2 Distributions*


##### Przemysław Nowak ^{[ WUT ] }

The Price model for the growth of a bibliographic network has been known since 1965. In our case, we distribute citations using both the preferential attachment rule (also known as rich get richer) and the random attachment rule. Applying a new method based on the rank-size distribution, we show that the Price model can be described by the Pareto type-2 distribution. Moreover, using this Price–Pareto relation, we apply different estimators of the underlying model parameters, based on the DBLP database.
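The mixture of preferential and random attachment described above can be illustrated with a short simulation. This is only a sketch under simplifying assumptions, not the model from the talk: the function name `simulate_citations`, the parameter `rho` (the share of citations assigned preferentially), and the choice to let the newest paper receive citations in the same step are all hypothetical.

```python
import random


def simulate_citations(n_papers, cites_per_step, rho, seed=0):
    """Grow a citation vector one paper per time step.

    Each step adds a paper with zero citations and then hands out
    `cites_per_step` citations: with probability `rho` preferentially
    (proportional to current citation counts), otherwise uniformly at random.
    """
    rng = random.Random(seed)
    citations = []
    for _ in range(n_papers):
        citations.append(0)
        for _ in range(cites_per_step):
            if sum(citations) > 0 and rng.random() < rho:
                # Preferential rule: rich get richer.
                target = rng.choices(range(len(citations)), weights=citations)[0]
            else:
                # Random rule: every paper is equally likely.
                target = rng.randrange(len(citations))
            citations[target] += 1
    # Return the rank-size form: citation counts in non-increasing order.
    return sorted(citations, reverse=True)


vector = simulate_citations(n_papers=50, cites_per_step=3, rho=0.5, seed=1)
print(vector[:5], sum(vector))
```

Returning the sorted vector mirrors the rank-size viewpoint mentioned in the abstract; the total number of citations is always `n_papers * cites_per_step`.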

###### 12.07.2021

*DGBD*


##### Grzegorz Siudem ^{[ WUT ] }

The Discrete Generalised Beta Distribution (DGBD) finds application in the description of data coming from an impressively broad spectrum of fields and disciplines: from scientometrics, to income and complex network modelling, to art and ecology. What is this distribution? What makes it universal? And does it have anything to do with the Beta distribution? I will try to answer these questions during my talk.
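As a pointer for the first of these questions, the form in which the DGBD is commonly written for rank-size data is given below. The symbols are the conventional ones and are an assumption here, not notation taken from the talk.

$$
f(r) = A \, \frac{(N + 1 - r)^{b}}{r^{a}}, \qquad r = 1, \dots, N
$$

Here $r$ is the rank, $N$ is the number of ranked items, $a$ and $b$ are shape exponents, and $A$ is a normalising constant. Substituting $x = r / (N + 1)$ gives a function proportional to $x^{-a} (1 - x)^{b}$, which has the same functional shape as a Beta density, and setting $b = 0$ recovers a pure power law in the rank.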

###### 26.10.2020

*Ockham’s index*


##### Marek Gagolewski ^{[ Deakin University ]}

We demonstrate that by using a triple of simple numerical summaries: an author's productivity, their overall impact, and a single other bibliometric index that aims to capture the inequality or skewness of the citation distribution, we can reconstruct other popular metrics of bibliometric impact with a sufficient degree of precision. We thus conclude that the use of many indices may be unnecessary – entities should not be multiplied beyond necessity. Such a study was possible thanks to our new agent-based model (Siudem, Żogała-Siudem, Cena, Gagolewski; PNAS 117; 2020), which not only assumes that citations are distributed according to a mixture of the rich-get-richer rule and sheer chance, but also fits real bibliometric data quite well. We investigate which bibliometric indices have good discriminative power, which measures can be easily predicted as functions of other ones, and what implications to the research evaluation practice our findings have.

###### 08.09.2020

*How to fit to bibliometric data?*


##### Barbara Żogała-Siudem ^{[SRI PAS] }

We show why fitting to cumulative sums of the highest items instead of to the original rank-size distribution can be beneficial in the case of heavy-tailed data that are frequently observed in informetrics and similar disciplines. Based on this observation, we analyse reparameterised versions of the discrete generalised beta distribution (DGBD) and power law models that preserve the total sum of elements in a citation vector. They enjoy much better predictive power with respect to many bibliometric indices.

###### 17.07.2020

*Preferential attachment rules in modeling bibliometric data*


##### Przemysław Nowak ^{[ WUT ] }

The purpose of this talk was to analyse and modify the preferential attachment rule. This was done using the Barabási-Albert model for citation networks and the Ionescu-Chopard model for citation vectors. A modification of preferential attachment by adding random attachment was proposed. By applying these approaches to data from the DBLP database, it was shown that the BA and IC models gave the same results.

###### 19.05.2020

*Pareto-Tsallis distribution for bibliometric data*


##### Grzegorz Siudem ^{[ WUT ] }

During the talk, we will follow the work: Néda Z, Varga L, Biró TS (2017) Science and Facebook: The same popularity law!. PLOS ONE 12(7): e0179656. https://doi.org/10.1371/journal.pone.0179656 and derive the Pareto-Tsallis distribution as a useful tool in the description of the citation network.

###### 26.11.2019

*The true dimension of scientific impact*


##### Anna Cena ^{[ WUT ] }

The growing popularity of bibliometric indexes (whose most famous example is the h-index by J.E. Hirsch) is opposed by those claiming that one's scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative description. We argue that neither of the two controversial extremes is true. By assuming that some citations are distributed according to the rich get richer rule (preferential attachment) while some others are assigned totally at random (all in all, a paper needs a bibliography), we have crafted a model that accurately summarises citation records with merely three easily interpretable parameters: productivity, total impact, and how lucky an author has been so far.

###### 08.10.2019

*How to measure Science?*


##### Grzegorz Siudem ^{[ WUT ] }

Science of Science is a new interdisciplinary field that combines traditional Bibliometrics, Philosophy and Sociology of Science with Complexity Science and Data Science. In this talk, I will present a summary of the recent achievements in the field as well as a list of (still) open questions. In addition to those purely academic considerations, we also consider the practical consequences of such research (e.g. in the context of planning a scientific career or ranking universities).

###### 27.06.2019

*Discovery of socio-semantic networks based on discourse analysis on large corpora of documents*


##### Mikołaj Biesaga & Szymon Talaga ^{[UW] }

While reading newspapers from different publishers and watching different news channels, humans intuitively perceive how different actors are described by different discourse sources. Can we do this in an automated and systematic fashion? Recent advances in natural language processing (NLP) techniques, together with the increase in easily accessible computational power, make it possible to create new analytic methods for studying socio-semantic systems. In particular, entity recognition methods and advanced part-of-speech tagging have turned out to be crucial for automatic text processing. They make it possible not only to discover and classify the main actors but also to understand the semantics ascribed to them by content producers.

We propose a novel approach to automated discourse analysis on large corpora of documents which is a combination of four methods: entity recognition, topic modelling, sentiment analysis and analysis of syntactic dependencies. This approach makes it possible to identify the main actors (entity recognition) and analyse differences in how they are described (sentiment analysis and syntactic analysis) by different content sources. The differences are described along two main dimensions: semantics and sentiment (emotional valence). This technique makes it possible to discover complex networks of relations between actors and their discursive representations as well as content generating sources.

We present results of an application of the proposed approach to a corpus of texts published in 2017 and 2018 by leading European online magazines such as POLITICO Europe and Euronews English. The scope of the articles was narrowed to issues related to European Union (EU), Europe itself and world affairs as viewed from the European perspective. The main aim herein was to determine crucial actors (i.e. politicians, institutions) present in the discourse and discover how discourse generating actors (i.e. POLITICO Europe, Euronews English) perceive/depict important public figures and institutions over the time period. The analysis is based on English content exclusively, but we will also discuss how this approach might be extended to other languages.

###### 13.06.2019

*Activity clustering in continuous-time random walk formalism*


##### Jarosław Klamut ^{[UW] }

Over 50 years ago, two physicists, Montroll and Weiss, introduced a stochastic process named the Continuous-Time Random Walk (CTRW) in the physical context of dispersive transport and diffusion. The trajectory of such a process is created by elementary events: ‘spatial’ jumps preceded by waiting times. Since its introduction, CTRW has found innumerable applications in different fields [1], including high-frequency finance [2], where jumps are considered as price increments and waiting times represent inter-trade times. Our latest results [3] suggest that dependencies between inter-trade times are the key element in explaining activity clustering in financial time series. We introduce a new CTRW model with long-term memory in waiting times, which is able to successfully describe the power-law decaying time autocorrelation of the absolute values of price changes. We test our model on empirical data from the Polish stock market.

[1] Kutner, R., Masoliver, J. (2017), The continuous time random walk, still trendy: fifty-year history, state of art and outlook, Eur. Phys. J. B, 90(3), 50.

[2] Scalas, E. (2006), Five years of continuous-time random walks in econophysics, In The complex networks of economic interactions (pp. 3-16), Springer, Berlin, Heidelberg.

[3] Klamut, J. & Gubiec, T. (2019), Directed continuous-time random walk with memory, Eur. Phys. J. B 92:69.

###### 06.06.2019

*New data-driven rating systems for association football*


##### Jan Lasek ^{[deepsense.ai]}

Rating systems in sports have a number of important applications. These include constructing prediction models, providing team seedings for tournaments and qualifying rounds, or creating interesting match-ups. In this presentation, we discuss the methods for building accurate team ratings with a particular focus on association football. We present several well-founded baseline approaches and how they can be optimised to yield even better results in terms of match outcome prediction accuracy. Further, we also present a bottom-up approach which is based on deriving team ratings via individual player ratings from EA Sports FIFA video game. Next, we highlight the theory underlying the prominent Elo model. This serves as an inspiration for developing new, accurate as well as interpretable rating systems. We propose several such schemes in which the ratings are updated after consecutive matches using transparent update rules. As a further development of the bottom-up approach toward accurately measuring player skills, we propose a new model for player movements. The model is estimated using positional data that describe exact player positions during a match at a high frequency. In turn, it can be used to devise player and, in the next step, team ratings. Finally, we discuss how team rating models can be used to evaluate different league formats in a simulation study. This is an important issue in tournament design as domestic league formats vary significantly from country to country and can change from year to year. This study may help decision makers in sports to choose the optimal design that produces the most accurate team rankings.
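The Elo model mentioned above updates ratings after each match with a simple transparent rule. The sketch below shows the textbook Elo update in Python; the function names, the step size `k=20` and the scale of 400 are conventional defaults chosen here as assumptions, and the rating schemes proposed in the talk may differ.

```python
def elo_expected(rating_a, rating_b, scale=400.0):
    """Expected score of team A against team B under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=20.0, scale=400.0):
    """Return both updated ratings after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw and 0.0 for a loss.
    """
    expected_a = elo_expected(rating_a, rating_b, scale)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the sum of ratings is preserved.
    return rating_a + delta, rating_b - delta


# Two equally rated (hypothetical) teams; A wins.
print(elo_update(1500.0, 1500.0, 1.0))
```

The update is zero-sum and depends only on the difference between the actual and the expected result, which is what makes the rule easy to interpret.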

###### 30.05.2019

*Methodology of University Rankings S01E03 - Shanghai Ranking*


##### Barbara Żogała-Siudem ^{[SRI PAS] }

Since 2003 (when the Shanghai Ranking was first published), academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, Shanghai Ranking and US News. Joint work with A. Cena, M.J. Mrowiński and G. Siudem under the International Visibility Project.

###### 30.05.2019

*Methodology of University Rankings S01E02 - Times Higher Education*


##### Anna Cena ^{[WUT] }

Since 2003 (when the Shanghai Ranking was first published), academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, Shanghai Ranking and US News. Joint work with M.J. Mrowiński, G. Siudem and B. Żogała-Siudem under the International Visibility Project.

###### 23.05.2019

*Methodology of University Rankings S01E01 - QS*


##### Grzegorz Siudem ^{[WUT] }

Since 2003 (when the Shanghai Ranking was first published), academic rankings have become more and more popular among policy makers, university management and the general public. They provide a simple tool for comparing different universities and make it possible to track the progress of a chosen institution over the years. In this series of seminars, we focus on the methodology of the most popular rankings, the so-called Big Four: QS, Times Higher Education, Shanghai Ranking and US News. Joint work with A. Cena, M.J. Mrowiński and B. Żogała-Siudem under the International Visibility Project.

###### 28.02.2019

*Interpolation with an arbitrary precision*


##### Grzegorz Siudem ^{[WUT] }

The problem of interpolation (e.g. fitting to experimental data) is well known in numerical analysis. There are plenty of standard solutions; however, most of them fail when one expects results with arbitrarily high precision. In this talk, we present the long story of such failures, which happily ends with a surprisingly elegant solution. Joint work with G. Świątek.

###### 29.11.2018

*Variable selection algorithms for linear models based on multidimensional index*

##### Barbara Żogała-Siudem ^{[SRI PAS] }

We will analyze methods for variable selection in linear models built on a very large number of predictors, which cannot be stored in RAM. We will explain what multidimensional indexing is and how it can be of use in modifications of stepwise regression and the Lasso method applied to large collections of variables.

###### 04.10.2018

*Power law and its application*

##### Agnieszka I. Geras ^{[WUT] }

The normal distribution is commonly identified in natural sciences. However, it may not be the "dominant" distribution. We observe many phenomena which are extreme-valued, heavy-tailed or described by power-law type functions. In this talk we will present a statistical framework for discerning and quantifying power-law behaviour in empirical data, shed some light on the origins of power laws (the famous "rich get richer" rule) and take a look at interesting applications in the latest research concerning human online behaviour.
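As a rough illustration of what "quantifying power-law behaviour" can mean in practice, the sketch below estimates the exponent of a continuous power law by maximum likelihood, using the standard estimator α̂ = 1 + n / Σ ln(xᵢ / x_min). The abstract does not say which framework the talk uses, so this is only an assumed example; the function names and the synthetic data are hypothetical, and `x_min` is taken as known rather than estimated.

```python
import math
import random


def fit_power_law_exponent(samples, x_min):
    """Maximum-likelihood exponent of a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_power_law(alpha, x_min, n, rng):
    """Draw n values from a continuous power law by inverse-transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check: generate data with a known exponent and recover it.
rng = random.Random(0)
data = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = fit_power_law_exponent(data, x_min=1.0)
```

On real data the lower cut-off `x_min` usually has to be chosen as well, and a goodness-of-fit check is needed before claiming power-law behaviour; neither step is shown here.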

###### 20.09.2018

Thursday 16:15, room 431, Math building, WUT.

*Adaptive hierarchical clustering algorithms based on data aggregation methods*

##### Anna Cena ^{[SRI PAS] }

Cluster analysis aims at determining an input data set's partition in such a way that the observations within each group are as similar (with respect to a given criterion) as possible to each other, while diversifying those from different groups. In this presentation, we introduce a new hierarchical agglomerative method which acts on a minimum spanning tree and is based on the partial information on the data structure obtained by applying the Genie algorithm. The initial, partial grouping is determined in an adaptive manner by computing the intersection of the partitions generated by the Genie method with a wide range of the inequity measure's threshold. Moreover, each element of the nested family of partitions is generated according to the information criterion or by minimizing a certain nearest neighbor-based dissimilarity with specific constraints.

We will also discuss results concerning the linkage schemes proposed by Yager in 2000 and re-introduced by Nasibov and Kandemir-Cavas in 2011. These schemes are generated by ordered weighted averages (the OWA operators) and generalize the single, complete, and average linkages.
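To make the last point concrete, here is a minimal sketch of how an OWA operator applied to the pairwise distances between two clusters reproduces the single, complete, and average linkages for particular weight vectors. The function names and the toy one-dimensional data are hypothetical and are not taken from the talk.

```python
from itertools import product


def owa(values, weights):
    """Ordered weighted average: weights are applied to the values sorted in descending order."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def owa_linkage(cluster_a, cluster_b, dist, weights):
    """Distance between two clusters as an OWA of all pairwise point distances."""
    pairwise = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    return owa(pairwise, weights)


def abs_dist(a, b):
    """Toy distance on the real line."""
    return abs(a - b)


A = [0.0, 1.0]
B = [4.0, 6.0]
n = len(A) * len(B)  # number of pairwise distances

single = owa_linkage(A, B, abs_dist, [0, 0, 0, 1])    # all weight on the smallest distance
complete = owa_linkage(A, B, abs_dist, [1, 0, 0, 0])  # all weight on the largest distance
average = owa_linkage(A, B, abs_dist, [1 / n] * n)    # equal weights
```

Other weight vectors interpolate between these three classical schemes.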

###### 27.06.2018

*Detectability of Macroscopic Structures in Directed Networks: a Stochastic Block Model Approach*

##### Mateusz Wiliński ^{[Scuola Normale Superiore di Pisa] }

Disentangling macroscopic network structures is one of the fundamental problems in complexity science and also an important subject in other fields such as mathematics, physics or computer science. It is also interesting from the point of view of machine learning and data mining. Stochastic block models are among the most basic models of communities in networks. It was recently shown that in this case the detectability of real communities from the network topology alone is limited. Even though the results were shown only for the planted partition model, where there are only two parameters, the conclusions are universal.

We examined a more general case of the directed stochastic block model. More interestingly, we show that by introducing a dissymmetry of direction, we are able to increase the range of the detectable phase. Importantly, this qualitative change holds for an entire class of hardly detectable models, where both the average in- and out-degree are the same across all groups. During my presentation I will show this unintuitive result both by means of numerical simulations and with an elegant analytic approximation.

###### 21.06.2018

**Thursday 16:15**

*Cartesian Genetic Programming with memory*

##### Maciej J. Mrowiński ^{[WUT] }

Cartesian Genetic Programming (CGP) is an evolutionary programming algorithm whose purpose is to evolve computer programs using concepts inspired by natural selection. Its range of applications is wide and includes problems like optimisation or image processing. Programs in CGP are encoded as graphs. The structure of an encoded CGP program is very similar to a multilayer perceptron network with different activation functions assigned to nodes and with equal weights of all connections. A CGP program is evolved using the so-called 4+1 algorithm, which tries to maximise a user-provided fitness function by creating, via mutation, new generations of programs.

Recurrent Cartesian Genetic Programming (RCGP) is a variant of CGP which allows cycles in the graphs representing programs, which implicitly introduces memory into CGP. The addition of memory broadens the range of applications of CGP and makes it a more viable tool for problems like time series forecasting. In our work, we propose a modification of CGP which results in an explicit inclusion of memory. We achieve this by directly including a shift register in each node of the CGP graph and constantly providing these registers with values processed by nodes. Thanks to this approach (which we call SRMCGP, Shift Register Memory CGP), users gain fine-grained control over the memory in the program, which is not possible in RCGP, and avoid forward, recurrent connections. In order to study the memory capabilities of RCGP and SRMCGP, we performed numerical simulations of programs whose purpose was to memorise and repeat the input signal after a given number of time steps. Our results suggest that SRMCGP is much more efficient than RCGP: usable solutions/programs can be acquired faster through SRMCGP. Additionally, SRMCGP results in a smaller number of active nodes, which makes SRMCGP programs less costly (in terms of computation time) to decode. SRMCGP is also more likely to actually create usable solutions/programs.

###### 20.06.2018

*A source code similarity assessment system for functional programming languages based on machine learning and data aggregation methods*

##### Maciej Bartoszuk ^{[WUT] }

This presentation deals with the problem of detecting similar source codes (also known as clones) in functional languages, especially in R. Code clone detection is useful in, e.g., programming tutoring. This work introduces a new approach to evaluating the performance of the proposed methods. What is more, the current state-of-the-art code clone detection approaches based on Program Dependence Graphs (PDG) rely on some exponential-time subroutines. In this work, the performance of a polynomial-time algorithm based on the Weisfeiler--Lehman method for finding similar graphs is examined. Since all the algorithms for comparing source codes focus on different code similarity aspects, the current work also proposes a new similarity aggregation method (based on B-spline curves and surfaces) for the results generated by different approaches. The new models not only possess an intuitive interpretation, but also were experimentally shown to yield better results. In this work, non-symmetric measures were proposed for the first time, quantifying the degree to which one function is contained within a second one and vice versa.

###### 08.06.2018

Friday 12:15, SRI PAS room 200

*Bounded Fuzzy Possibilistic Method of Critical Objects Processing in Machine Learning*

##### Hossein Yazdani ^{[Wroclaw University of Science and Technology] }

Unsatisfactory accuracy of conventional methods used in machine learning is mostly caused by omitting the influence of important factors in learning procedures. The type of data objects, membership assignments, and distance or similarity functions are among such factors. Improper accuracy of prototype-based (centroid-based) methods and causes of misclassifications have been fully studied. Data objects are considered as fundamental keys in learning methods, and knowing the exact type of objects helps provide a suitable environment for learning algorithms. A new type of object called a critical object has been introduced that plays an important role in each dataset. These objects are considered as the causes of misclassification in learning methods. Objects' movements have also been analyzed, and a new method to handle and track the objects' movements has been proposed. In other words, the main goal was to introduce a new method with special processing of critical objects. The newly proposed method, called the Bounded Fuzzy Possibilistic Method (BFPM), addresses several issues that previous clustering/classification methods have not considered. In fuzzy clustering, an object's membership values should sum to 1. Hence, any data object may obtain full membership in at most one cluster. Possibilistic clustering methods remove this restriction. However, the BFPM method differs from previous fuzzy and possibilistic clustering approaches by allowing the membership function to take larger values with no restriction. Furthermore, in the BFPM method, a data object can obtain full membership in multiple clusters or even in all clusters. A new type of feature, called dominant, has been introduced that is considered as another cause of misclassifications. New similarity functions called Weighted Feature Distance (WFD) and Prioritized Weighted Feature Distance (PWFD) have then been proposed to cover diversity in vector and feature spaces, as well as to handle the impact of dominant features.

For experimental verification, the best-known benchmark datasets available on the Internet, which have also been used in tests described in many scientific publications, were chosen. The Fuzzy C-Means function and algorithms as well as advanced modified prototype-based methods have been compared with the proposed method. The new method was compared with conventional supervised and unsupervised learning methods in terms of accuracy. Promising results achieved in the experiments indicate that the BFPM method ensures better accuracy than conventional learning methods due to taking into account the critical objects and dominant features.

###### 28.03.2018

*Aggregation through the poset glass*

##### Raúl Pérez-Fernández ^{[Ghent University] }

The aggregation of several objects into a single one is a common study subject in mathematics. Unfortunately, whereas practitioners often need to deal with the aggregation of many different types of objects (rankings, graphs, strings, etc.), the current theory of aggregation is mostly developed for dealing with the aggregation of values in a poset. In this presentation, we will reflect on the limitations of this poset-based theory of aggregation and "jump through the poset glass". On the other side, we will not find Wonderland, but, instead, we will find more questions than answers. Indeed, a new theory of aggregation is being born, and we will need to work together on this reboot for years to come.

###### 21.03.2018

*Should we introduce a ‘dislike’ button for academic papers?*

##### Agnieszka I. Geras ^{[WUT] }

Citation scores and the h-index are basic tools used for measuring the quality of scientific work. Nonetheless, while evaluating academic achievements one rarely takes into consideration for what reason the paper was mentioned by another author: whether in order to highlight the connection between their work or to bring to the reader’s attention any mistakes or flaws. In my talk I will offer some insight into the problem of "negative" citations by analyzing data from Stack Exchange and using the proposed agent-based model. Joint work with M. Gągolewski and G. Siudem.

###### 14.03.2018

*Analysis of the Warsaw rail transport network*

##### Antoni Ruciński ^{[WUT] }

Have you ever wondered how the transport system really works and what the tram traffic rules are? What is the impact of a single tram stop on the entire network? What would happen if the most significant stop were cancelled? No? Don't worry! I guess that the thousands of people travelling by public transport every day have not wondered either.

Online and timetable data have made it possible to answer these questions and to indicate the most meaningful features of the network. On the one hand, various centrality measures have been calculated to locate the most influential nodes. On the other, the percentage share of individual tram lines or low-floor trams has provided information about the traffic organisation.

How well do the trams adhere to the timetable? Which lines are the most delayed and why? Don't know? Come and listen to my talk!

###### 26.01.2018

*Stochastic properties of the Hirsch index and other discrete Sugeno integrals*

##### Marek Gagolewski ^{[WUT & SRI PAS] }

Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. Many of its natural generalizations can be expressed as simple functions of some discrete Sugeno integrals. In this talk we shall review some less-known results concerning various stochastic properties of the discrete Sugeno integral with respect to a symmetric normalized capacity, i.e., weighted lattice polynomial functions of real-valued random variables - both in i.i.d. (independent and identically distributed) and non-i.i.d. (with some dependence structure) cases. For instance, we will be interested in investigating their exact and asymptotic distributions. Based on these, we can, among others, show that the h-index is a consistent estimator of some natural probability distribution's location characteristic. Moreover, we can derive a statistical test to verify whether the difference between two h-indices (say, h'=7 vs. h''=10 in cases where both authors published 40 papers) is actually significant.
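For readers unfamiliar with the index, the sketch below computes the h-index of a citation record in two ways: directly from its definition, and through the max-min form max_i min(i, x_(i)) in which it is commonly written as a discrete Sugeno integral. The function names and the sample record are hypothetical, and the equivalence shown assumes non-negative integer citation counts.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_index_max_min(citations):
    """Max-min form: max over ranks i of min(i, i-th largest citation count)."""
    ranked = sorted(citations, reverse=True)
    return max((min(rank, count) for rank, count in enumerate(ranked, start=1)), default=0)


record = [10, 8, 5, 4, 3]  # hypothetical citation counts of five papers
h_direct = h_index(record)
h_sugeno = h_index_max_min(record)
```

Both functions agree on any list of non-negative integers; the second form is the one that lends itself to the lattice-polynomial viewpoint mentioned in the abstract.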

###### 05.01.2018

*Measuring the efficacy of league formats in ranking football teams*

##### Jan Lasek ^{[deepsense.ai]}

Choosing between different tournament designs based on their accuracy in ranking teams is an important topic in football, since many domestic championships have undergone changes in recent years. In particular, the transformation of Ekstraklasa, the top-tier football competition in Poland, is a topic receiving much attention from the organizing body of the competition, participating football clubs as well as supporters. In this presentation we will discuss the problem of measuring the accuracy of different league formats in ranking teams. We will present various models for rating teams that will next be used to simulate a number of tournaments to evaluate their efficacy, for example, by measuring the probability that the best team wins. Finally, we will discuss several other aspects of league formats, including the influence of the number of points allocated for a win on the final league standings.

###### 24.11.2017

*How accidental scientific success is?*

##### Grzegorz Siudem ^{[ WUT ] }

Since the classic work of de Solla Price, the "rich get richer" rule has been well known as the most important mechanism governing citation network dynamics. (Un)fortunately, it is not sufficient to explain every aspect of the bibliometric data. Using the proposed agent-based model for bibliometric networks, we will shed some light on the problem and try to answer the important question from the title. Joint work with A. Cena, M. Gagolewski and B. Żogała-Siudem.