Non-parametric testing of two or more
random samples from an unknown probability
distribution using ESRI products
Author: Radek BRABLEC | Supervisor: Mgr. Pavel TUČEK, Ph.D.
Summary
Mathematical statistics is a scientific discipline on the boundary between descriptive statistics and applied mathematics. Using the methods of probability theory, it seeks to estimate the properties of the distribution of the observed data. Parametric and non-parametric testing belong to these methods. Parametric tests presuppose a specific data distribution described by given parameters, which are used in the calculation. If the data distribution is unknown, non-parametric tests are used instead. Non-parametric testing is also used for data on an ordinal scale. The weak point of non-parametric tests is their lower informative value, especially when only a small amount of measured data is available.
The aim of this work was to study non-parametric testing methods. We compared the theoretical findings on non-parametric testing with the practical results obtained during a three-day survey of the municipal bus operation in citi. The municipal bus operation in the above-mentioned district is provided by 4 transport companies. Public transport in the town itself is provided by transport company C. Under the rules of the Integrated system, a valid travel document can also be used when travelling with another transport company. As part of the survey we counted the number of people getting on and off the bus at individual stops of a given bus route, and we recorded the cross-use of travel documents issued by different transport companies. The results were then compared with one another by various parameters in the environment of the software R. We visualized the obtained data with ESRI products.
To achieve the aims of our work it was necessary to study mathematical statistics and, above all, the individual types of non-parametric tests. In the theoretical part we explained the normality test, which determines whether the data sets can be used for further testing. We also covered the theory of the most commonly used one-sample, two-sample and multi-sample non-parametric tests, which we then applied in the practical calculations. At the end of the theoretical part we described selected software that supports non-parametric testing. This part of the task was the most time-consuming, because we had to study the literature on the topic to understand the subject thoroughly.
In the practical part we compared the measured data from the statistical survey. As the survey was carried out by a number of student volunteers, the outcomes are subject to a certain error, which could have been caused by a lack of information or by misunderstanding of the instructions.
The data processing went without major complications. First, we recorded and cleaned the data in MS Excel. Then we processed the data and grouped them according to the parameters we wanted to compare. The resulting data sets were saved as plain text files. Finally, the testing itself took place in the environment of the statistical software R, where the individual data sets were loaded from the text files and an initial normality check was carried out with the Shapiro-Wilk test. The Shapiro-Wilk normality test showed us that the measured data could be used for further testing.
For pairs of data sets we used the Wilcoxon test for two related samples. The results showed that the number of passengers getting on differed from the number getting off. This held for both company A and company B. The result was rather surprising to us, because we expected the numbers of passengers getting on and getting off in the town of citi to be the same. Differences were also apparent for cross tickets bought from companies A and C and used on the buses of company B. This is illustrated in appendix no. 3, which shows a higher number of passengers using the tickets of transport company C, which runs the town public transport in citi.
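The paired comparison described above can be sketched in a few lines. The work itself ran the test in R; the following is an independent, minimal Python sketch using only the standard library. The function names and the boarding and alighting counts are hypothetical and serve only as illustration, and the p-value uses a plain normal approximation without tie or continuity correction, which is crude for samples this small.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, two-sided p) for paired samples x and y."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # Normal approximation (assumption: no tie or continuity correction).
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = erfc(abs(z) / sqrt(2))
    return w_plus, w_minus, p_value


# Hypothetical counts at eight stops (not the survey data).
boarding = [12, 7, 15, 9, 20, 5, 11, 14]
alighting = [10, 9, 11, 4, 14, 6, 3, 7]
w_plus, w_minus, p_value = wilcoxon_signed_rank(boarding, alighting)
print(w_plus, w_minus, round(p_value, 3))
```

With these made-up counts the rank sums are W+ = 32.5 and W- = 3.5, and the approximate p-value falls below 0.05, so the hypothesis of equal boarding and alighting numbers would be rejected at the 5 % level. For real analyses, R's `wilcox.test` with `paired = TRUE` is the more careful choice.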
To compare more than two data sets we used the Kruskal-Wallis test. The tests showed that the numbers of passengers getting on and off in the individual time periods (morning, afternoon and evening) differ for company B, whereas for company A the data were evaluated as identical. Further testing, in which we compared the numbers of passengers getting on and off on individual days (Tuesday, Wednesday and Thursday), led us to the same conclusion.
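The comparison of three time periods can be illustrated in the same way. The following is a minimal Python sketch of the Kruskal-Wallis H statistic with the usual tie correction, not the R code used in the work. The function names and the passenger counts are hypothetical. For exactly three groups the chi-square approximation has two degrees of freedom, in which case the p-value reduces to exp(-H/2); the helper below relies on that special case.

```python
from collections import Counter
from math import exp


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)  # ranks follow the pooled order
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom (three groups only)."""
    return exp(-h / 2)


# Hypothetical passenger counts per time period (not the survey data).
morning = [27, 31, 35, 40]
afternoon = [12, 15, 18, 22]
evening = [5, 8, 9, 11]
h = kruskal_wallis_h(morning, afternoon, evening)
p_value = p_value_three_groups(h)
print(round(h, 3), round(p_value, 4))
```

In this made-up example the three periods do not overlap at all, so H is large (about 9.85) and the approximate p-value is below 0.01. With only four observations per group the chi-square approximation is rough, which matches the caution about small samples expressed in this summary; R's `kruskal.test` would be the natural tool for the real data.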
For the overall cross-use of travel documents at various times of day and on various days, the data were evaluated as comparable. The measured data differed in only one case: the cross-use of tickets on the buses of company B in the individual parts of the day (morning, afternoon and evening).
When testing cross travel documents on the bus routes of company A, we did not reject the hypothesis that the tested data are identical. We were surprised that the results for cross tickets used on the buses of company B on Tuesdays, Wednesdays and Thursdays were comparable with the other tests carried out on the buses of B. The reason might be that these buses are regularly used by students and workers.
The test outcomes are not always reliable because some samples were small, e.g. the tests for company A or the number of routes observed on Wednesday and Thursday for companies B and A. The samples on these days were small because on the first day of the survey we covered all bus routes, while on the following days we looked at only some of them. The question is what results we would have obtained had all routes been counted every day.
The testing procedure could also be applied in other towns, for example citi1 or citi2, which likewise belong to the Integrated transport system of region. A statistical survey of public transport in these towns is supposed to be carried out in the autumn and spring of the next calendar year. It will be interesting to observe how a larger number of transport companies operating within the district (in the town of citi1) will influence the statistical survey. A larger number of bus stops served by the municipal bus operation within the district (in citi2) might also have an impact on the survey. Another problem could be the collection of data in the district of citi2, mainly because of the size of the district, the number of routes and the larger number of passengers.
In conclusion, non-parametric tests are a suitable tool for comparing large amounts of data. We found that with a smaller number of observations non-parametric testing is not as accurate as with a large sample. As for software, ESRI products lack support for non-parametric testing. It is therefore not possible to carry out the tests, whose results could serve as background material for maps, directly in this environment.
As part of our bachelor project we created a website hosted on the server of the Department of Geoinformatics UP.
To protect the data obtained from the transport companies and from the company providing statistical surveys and evaluation, we cannot allow third parties to read the text and the data of this work.