Getting started with statnipokladna

Note for Czech speakers: a previous note covering similar information, sometimes in more depth, is available in the Czech guide to the data.

This vignette provides background on the data that statnipokladna provides access to. Make sure you read the relevant parts before analysing the data: it can be quite complicated and it is easy to make mistakes and end up with wrong numbers.

The nature of the data

The data published on https://monitor.statnipokladna.cz/datovy-katalog/ are exports from a much larger database used by Státní pokladna (SP), a comprehensive budgeting and financial/accounting reporting system for most Czech public organisations.

In some ways the data is a secondary product not always dogfooded by the data maintainers, so sometimes (though not often) there are small inconsistencies on formats and content.

Several datasets are not published in open data but are available in the analytical tool (only in Czech) or the web-based overviews. This relates to data on budget preparation and budgetary responsibility, respectively.

Exports and time periods

How data for different time periods is exported differs by dataset. This has significant implications for how you get to usable full-year numbers or time series in different tables.

For budget datasets finm and finu (tables budget-central-old* and budget-local*) and all accounting reports, the data dump for month N will contain data describing what was reported at month M. The period_vykaz column will contain only one unique value. (For budget data, the budget_adopted column does not change throughout the year, but the others do. The budget_spending column contains spending up to month M.) To see time series of spending throughout a year, you need to download multiple datasets (with sp_get_table()). To get a full-year’s spending, run sp_get_table("table_id", year = "YYYY") to get the 12th month.

In the newer central budget reports, budget-central (dataset misris) and budget-state-funds (dataset finsf), data for month M will contain separate rows for spending in (not up to) each month of the year up to month M. To get cumulative spending up to month M in year Y, you need to only download the dataset for month M and then sum the rows in that table. To get a full-year’s spending, run sp_get_table("table_id", year = "YYYY") to get the 12th month’s release, containing data for all months, then sum the values across the months contained in the resulting data frame. If you need consolidation (i.e. summing up data from multiple organisations), you need to do this before summing anything up.

Types of data

Zooming in on the data that is available from the open data dumps, there are two kinds:

Codelists are generally long lists of metadata items. These allow you to “translate” codes contained in return data. Codes identify individual lines, telling you which categories the value in that line belongs to, often across multiple categorisations (see Working with budget data.)

Codelists and time periods

One thing to bear in mind is that codelist data contains all items in that codelist, regardless of the time periods for which the item is applicable. That means that for a given code ID in a codelist, there may be duplicates distinguishable only by their validity range.

The sp_add_codelist() function tries to take care of this for you, but may not success if there are oddities in the data, e.g. one code with two items valid for the same date. In that case you get an informative error message; you should then resolve the duplicates manually using data returned by sp_get_codelist() and supply the edited data to sp_add_codelist() as an object.

Primary and secondary codelists

Note also that there are two kinds of codelists: some can be joined to core data, others are joined to another codelist, in effect decoding some items in that codelist. Currently the package cannot handle the latter type automatically, so you need to save the output of sp_get_codelist() into an object and joit it yourself using dplyr.

You can see the list of available codelists at https://monitor.statnipokladna.cz/datovy-katalog/ciselniky and in sp_codelists. When I refer to codelists below, I use the ID which you can supply as the "codelist_id" parameter to sp_add_codelist().

The codelist that you will generally want for comparing organisations is called “ucjed”; this contains metadata on all organisations covered by the data. It is huge so you will want to store it somewhere after you have retrieved (and processed and filtered if relevant.)

Note that some codelists can only be joined onto other codelists, e.g. the ucjed codelist, which contains most metadata on reporting organisations, has several columns named somethings_id. Those columns can be “decoded” using using the respective something codelist.

Return data come broadly in two kinds: budgetary and accounting - see below for how to work with each of them, as the logic differs markedly. Return data come in ZIP files, which can contain on or more files, each with a return (I call them tables for more generality, hence the sp_get_table() function.)

You can see the list of returns which statnipokladna can process in the sp_tables data frame.

While the sp_get_table() returns a data frame with column names understandable in English, codelists only exist in Czech.

Basic structure of data in published ZIP files

You should not have to worry about this if you use the sp_get_table() function, but it explains what you see when you use sp_get_dataset() directly.

Generally, each CSV file will contain data for all organisations which provided the return; sometimes this will be split into two files, as in some accounting returns.

The column named ico in the output of sp_get_table() is the unique identifier of an organisation.

Often, there will be multiple returns in one dataset (ZIP file), each in a differently named CSV file. This is particularly the case for budget data.

All CSV files contain some common columns, notably those identifying the organisation, its geography, and its place in the public sector. Note that the geography is derived from the seat of the organisation, not the location of the spending.

You can see the list of datasets (ZIP files) which statnipokladna can process in the sp_datasets data frame.

Central budgets and codelists

The more recent central budgets stored in the budget-central table (misris dataset) require a different codelist for sectoral (“paragraf”) breakdown. Use the paragraf_long codelist, not paragraf. If you use paragraf, sp_add_codelist() will technically work but the joined columns will be empty.

Coverage

Put simply, the data covers organisations which report into the SP system, which broadly means all public organisations (I am sure there is a legal definition and I am sure there are exceptions, but I do not know the detail.) There are codelists “druhuj” (organisation type) and “poddruhuj” (subtype) which let you see what type an organisation belongs to, and “forma” will let you know about its legal form.

This means the data covers both state (I call them “central” in sp_tables) and local organisations incl. all 6000+ municipalities; and both public organisations themselves and their related organisations (both “podřízené” and “příspěvkové”, i.e. subsidiary and sponsored, the latter being more independent). Commercial entities owned by public organisations are included at least as codelists items and presumably at least their accounting returns will be included in the data, but I have not researched whether there are cut-offs as to what stake counts as ownership etc.

State enterprises (a special legal form) are included. And Budvar, the Czech state brewery which also has a special legal form (national company) also makes an appearance in the codelist… NB: I am not sure which return data, if any, they are included in.

State funds - a special kind of non-commercial public organisation which disburses money for specific purposes, such as transport infrastructure, are included.

One special distinction to note is Chapters (kapitola) and OSS (organizační složky státu). Chapters are top-level budget lines, typically managed by a ministry but including many organisations. OSS are quasi-organisations, some are ministries, some not, and some of which manage chapters.

Periodicity

Report/return data is published in differing periodicities, depending on the kind of return and the kind of organisation.

Generally:

See the list of currently available data releases at https://monitor.statnipokladna.cz/datovy-katalog/transakcni-data.

A ZIP file is published for each time period of each data dump. sp_get_table() handles this for you and returns one long data frame containing data for all time periods in the year or month parameters. In the resulting data frame, the per* columns mark the time period of the return.

The data becomes available approx. 3 months after the end of each period; budget data seems to be available more quickly than accounting data.

Working with budget data

Budget data has a special form: the available data files provide all sorts of breakdowns in a single file, in long format, crossed between each other. This means that any individual value will make little sense - it will be e.g. money from a particular sector, from a particular source, either capital or current. Each of these categorisations - of which there are a few more - have multiple levels of detail.

What this means is that to get a meaningful number, you need to do a lot of summarising. It also means that if you are only interested in, say, the capital spend of an organisation, you will need to:

Unless you are interested in further detail, you do not need to add any other codelists.

The sectoral breakdown (“paragrafy”) is contained in codelist “paragraf” and the functional breakdown is in “polozka” (not to be confused with “polvyk”).

The budget datasets contain columns with monetary values at each phase of the budgetary cycle: in the output of sp_get_table(), these are budget_adopted for the original budget, budget_amended for plan after amendments, budget_final (where available) and budget_spending for the final reported spend. The first of these does not change throughout the year.

Consolidation

If you are summing data across multiple organisations, you will need to take care of consolidation. Typically this concerns relations between levels of government, e.g. a region gives grants to municipalities which then spend them, and consolidation ensures you only count the money once as it flows outside the public sector.

The metadata allowing consolidation is contained in the “polozka” codelist in columns kon_*. These are TRUE/FALSE and you consolidate data by excluding certain levels (i.e. filtering out items for which that kon_* column is TRUE).

This is necessary even for seemingly smaller entities, like some municipalities, if they have any organisations which they establish, like schools - because those report their own money. The correct way to get their budgetary figures is to filter for their geography, then summarise across all organisations included in that geography, and consolidate.

Note that for the more recent central budget data (dataset misris, table budget=central) you need to join the polozka codelist to the raw data (i.e. containing rows for individual months), not to a table that has been summed by year.

Once done, double check your sum with the figure published on the Monitor if available.

Working with accounting returns

Accounting data is a bit more straightforward as it follows generally known accounting practices, i.e. you will find balance sheets, profit-and-loss accounts, etc.

The only specialised codelist that you will need specifically for accounting data is called to make sense of this data ‘polvyk’; it contains something akin to a chart of accounts.

What this package does not yet provide access to

As noted above, not all open data dumps from SP can currently be processed by statnipokladna. See sp_datasets for a list of those which can be.

What is not published in open data

As note above, some data available on the web presentation or analysis tool at https://monitor.statnipokladna.cz/ are not accessible in a documented way as open data:

In addition, there are some kinds of information which are held in other datasets:

As far as I am aware, there is no consistent dataset on the geographical breakdown of public/state spending by place of actual spend.