| Title: | Creating Correspondence Tables Between Two Statistical Classifications |
|---|---|
| Description: | A candidate correspondence table between two classifications can be created when there are correspondence tables leading from the first classification to the second one via intermediate 'pivot' classifications. The correspondence table between two statistical classifications can be updated when one of the classifications gets updated to a new version. |
| Authors: | Vasilis Chasiotis [aut] (Department of Statistics, Athens University of Economics and Business), Photis Stavropoulos [aut] (Quantos S.A. Statistics and Information Systems), Martin Karlberg [aut], Mátyás Mészáros [cre], Martina Patone [aut], Erkand Muraku [aut], Clement Thomas [aut], Loic Bienvenu [aut], Mauro Baldacchini [aut], Khadija Sossey-Lallemand [aut], Mark van der Loo [aut] |
| Maintainer: | Mátyás Mészáros <[email protected]> |
| License: | EUPL |
| Version: | 1.0.2 |
| Built: | 2026-05-18 11:14:58 UTC |
| Source: | https://github.com/eurostat/correspondencetables |
Aggregates numeric values from a source classification A into a target classification B using a correspondence table AB. Each value in A is redistributed to one or more B codes according to weights in AB (if available), or split equally when no valid weights exist.
aggregateCorrespondenceTable(AB, A, B = NULL)aggregateCorrespondenceTable(AB, A, B = NULL)
AB |
A |
A |
A |
B |
Optional |
This function is used when data expressed in one classification must be converted into another classification using an A→B correspondence table.
The workflow is:
Detect and standardize column names in AB, A and B
(case-insensitive).
For each code in A, determine all corresponding B codes.
Apply proportional allocation:
If AB contains a numeric weight column, weights
are normalized per source code.
If weights are missing, invalid, or sum to zero, values are split equally across all mapped B codes.
Values are multiplied by the resulting weights and aggregated by B code.
If B is provided, the result is aligned to its list of codes
(unmatched codes receive 0).
This function is appropriate for tasks such as converting statistical datasets from NACE to CPA, CPA to CN, CPC to HS, or any situation where values have to be re-expressed according to a different classification system.
A list with:
result: data.frame with aggregated values for each B code
(and any additional columns from B, if provided),
mapping: standardized A→B mapping used internally,
diagnostics: list with total input value, mapped value,
coverage ratio, and codes in A that could not be mapped.
retrieveCorrespondenceTable for downloading AB tables.
## Simple example (equal split) AB <- data.frame( from_code = c("A1","A1","A2"), to_code = c("B1","B2","B1") ) print(AB) A_tbl <- data.frame( code = c("A1","A2"), value = c(100, 50) ) print(A_tbl) result <- aggregateCorrespondenceTable(AB, A_tbl) print(result$result) print(result$mapping)## Simple example (equal split) AB <- data.frame( from_code = c("A1","A1","A2"), to_code = c("B1","B2","B1") ) print(AB) A_tbl <- data.frame( code = c("A1","A2"), value = c(100, 50) ) print(A_tbl) result <- aggregateCorrespondenceTable(AB, A_tbl) print(result$result) print(result$mapping)
Perform analysis on a correspondence table using its associated classifications to generate various statistics. It checks the validity of the input data, identifies components, calculates correspondence types, and creates summary tables.
analyseCorrespondenceTable( AB, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE )analyseCorrespondenceTable( AB, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE )
AB |
A mandatory argument containing a data frame provided by the user
with the correspondence table data. The table must contain at least two
columns containing the two classifications |
Aname |
Optional. A character string specifying the name of the column in 'AB' that should be used as the source classification ('Acode'). If 'NULL', the function assumes that the first column of 'AB' represents the source classification. |
Bname |
Optional. A character string specifying the name of the column in 'AB' that should be used as the target classification ('Bcode'). If 'NULL', the function assumes that the second column of 'AB' represents the target classification. |
A |
A data frame containing the source classification data with
an |
B |
A data frame containing the target classification data with
a |
longestAcodeOnly |
Logical. If |
longestBcodeOnly |
Logical. If |
A list containing two data frames: Inventory and Analysis.
Inventory contains statistics related to components, correspondence
types, and source/target positions. Analysis contains statistics for
each class in the correspondence table.
# Use data from the folder extdata/test AB = read.csv(system.file("extdata/test", "ExempleAnnexe.csv", package = "correspondenceTables")) head(AB) result <- analyseCorrespondenceTable( AB = AB, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result$Inventory) print(result$Analysis) # the name of the columns does not matter: AB_mod = AB colnames(AB_mod) = c("a", "b") result_mod <- analyseCorrespondenceTable( AB = AB_mod, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result_mod$Inventory) print(result_mod$Analysis) # examples with using Aname and Bname # fist we add two more column to the same dataset set.seed(123) AB_mod2 = AB_mod AB_mod2 <- cbind(new_first_col = sample(1:100, nrow(AB_mod2), replace = TRUE), AB_mod2) AB_mod2 <- cbind( AB_mod2[1:2], new_middle_col = sample(1:100, nrow(AB_mod2), replace = TRUE), AB_mod2[3:ncol(AB_mod2)] ) colnames(AB_mod2) result_mod2 <- analyseCorrespondenceTable( AB = AB_mod2, Aname = "a", Bname = "b", A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result_mod2$Inventory) print(result_mod2$Analysis) # check if we obtained the same results in all the 3 examples first = identical(result$Inventory, result_mod$Inventory) second = identical(result_mod$Inventory, result_mod2$Inventory) first && second # TRUE # the following won't work as we changed column names in AB_mod, # and added new columns in AB_mod2 but the results are the same. first = identical(result$Analysis, result_mod$Analysis) second = identical(result_mod$Analysis, result_mod2$Analysis) first && second# Use data from the folder extdata/test AB = read.csv(system.file("extdata/test", "ExempleAnnexe.csv", package = "correspondenceTables")) head(AB) result <- analyseCorrespondenceTable( AB = AB, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result$Inventory) print(result$Analysis) # the name of the columns does not matter: AB_mod = AB colnames(AB_mod) = c("a", "b") result_mod <- analyseCorrespondenceTable( AB = AB_mod, Aname = NULL, Bname = NULL, A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result_mod$Inventory) print(result_mod$Analysis) # examples with using Aname and Bname # fist we add two more column to the same dataset set.seed(123) AB_mod2 = AB_mod AB_mod2 <- cbind(new_first_col = sample(1:100, nrow(AB_mod2), replace = TRUE), AB_mod2) AB_mod2 <- cbind( AB_mod2[1:2], new_middle_col = sample(1:100, nrow(AB_mod2), replace = TRUE), AB_mod2[3:ncol(AB_mod2)] ) colnames(AB_mod2) result_mod2 <- analyseCorrespondenceTable( AB = AB_mod2, Aname = "a", Bname = "b", A = NULL, B = NULL, longestAcodeOnly = FALSE, longestBcodeOnly = FALSE ) print(result_mod2$Inventory) print(result_mod2$Analysis) # check if we obtained the same results in all the 3 examples first = identical(result$Inventory, result_mod$Inventory) second = identical(result_mod$Inventory, result_mod2$Inventory) first && second # TRUE # the following won't work as we changed column names in AB_mod, # and added new columns in AB_mod2 but the results are the same. first = identical(result$Analysis, result_mod$Analysis) second = identical(result_mod$Analysis, result_mod2$Analysis) first && second
List all classification schemes available on the selected repository (CELLAR, FAO, or both). Returns, for each scheme, its prefix, identifier, URI, English title, and the list of languages detected on its concepts.
classificationList(endpoint = "ALL", showQuery = FALSE)classificationList(endpoint = "ALL", showQuery = FALSE)
endpoint |
Character scalar. One of "CELLAR", "FAO" or "ALL". "ALL" queries both repositories and returns a named list. |
showQuery |
Logical; if TRUE, returns also the SPARQL query used. |
If endpoint is "CELLAR" or "FAO" and showQuery = FALSE: a data.frame with columns Prefix, ConceptScheme, URI, Title, Languages.
If showQuery = TRUE: a list with "ClassificationList" and "SPARQL.query".
If endpoint = "ALL": a named list with $CELLAR and $FAO. If showQuery = TRUE, each element is itself a list (ClassificationList + SPARQL.query).
## Not run: head(classificationList("CELLAR")) res <- classificationList("FAO", showQuery = TRUE) names(res) ## End(Not run)## Not run: head(classificationList("CELLAR")) res <- classificationList("FAO", showQuery = TRUE) names(res) ## End(Not run)
This function performs a series of structural and logical quality-control checks on a hierarchical classification. It verifies the consistency of the hierarchical levels, the presence of parent–child relationships, the uniqueness of labels, the compliance with admissible coding rules, and the sequencing of children within each level. The output is a list of data frames flagging potential issues.
classificationQC( classification, lengths, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL )classificationQC( classification, lengths, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL )
classification |
A |
lengths |
A |
fullHierarchy |
Logical.
If |
labelUniqueness |
Logical.
When |
labelHierarchy |
Logical.
When
|
singleChildCode |
A |
sequencing |
A |
A list of data frames flagging potential issues in the classification:
QC_output: full augmented table with computed levels,
code segments, and all compliance flags;
QC_noLevels: rows for which hierarchical level could not be
determined;
QC_orphan: positions that lack an expected parent;
QC_childless: positions expected to have children but do not;
QC_duplicatesLabel: positions with duplicate labels within
the same level;
QC_duplicatesCode: duplicated codes;
QC_singleChildMismatch: label inconsistencies for
single-children relationships;
QC_singleCodeError: single children with invalid codes (if
singleChildCode is provided);
QC_multipleCodeError: non-single children with invalid codes
(if sequencing is provided);
QC_gapBefore: missing expected codes in sequence;
QC_lastSibling: flag for last sibling in sequencing rules.
classification <- read.csv(system.file("extdata/test", "Nace2_long.csv", package = "correspondenceTables")) lengths <- data.frame(charb = c(1,2,3,5), chare = c(1,2,4,5)) Output <- classificationQC( classification = classification, lengths = lengths, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL ) print(Output$QC_output) print(Output$QC_orphan) print(Output$QC_childless) print(Output$QC_duplicatesLabel) print(Output$QC_duplicatesCode)classification <- read.csv(system.file("extdata/test", "Nace2_long.csv", package = "correspondenceTables")) lengths <- data.frame(charb = c(1,2,3,5), chare = c(1,2,4,5)) Output <- classificationQC( classification = classification, lengths = lengths, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL ) print(Output$QC_output) print(Output$QC_orphan) print(Output$QC_childless) print(Output$QC_duplicatesLabel) print(Output$QC_duplicatesCode)
Returns a data frame listing available correspondence tables (mappings between statistical classifications) published by one or more supported online services.
Use this function to discover valid 'prefix' and 'ID_table' values before calling 'retrieveCorrespondenceTable()'.
correspondenceTableList(endpoint = "ALL", showQuery = FALSE)correspondenceTableList(endpoint = "ALL", showQuery = FALSE)
endpoint |
Character. Which service(s) to query. One or more of the supported endpoints (e.g. "CELLAR", "FAO"), or "ALL" to query all available services. Case-insensitive. If multiple endpoints are supplied (including "ALL"), a **named list** is returned; otherwise a single 'data.frame'. |
showQuery |
Logical. If 'TRUE', the SPARQL queries used are emitted via 'message()' (suppress with 'suppressMessages()'). |
Read-only queries; no side effects (no file writes or caching).
If a single endpoint is requested, a 'data.frame' (one row per correspondence). If multiple endpoints are requested (including "ALL"), a **named list** of data frames, one per endpoint.
retrieveCorrespondenceTable()
## Not run: # List correspondence tables available in CELLAR ct <- correspondenceTableList(endpoint = "CELLAR") head(ct) # List correspondence tables available in FAO ct_fao <- correspondenceTableList(endpoint = "FAO") head(ct_fao) # Get correspondence tables from both repositories ct_all <- correspondenceTableList(endpoint = "ALL") names(ct_all) ## End(Not run)## Not run: # List correspondence tables available in CELLAR ct <- correspondenceTableList(endpoint = "CELLAR") head(ct) # List correspondence tables available in FAO ct_fao <- correspondenceTableList(endpoint = "FAO") head(ct_fao) # Get correspondence tables from both repositories ct_all <- correspondenceTableList(endpoint = "ALL") names(ct_all) ## End(Not run)
Queries the selected service (e.g., CELLAR or FAO) to retrieve the structure of a classification scheme (levels, depth, and concept rows), with language-aware label fallback. You can return a compact summary table, a detailed concept-level table, or both.
The function first resolves the input classification using the in-package registry
returned by classificationList(), then retrieves the structure and returns
it as data frames. Network errors are handled with retry logic.
dataStructure( endpoint, prefix, conceptScheme = NULL, language = "en", showQuery = FALSE, return = c("summary", "details", "both"), timeout_sec = 60, retries = 3 )dataStructure( endpoint, prefix, conceptScheme = NULL, language = "en", showQuery = FALSE, return = c("summary", "details", "both"), timeout_sec = 60, retries = 3 )
endpoint |
Character scalar. Accepted values are |
prefix |
Character. A classification prefix used in the registry
(e.g. |
conceptScheme |
Character or |
language |
Character scalar. Two-letter language code used to select
labels (default: |
showQuery |
Logical. If |
return |
One of |
timeout_sec |
Numeric. HTTP timeout in seconds. Default: |
retries |
Integer. Number of automatic retries on transient HTTP errors.
Default: |
Resolution of the scheme
The function calls classificationList(endpoint) and tries to match either
prefix or conceptScheme using normalized keys. On success it extracts
the classification namespace and scheme identifier needed to compose requests.
Language fallback
Labels for both the classification levels and concepts follow a multi-step
fallback: requested language -> "en" -> no-language literal -> local name.
A data.frame (for "summary" or "details"), or a
list(summary, details) when return = "both".
Columns:
All returned tables include a leading Prefix column (lower-case token).
summary: Concept_Scheme, Depth, Level, Count.
details: Concept, Code, Label, Depth, Level,
BroaderList, BroaderCodeList.
If showQuery = TRUE, the return value is a list containing:
resolved: metadata (endpoint, inputs, scheme_id, ns_uri, and optionally title)
the query text (one or two elements depending on return)
the retrieved table(s)
Errors if endpoint is not "CELLAR" or "FAO", or if the
scheme cannot be resolved from the registry.
Warns (and still returns rows) if no explicit levels are declared.
Propagates HTTP errors from the service after retry policy.
## Not run: ds_cn <- dataStructure( endpoint = "CELLAR", prefix = "cn2022", conceptScheme = "cn2022", language = "en", return = "summary" ) head(ds_cn) ## End(Not run)## Not run: ds_cn <- dataStructure( endpoint = "CELLAR", prefix = "cn2022", conceptScheme = "cn2022", language = "en", return = "summary" ) head(ds_cn) ## End(Not run)
Create a candidate correspondence table between two classifications based on their correspondences with intermediate classifications
newCorrespondenceTable( Tables, Reference = "none", MismatchTolerance = 0.2, Redundancy_trim = TRUE, Progress = TRUE )newCorrespondenceTable( Tables, Reference = "none", MismatchTolerance = 0.2, Redundancy_trim = TRUE, Progress = TRUE )
Tables |
A string of type character containing the name of a csv file which contains the names of the files that contain the classifications and the intermediate correspondence tables OR a list of vectors with the names of the dataframes classifications and the intermediate correspondence tables (see "Details" below). |
Reference |
The reference classification among A and B. If a classification is the reference to the other, and hence
hierarchically superior to it, each code of the other classification is expected to be mapped to at most one code
of the reference classification. The valid values are |
MismatchTolerance |
The maximum acceptable proportion of rows in the candidate correspondence table which contain
no code for classification A or no code for classification B. The default value is |
Redundancy_trim |
An argument in the function containing the logical values |
Progress |
An argument in the function containing the logical values |
File and file name requirements:
The file that corresponds to argument Tables and the files to which the contents of Tables
lead, must be in csv format with comma as delimiter. If full paths are not provided, then these files must
be available in the working directory. No two filenames provided must be identical.
The file that corresponds to argument Tables must contain filenames, and nothing else, in
a × table, where , a positive integer, is the number of "pivot" classifications.
The cells in the main diagonal of the table provide the filenames of the files which contain, with this order,
the classifications , , and . The off-diagonal directly above the main
diagonal contains the filenames of the files that contain, with this order, the correspondence tables
, {:, } and . All other cells of the table
must be empty.
Classification table requirements:
Each of the files that contain classifications must contain at least one column and at least two rows. The first column contains the codes of the respective classification. The first row contains column headers. The header of the first column is the name of the respective classification (e.g., "CN 2021").
The classification codes contained in a classification file (expected in its first column as mentioned above) must be unique. No two identical codes are allowed in the column.
If any of the files that contain classifications has additional columns the first one of them is assumed to contain the labels of the respective classification codes.
Correspondence table requirements:
The files that contain correspondence tables must contain at least two columns and at least two rows.
The first column of the file that contains A: contains the codes of classification A. The second column
contains the codes of classification . Similar requirements apply to the files that contain
:, and B:. The first row of each of the files that contain
correspondence tables contains column headers. The names of the first two columns are the names of the respective
classifications.
The pairs of classification codes contained in a correspondence table file (expected in its first two columns as mentioned above) must be unique. No two identical pairs of codes are allowed in the first two columns.
Interdependency requirements:
At least one code of classification A must appear in both the file of classification A and the file of
correspondence table A:.
At least one code of classification B must appear in both the file of classification B and the file of
correspondence table B:, where , , is the number of pivot classifications.
If there is only one pivot classification, , at least one code of it must appear in both the file of
correspondence table A: and the file of correspondence table B:.
If the pivot classifications are with then at least one code of must appear in
both the file of correspondence table A: and the file of correspondence table :, at least one
code of each of the , (if ) must appear in both the file of correspondence
table : and the file of correspondence table :, and at least one code of
must appear in both the file of correspondence table : and the file of correspondence table
B:.
Mismatch tolerance:
The ratio that is compared with MismatchTolerance has as numerator the number of rows in the candidate
correspondence table which contain no code for classification A or no code for classification B and as denominator
the total number of rows of this table. If the ratio exceeds MismatchTolerance the execution of the function
is halted.
If any of the conditions required from the arguments is violated an error message is produced and execution is stopped.
newCorrespondenceTable() returns a list with two elements, both of which are data frames.
The first element is the candidate correspondence table A:B, including the codes of all "pivot" classifications, augmented with flags "Review" (if applicable), "Redundancy", "Unmatched", "NoMatchFromA", "NoMatchFromB" and with all the additional columns of the classification and intermediate correspondence table files.
The second element contains the names of classification A, the "pivot" classifications and classification B as read from the top left-hand side cell of the respective input files.
The "Review" flag is produced only if argument Reference has been set equal to "A" or "B". For
each row of the candidate correspondence table, if Reference = "A" the value of "Review" is equal to
1 if the code of B maps to more than one code of A, and 0 otherwise. If Reference = "B"
the value of "Review" is equal to 1 if the code of A maps to more than one code of B, and 0 otherwise.
The value of the flag is empty if the row does not contain a code of A or a code of B.
For each row of the candidate correspondence table, the value of "Redundancy" is equal to 1 if the row
contains a combination of codes of A and B that also appears in at least one other row of the candidate
correspondence table.
When "Redundancy_Trim" is equal to FALSE the "Redundancy_keep" flag is created to identify with value 1
the records that will be kept if trimming is performed.
For each row of the candidate correspondence table, the value of "Unmatched" is equal to 1 if the row
contains a code of A but no code of B or if it contains a code of B but no code of A. The value of the flag is
0 if the row contains codes for both A and B.
For each row of the candidate correspondence table, the value of "NoMatchFromA" is equal to 1 if the row
contains a code of A that appears in the table of classification A but not in correspondence table A:. The
value of the flag is 0 if the row contains a code of A that appears in both the table of classification A and
correspondencetable A:. Finally, the value of the flag is empty if the row contains no code of A or if it
contains a code of A that appears in correspondence table A: but not in the table of classification A.
For each row of the candidate correspondence table, the value of "NoMatchFromB" is equal to 1 if the row
contains a code of B that appears in the table of classification B but not in correspondence table B:. The
value of the flag is 0 if the row contains a code of B that appears in both the table of classification B and
correspondence table B:. Finally, the value of the flag is empty if the row contains no code of B or if it
contains a code of B that appears in correspondence table B: but not in the table of classification B.
The argument "Redundancy_trim" is used to delete all the redundancies which are mapping correctly.
The valid logical values for this argument in the candidate correspondence table are TRUE or FALSE.
If the selected value is TRUE, all redundant records are removed and kept exactly one record for each unique combination.
For this retained record, the codes, the label and the supplementary information of the pivot classifications are replaced with
'multiple'. If the multiple infomration of the pivot classifications are the same, their value will not be replaced.
If the selected value is FALSE, no trimming is executed so redundant records are shown, together with the redundancy flag.
If the logical values are missing the implementation of the function will stop.
If the user wish to access the csv files with the sample data, they have two options:
Option 1: Unpack into any folder of their choice the tar.gz file into which the package has arrived. All sample datasets may be found in the "inst/extdata" subfolder of this folder.
Option 2: Go to the "extdata" subfolder of the folder in which the package has been installed in their PC's R
library. All sample datasets may be found there.
{ ## Not run: code_in <- data.frame(code_in = c("A","B")) code_mid <- data.frame(code_mid = c("B","C","D")) in_to_mid <- data.frame(code_in = c("A","B") , code_out = c("C","B")) code_out <- data.frame(code_out = c("B","C","D","E")) mid_to_out <- data.frame(code_mid = c("B","C","D") , code_out = c("B","C","D")) write.csv(code_in, file = "code_in.csv", row.names = FALSE) write.csv(code_mid, file = "code_mid.csv", row.names = FALSE) write.csv(in_to_mid, file = "in_to_mid.csv", row.names = FALSE) write.csv(mid_to_out, file = "mid_to_out.csv", row.names = FALSE) write.csv(code_out, file = "code_out.csv", row.names = FALSE) m <- matrix("",3,3) diag(m) <- c("code_in.csv","code_mid.csv","code_out.csv") m[1,2] <- "in_to_mid.csv" m[2,3] <- "mid_to_out.csv" write.table(m, file="merge_sequence.csv",row.names=FALSE,col.names=FALSE,sep=",") out <- newCorrespondenceTable("merge_sequence.csv",MismatchTolerance = 0.9) print(out$newCorrespondenceTable) print(out$classificationNames) ## End(Not run) }{ ## Not run: code_in <- data.frame(code_in = c("A","B")) code_mid <- data.frame(code_mid = c("B","C","D")) in_to_mid <- data.frame(code_in = c("A","B") , code_out = c("C","B")) code_out <- data.frame(code_out = c("B","C","D","E")) mid_to_out <- data.frame(code_mid = c("B","C","D") , code_out = c("B","C","D")) write.csv(code_in, file = "code_in.csv", row.names = FALSE) write.csv(code_mid, file = "code_mid.csv", row.names = FALSE) write.csv(in_to_mid, file = "in_to_mid.csv", row.names = FALSE) write.csv(mid_to_out, file = "mid_to_out.csv", row.names = FALSE) write.csv(code_out, file = "code_out.csv", row.names = FALSE) m <- matrix("",3,3) diag(m) <- c("code_in.csv","code_mid.csv","code_out.csv") m[1,2] <- "in_to_mid.csv" m[2,3] <- "mid_to_out.csv" write.table(m, file="merge_sequence.csv",row.names=FALSE,col.names=FALSE,sep=",") out <- newCorrespondenceTable("merge_sequence.csv",MismatchTolerance = 0.9) print(out$newCorrespondenceTable) print(out$classificationNames) ## End(Not run) }
Returns a character vector of 'PREFIX ...' lines used by package queries. The result always includes core vocabularies (SKOS/XKOS/RDFS/DC/RDF/XSD), and—when available—adds classification scheme namespaces discovered via 'classificationList(endpoint)'. You can filter to specific schemes with the 'prefix' argument.
prefixList(endpoint, prefix = NULL)prefixList(endpoint, prefix = NULL)
endpoint |
Character. Endpoint name (case-insensitive). Typically '"CELLAR"' or '"FAO"'. The set of supported names is defined by the package's internal endpoint registry (see '.valid_endpoints()'). |
prefix |
Character vector of scheme tokens to include (e.g., 'c("cn2022","nace2")'). If 'NULL' (default), returns all discovered schemes. |
- This function has no side effects and does not perform network I/O by itself. Discovery is delegated to 'classificationList()'.
- If 'classificationList()' is unavailable or fails, the function returns only the core vocabularies and emits a warning.
A character vector of SPARQL PREFIX declarations. Use 'paste(result, collapse = "\n")' to inject into a query string.
## Not run: # All known scheme prefixes for the endpoint pl <- prefixList("CELLAR") cat(paste(pl, collapse = "\n")) # Only include CN2022 and NACE2 if they exist pl2 <- prefixList("CELLAR", prefix = c("cn2022", "nace2")) ## End(Not run)## Not run: # All known scheme prefixes for the endpoint pl <- prefixList("CELLAR") cat(paste(pl, collapse = "\n")) # Only include CN2022 and NACE2 if they exist pl2 <- prefixList("CELLAR", prefix = c("cn2022", "nace2")) ## End(Not run)
Retrieves the complete structure of a statistical classification published in
either CELLAR or FAO repositories based on a Prefix–ConceptScheme pair obtained
using classificationList().
The resulting table is suitable for classification browsing, integrity checks, hierarchical analysis, documentation, and downstream correspondence mapping.
retrieveClassificationTable( endpoint, prefix, conceptScheme, language = "en", level = "ALL", showQuery = FALSE, knownSchemes = NULL, preferMappingOnly = FALSE )retrieveClassificationTable( endpoint, prefix, conceptScheme, language = "en", level = "ALL", showQuery = FALSE, knownSchemes = NULL, preferMappingOnly = FALSE )
endpoint |
Character. Repository to query. Must be either "CELLAR" or "FAO". |
prefix |
Character. Classification prefix used for matching and URI resolution (e.g. "cn2022", "cpc21", "isic4"). |
conceptScheme |
Character. Local identifier of the scheme (often identical to
|
language |
Character. Desired language for labels, scope notes and exclusion notes. Default: "en". |
level |
Character. One of:
|
showQuery |
Logical.
|
knownSchemes |
Optional. A data.frame supplying authoritative mappings of the
form |
preferMappingOnly |
Logical. If TRUE, the function never attempts SPARQL
discovery and uses only information in |
If showQuery = FALSE, a data.frame with one row per concept and columns:
Concept: full concept URI;
Code: skos:notation coerced to string;
Label: preferred or alternative label in the requested language;
Depth: xkos:depth if available;
BroaderList: broader concept URIs (pipe-separated);
BroaderCodeList: broader concept notations (pipe-separated);
IncludeNotes: skos:scopeNote values (double-pipe-separated);
ExcludeNotes: xkos:exclusionNote values (double-pipe-separated).
If showQuery = TRUE, returns a list with:
SPARQL.query – the executed SPARQL string;
scheme_uri – the resolved ConceptScheme URI;
ClassificationTable – the data.frame described above.
## Not run: # Retrieve full CN2022 classification (English) cn <- retrieveClassificationTable( endpoint = "CELLAR", prefix = "cn2022", conceptScheme = "cn2022", language = "en" ) # Retrieve CPC v2.1 from FAO and inspect query out <- retrieveClassificationTable( endpoint = "FAO", prefix = "cpc21", conceptScheme = "cpc21", showQuery = TRUE ) cat(out$SPARQL.query) ## End(Not run)## Not run: # Retrieve full CN2022 classification (English) cn <- retrieveClassificationTable( endpoint = "CELLAR", prefix = "cn2022", conceptScheme = "cn2022", language = "en" ) # Retrieve CPC v2.1 from FAO and inspect query out <- retrieveClassificationTable( endpoint = "FAO", prefix = "cpc21", conceptScheme = "cpc21", showQuery = TRUE ) cat(out$SPARQL.query) ## End(Not run)
Downloads a correspondence table, a mapping of codes, from one statistical classification (A) to another (B) from an online service (CELLAR or FAO), and returns it as a plain data frame. The function has no side effects (it does not read or write files).
A correspondence table tells you which codes in classification A relate
to which codes in classification B (e.g., NACE2 → CPA21).
retrieveCorrespondenceTable( endpoint, prefix, ID_table, language = "en", showQuery = FALSE )retrieveCorrespondenceTable( endpoint, prefix, ID_table, language = "en", showQuery = FALSE )
endpoint |
Character. The online service to query. Case-insensitive.
Supported values are those returned by the internal endpoint registry
(e.g., |
prefix |
Character. Catalogue prefix where the correspondence is
published (e.g., |
ID_table |
Character. Identifier of the correspondence, typically of
the form |
language |
Character. Preferred label language as a BCP‑47 code.
Defaults to |
showQuery |
Logical. If |
When to use this. Use retrieveCorrespondenceTable() when you
know (or have discovered) the prefix and ID_table of a
correspondence and want the mapping rows as a standard data frame.
How it works (short).
Validates the selected service (endpoint).
Optionally resolves a stable anchor via correspondenceTableList().
Builds a SPARQL query returning source/target codes, labels, notes, and link URI.
Issues an HTTP request and parses the response into a data frame.
A data.frame where each row represents one mapping from the source
classification (A) to the target classification (B).
Columns (when available):
CorrespondenceID: Requested identifier (e.g., "NACE2_CPA21").
Prefix: Catalogue prefix (constant per table).
A: Source code (A parsed from ID_table).
B: Target code (B parsed from ID_table).
Label_A, Label_B: Human‑readable labels in language.
Include_A, Exclude_A, Include_B, Exclude_B: Notes, if available.
Comment: Free‑text comment on the mapping, if any.
URL: URI of the mapping association.
Attributes attached for traceability:
endpoint, endpoint_url, prefix, ID_table
A, B, language, normalized_codes, anchor_mode, anchor
The function focuses on retrieval. Any code normalization is left to downstream steps.
## Not run: # 1) Discover available correspondences at CELLAR ct <- correspondenceTableList(endpoint = "CELLAR") subset(ct, grepl("NACE2", ID) & grepl("CPA21", ID)) # 2) Retrieve one correspondence (English labels) x <- retrieveCorrespondenceTable( endpoint = "CELLAR", prefix = "nace2", ID_table = "NACE2_CPA21", language = "en" ) head(x) # 3) Save to CSV explicitly if needed (no automatic writing) # utils::write.csv(x, "NACE2_CPA21.csv", row.names = FALSE) # 4) Inspect the generated SPARQL query dbg <- retrieveCorrespondenceTable( endpoint = "FAO", prefix = "cpa21", ID_table = "NACE2_CPA21", showQuery = TRUE ) message(substr(dbg$SPARQL.query, 1, 400), "...") ## End(Not run)## Not run: # 1) Discover available correspondences at CELLAR ct <- correspondenceTableList(endpoint = "CELLAR") subset(ct, grepl("NACE2", ID) & grepl("CPA21", ID)) # 2) Retrieve one correspondence (English labels) x <- retrieveCorrespondenceTable( endpoint = "CELLAR", prefix = "nace2", ID_table = "NACE2_CPA21", language = "en" ) head(x) # 3) Save to CSV explicitly if needed (no automatic writing) # utils::write.csv(x, "NACE2_CPA21.csv", row.names = FALSE) # 4) Inspect the generated SPARQL query dbg <- retrieveCorrespondenceTable( endpoint = "FAO", prefix = "cpa21", ID_table = "NACE2_CPA21", showQuery = TRUE ) message(substr(dbg$SPARQL.query, 1, 400), "...") ## End(Not run)
Update the correspondence table between two classifications when one of them has been updated.
updateCorrespondenceTable( A, B, AStar, AB, AAStar, Reference = "none", MismatchToleranceB = 0.2, MismatchToleranceAStar = 0.2, Redundancy_trim = TRUE )updateCorrespondenceTable( A, B, AStar, AB, AAStar, Reference = "none", MismatchToleranceB = 0.2, MismatchToleranceAStar = 0.2, Redundancy_trim = TRUE )
A |
A data frame containing the original classification A. The first column holds the codes; optional subsequent columns may hold labels and supplementary info. |
B |
A data frame containing the target classification B. Same structure expectations as |
AStar |
A data frame containing the updated version A*. Same structure expectations as |
AB |
A data frame containing the previous correspondence table A:B (at least two columns: A-codes, B-codes). |
AAStar |
A data frame containing the concordance table A:A* (at least two columns: A-codes, A*-codes). |
Reference |
The reference classification among A and B. Valid values: |
MismatchToleranceB |
Maximum acceptable proportion of rows in the updated table with no code of B
among those that have a code for A and/or A*. Default |
MismatchToleranceAStar |
Maximum acceptable proportion of rows in the updated table with no code of A*
among those that have a code for A and/or B. Default |
Redundancy_trim |
Logical. If |
*Input data frame requirements*
A, B, AStar: at least 1 column and at least 1 row. The first column contains the codes; the first row is the header.
AB, AAStar: at least 2 columns and at least 1 row; the first two columns contain code pairs; the first row is the header.
Codes in the first column of A, B, AStar must be unique (no duplicates).
Pairs (first two columns) in AB and AAStar must be unique (no duplicate pairs).
If additional columns are present in classification or correspondence tables, the second column is treated as a label.
All columns are coerced to character.
*Minimum interdependencies*
At least one code of A must appear in both AAStar and AB.
At least one code of A* must appear in both AStar and AAStar.
At least one code of B must appear in both B and AB.
*Mismatch tolerance*
If the share of rows with NoMatchToAStar == 1 exceeds MismatchToleranceAStar, the function stops with an error.
If the share of rows with NoMatchToB == 1 exceeds MismatchToleranceB, the function stops with an error.
updateCorrespondenceTable() returns a list with two data frames:
updateCorrespondenceTable: the updated correspondence A*:B with flags "CodeChange", "Review" (if applicable),
"Redundancy", "NoMatchToAStar", "NoMatchToB", "NoMatchFromAStar", "NoMatchFromB", "LabelChange",
plus any additional columns coming from A, B, AStar, AB, AAStar.
classificationNames: the names of the classifications (A, B, A*) read from the first column headers.
For each row of the updated correspondence table, the value of "CodeChange" is equal to 1 if the code of A (or A*)
contained in this row maps -in this or any other row of the table- to a different code of A* (or A), otherwise the
"CodeChange" is equal to 0. The value of "CodeChange" is empty if either the code of A, or the code of A*, or both are missing.
The "Review" flag is produced only if argument Reference has been set equal to "A" or "B".
For each row of the updated correspondence table, if Reference = "A" the value of "Review" is equal to
1 if the code of B maps to more than one code of A*, and 0 otherwise. If Reference = "B" the
value of "Review" is equal to 1 if the code of A* maps to more than one code of B, and 0 otherwise. The value
of the flag is empty if either the code of A*, or the code of B, or both are missing.
For each row of the updated correspondence table, the value of "Redundancy" is equal to 1 if the row contains
a combination of codes of A* and B that also appears in at least one other row of the updated correspondence table. The
value of the flag is empty if both the code of A* and the code of B are missing.
When "Redundancy_Trim" is equal to FALSE the "Redundancy_keep" flag is created to identify with value 1
the records that will be kept if trimming is performed.
For each row of the updated correspondence table, the value of "NoMatchToAStar" is equal to 1 if there is a
code for A, for B, or for both, but no code for A*. The value of the flag is 0 if there are codes for both A and
A* (regardless of whether there is a code for B or not). Finally, the value of "NoMatchToAStar" is empty if neither A nor B
have a code in this row.
For each row of the updated correspondence table, the value of "NoMatchToB" is equal to 1 if there is a code
for A, for A*, or for both, but no code for B. The value of the flag is 0 if there are codes for both A and B
(regardless of whether there is a code for A* or not). Finally, the value of "NoMatchToB" is empty if neither A nor
A* have a code in this row.
For each row of the updated correspondence table, the value of "NoMatchFromAStar" is equal to 1 if the row
contains a code of A* that appears in the table of classification A* but not in the concordance table A:A*. The value of
the flag is 0 if the row contains a code of A* that appears in both the table of classification
A* and the concordance table A:A*. Finally, the value of the flag is empty if the row contains no code of A* or if it
contains a code of A* that appears in the concordance table A:A* but not in the table of classification A*.
For each row of the updated correspondence table, the value of "NoMatchFromB" is equal to 1 if the row
contains a code of B that appears in the table of classification B but not in the correspondence table A:B. The value of
the flag is 0 if the row contains a code of B that appears in both the table of classification B and the
correspondence table A:B. Finally, the value of the flag is empty if the row contains no code of B or if it contains a code
of B that appears in the correspondence table A:B but not in the table of classification B.
For each row of the updated correspondence table, the value of "LabelChange" is equal to 1 if the labels of
the codes of A and A* are different, and 0 if they are the same. Finally, the value of "LabelChange" is empty if
either of the labels, or both labels, are missing. Lower and upper case are considered the same, and punctuation characters
are ignored when comparing code labels.
The argument "Redundancy_trim" is used to delete all the redundancies which are mapping correctly. If the analysis
concludes that the A*code / Bcode mapping is correct for all cases involving redundancies, then an action is needed to remove
the redundancies. If the selected value is TRUE, all redundant records are removed and kept only one record for each unique
combination. For this record retained, the Acodes, the Alabel and the Asupp information is replaced with ‘multiple’. If the multiple
A records are the same, their value will not be replaced. If the selected value is FALSE, no trimming is executed so redundant
records are shown, together with the redundancy flag.
## Not run: # Read CSVs outside, pass data frames in: A_df <- utils::read.csv(system.file("extdata/test", "NAICS2017.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AStar_df <- utils::read.csv(system.file("extdata/test", "NAICS2022.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") B_df <- utils::read.csv(system.file("extdata/test", "NACE.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AB_df <- utils::read.csv(system.file("extdata/test", "NAICS2017_NACE.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AAStar_df <- utils::read.csv(system.file("extdata/test", "NAICS2017_NAICS2022.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") UPC <- updateCorrespondenceTable( A = A_df, B = B_df, AStar = AStar_df, AB = AB_df, AAStar = AAStar_df, Reference = "none", MismatchToleranceB = 0.5, MismatchToleranceAStar = 0.3, Redundancy_trim = FALSE ) summary(UPC) head(UPC$updateCorrespondenceTable) UPC$classificationNames ## End(Not run)## Not run: # Read CSVs outside, pass data frames in: A_df <- utils::read.csv(system.file("extdata/test", "NAICS2017.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AStar_df <- utils::read.csv(system.file("extdata/test", "NAICS2022.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") B_df <- utils::read.csv(system.file("extdata/test", "NACE.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AB_df <- utils::read.csv(system.file("extdata/test", "NAICS2017_NACE.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") AAStar_df <- utils::read.csv(system.file("extdata/test", "NAICS2017_NAICS2022.csv", package = "correspondenceTables"), sep = ",", header = TRUE, check.names = FALSE, colClasses = "character", encoding = "UTF-8") UPC <- updateCorrespondenceTable( A = A_df, B = B_df, AStar = AStar_df, AB = AB_df, AAStar = AAStar_df, Reference = "none", MismatchToleranceB = 0.5, MismatchToleranceAStar = 0.3, Redundancy_trim = FALSE ) summary(UPC) head(UPC$updateCorrespondenceTable) UPC$classificationNames ## End(Not run)