Title: | Creating Correspondence Tables Between Two Statistical Classifications |
---|---|
Description: | A candidate correspondence table between two classifications can be created when there are correspondence tables leading from the first classification to the second one via intermediate 'pivot' classifications. The correspondence table between two statistical classifications can be updated when one of the classifications gets updated to a new version. |
Authors: | Vasilis Chasiotis [aut] (Department of Statistics, Athens University of Economics and Business), Photis Stavropoulos [aut] (Quantos S.A. Statistics and Information Systems), Martin Karlberg [aut], Mátyás Mészáros [cre], Martina Patone [aut], Erkand Muraku [aut], Clement Thomas [aut], Loic Bienvenue [aut] |
Maintainer: | Mátyás Mészáros <[email protected]> |
License: | EUPL |
Version: | 0.8.2 |
Built: | 2024-11-01 03:55:36 UTC |
Source: | https://github.com/eurostat/correspondencetables |
Retrieve a list of classification tables in CELLAR, FAO or both.
classEndpoint(endpoint)
endpoint |
A string of type character containing the endpoint where the table is stored.
The valid values are |
classEndpoint()
returns a table with information needed to retrieve the classification table:
Prefix name: the SPARQL instruction for a declaration of a namespace prefix
Conceptscheme: taxonomy of the SKOS object to be retrieved
URI: the URL from which the SPARQL query was retrieved
Name: the name of the table retrieved
{
  endpoint = "ALL"
  list_data = classEndpoint(endpoint)
}
The purpose of this function is to provide a comprehensive summary of the data structure of each classification in the CELLAR and FAO endpoints. The summary includes the prefix name, URI, key, concept scheme, and title associated with each classification.
classificationEndpoint(endpoint = "ALL")
endpoint |
SPARQL endpoints provide a standardized way to access data sets,
making it easier to retrieve specific information or perform complex queries on linked data. This is an optional
parameter, which by default is set to |
classificationEndpoint()
returns a table with information needed to retrieve the classification table:
Prefix name: the SPARQL instruction for a declaration of a namespace prefix
Conceptscheme: taxonomy of the SKOS object to be retrieved
URI: the URL from which the SPARQL query was retrieved
Name: the name of the table retrieved
{
  endpoint = "ALL"
  list_data = classificationEndpoint(endpoint)
}
The purpose of this function is to perform quality control checks on statistical classifications. It checks the compliance of classifications with structural rules and provides informative error messages for violations. The function requires input files containing code and label information for each classification position. It verifies the formatting requirements, uniqueness of codes, fullness of hierarchy, uniqueness of labels, hierarchical label dependencies, single child code compliance, and sequencing of codes. The function generates a QC output data frame with the classification data, hierarchical level, code segments, and test outcomes. Additionally, it allows exporting the output to a CSV file. Overall, the classificationQC function helps ensure the integrity and accuracy of statistical classifications.
classificationQC( classification, lengthsFile, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL, XLSXout = FALSE )
classification |
Refers to a classification in a CSV file or an R dataframe structured with two columns, consisting
of codes and labels, respectively. If the classification is provided as a CSV file, it should be stored in the working directory (as
defined using |
fullHierarchy |
It is used to test the fullness of hierarchy. If the parameter |
labelUniqueness |
It is used to test that positions at the same hierarchical level have unique labels. If set to |
labelHierarchy |
It is used to ensure that the hierarchical structure of labels is respected.
When set to |
singleChildCode |
It refers to a CSV file with specific formatting to define valid codes for each level. If this parameter is not |
sequencing |
It refers to a CSV file to define the admissible codes for multiple children at each level. If this parameter
is not |
XLSXout |
The valid values are |
lengthsFile |
Refers to a CSV file or an R dataframe (one record per hierarchical level) containing the initial and last position of the segment of the code specific to that level. The number of lines of this CSV file or R dataframe also implicitly defines the number of hierarchical levels of the classification. This is a mandatory argument. |
classificationQC()
returns a list of dataframes identifying the cases that possibly violate the formatting requirements. The
dataframes returned depend on the rules checked. The dataframes produced are:
QC_output The dataset includes all the original records in the classification. Column "Level" refers to the hierarchical level of each position. Each code is parsed into segment_k (column "Segmentk") and code_k (column "Codek"), corresponding to the segment and the code at hierarchical level k, respectively. Additional columns are included to flag whether each position behaves correctly. These are
Orphan: if fullHierarchy is set to FALSE, an "orphan" is a position at a hierarchical level (j) greater than 1 that lacks a parent at the hierarchical level (j-1) immediately above it. Orphan positions are marked with a value of 1 in the "QC output" column, indicating their orphan status. Otherwise, they are assigned a value of 0.
Childless: if fullHierarchy is set to TRUE, a "childless" position is one at a hierarchical level (j) less than k that lacks a child at the hierarchical level (j+1) immediately below it. Childless positions are marked with a value of 1 indicating their childless status. Otherwise, they are assigned a value of 0.
DuplicateLabel: new column in the output that flags positions involved in duplicate label situations (where multiple positions share the same label at the same hierarchical level) by assigning them a value of 1, while positions with unique labels are assigned a value of 0.
SingleChildMismatch: column in the output that provides information about label hierarchy consistency in a hierarchical classification system. It takes the following values. Value 1: mismatched labels between a parent and its single child. Value 9: parent-child pairs with matching labels, but the parent has multiple children.
SingleCodeError: column serves as a flag indicating whether a position is a single child and whether the corresponding "singleCode" contains the level j segment. A value of 1 signifies a mismatch, while a value of 0 indicates compliance with the coding rules
MultipleCodeError: column serves as a flag indicating whether a position is not a single child and whether the corresponding "multipleCodej" contains the level j segment. A value of 1 signifies a mismatch, while a value of 0 indicates compliance with the coding rules
GapBefore: takes the value 1 if there is a missing child before the position in the 123456789 series, and the value 0 otherwise.
LastSibling: takes the value 1 when the position is the last child in the 123456789 series, and the value 0 otherwise.
QC_noLevels A subset of the QC_output dataframe including only records for which the level is not defined. In general, if this dataframe is not empty, it suggests that either the classification or the lengths file is not correctly specified.
QC_orphan A subset of the QC_output dataframe including only records that have no parents at the higher hierarchical level.
QC_childless A subset of the QC_output dataframe including only records that have no children at the lower hierarchical level.
QC_duplicatesLabel A subset of the QC_output dataframe including only records that have duplicated label in the same hierarchical level.
QC_duplicatesCode A subset of the QC_output dataframe including only records that have the same codes.
QC_singleChildMismatch A subset of the QC_output dataframe including only records that are single child and have different labels from their parents or that are multiple children and have same labels to their parents.
QC_singleCodeError A subset of the QC_output dataframe including only records that are single children and have been wrongly coded (not following the rule provided in the 'singleChildCode' CSV file).
QC_multipleCodeError A subset of the QC_output dataframe including only records that are multiple children and have been wrongly coded (not following the rule provided in the 'singleChildCode' CSV file).
QC_gapBefore A subset of the QC_output dataframe including only records that are multiple children and have gap before in the sequencing provided in the 'sequencing' CSV file.
QC_lastSibling A subset of the QC_output dataframe including only records that are multiple and last children following the sequencing provided in the 'sequencing' CSV file.
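The "Orphan" and "DuplicateLabel" flags described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy classification, the assumption that the level equals the code length, and all object names are hypothetical.

```r
# Toy classification: codes and labels (hypothetical data)
classification <- data.frame(
  Code  = c("A", "A1", "A2", "B1", "C", "C1"),
  Label = c("Agriculture", "Crops", "Crops", "Mining support",
            "Construction", "Buildings"),
  stringsAsFactors = FALSE
)

# Assumption: level 1 codes have 1 character, level 2 codes have 2
classification$Level <- nchar(classification$Code)

# Parent code of a level 2 position = its first character
parent <- substr(classification$Code, 1, 1)
level1 <- classification$Code[classification$Level == 1]

# Orphan: a level 2 position whose parent is absent at level 1
classification$Orphan <-
  as.integer(classification$Level == 2 & !(parent %in% level1))

# DuplicateLabel: the same label is shared within the same level
key <- paste(classification$Level, classification$Label)
classification$DuplicateLabel <- as.integer(key %in% key[duplicated(key)])

print(classification)
```

Here "B1" is flagged as an orphan because no position "B" exists, and "A1" and "A2" are flagged because they share the label "Crops" at the same level.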
{
  prefix = "nace2"
  conceptScheme = "nace2"
  endpoint = "CELLAR"
  lengthsTable = lengthsFile(endpoint, prefix, conceptScheme, correction = TRUE)
  classification = retrieveClassificationTable(prefix, endpoint, conceptScheme, level = "ALL")$ClassificationTable
  classification = classification[, c(1, 2)]
  classification = correctionClassification(classification)
  # Pass the lengths table computed above, not the lengthsFile() function itself
  Output = classificationQC(classification, lengthsTable, fullHierarchy = TRUE, labelUniqueness = TRUE, labelHierarchy = TRUE, singleChildCode = NULL, sequencing = NULL)
  View(Output$QC_output)
  View(Output$QC_noLevels)
  View(Output$QC_orphan)
  View(Output$QC_childless)
  View(Output$QC_duplicatesLabel)
  View(Output$QC_duplicatesCode)
  View(Output$QC_singleChildMismatch)
  View(Output$QC_singleCodeError)
  View(Output$QC_multipleCodeError)
  View(Output$QC_gapBefore)
  View(Output$QC_lastSibling)
}
The aim of this function is to provide a table showing the different codes and labels for each classification
correctionClassification(classification)
classification |
it returns a dataframe with two columns corrected according to the classification of CELLAR & FAO. |
correctionClassification()
returns a table with information needed to retrieve the classification table:
Classification Code name (e.g. nace2): the code of each object
Classification Label: corresponding name of each object
{
  prefix = "nace2"
  conceptScheme = "nace2"
  endpoint = "CELLAR"
  classification = retrieveClassificationTable(prefix, endpoint, conceptScheme, level = "ALL")$ClassificationTable
  correct_classification = correctionClassification(classification)
  View(correct_classification)
}
Provides an overview of all the correspondence tables available in the CELLAR and FAO repositories.
correspondenceList(endpoint)
endpoint |
The SPARQL Endpoint. The valid values are |
correspondenceList()
returns a list of the correspondence tables available with prefix name, ID, Source classification,
Target classification, Table name and URI.
{
  corr_list = correspondenceList("ALL")
}
To retrieve information about the level names, their hierarchy and the number of records for all the classifications available in the repositories (CELLAR and FAO), the function "structureData()" can be used.
dataStructure(prefix, conceptScheme, endpoint, language = "en")
prefix |
Prefixes are typically defined at the beginning of a SPARQL query and are used throughout the query to make it more concise and easier to read. Multiple prefixes can be defined in a single query to cover different namespaces used in the data set. The function 'classificationEndpoint()' can be used to generate the prefixes for the selected classification table. |
conceptScheme |
Refers to a unique identifier associated with a specific classification table. The conceptScheme can be obtained by using the "classificationEndpoint()" function. |
endpoint |
SPARQL endpoints provide a standardized way to access data sets,
making it easier to retrieve specific information or perform complex queries on linked data.
The valid values are |
language |
Refers to the language used for the label, include and exclude information in the selected classification table. By default it is set to "en". This is an optional argument. |
structureData()
returns the structure of a classification table from CELLAR and FAO in the form of a table with the following columns:
Concept_Scheme: taxonomy of the SKOS object to be retrieved
Level: the levels of the objects in the collection
Depth: identify the hierarchy of each level
Count: the number of objects retrieved in each level
{
  ## Obtain a list including the structure of each classification available
  ## CELLAR
  data_CELLAR = list()
  endpoint = "CELLAR"
  list_data = classificationEndpoint("ALL")
  for (i in 1:nrow(list_data$CELLAR)) {
    prefix = list_data$CELLAR[i, 1]
    conceptScheme = list_data$CELLAR[i, 2]
    data_CELLAR[[i]] = dataStructure(prefix, conceptScheme, endpoint)
  }
  names(data_CELLAR) = list_data$CELLAR[, 1]
  ## FAO
  data_FAO = list()
  endpoint = "FAO"
  for (i in 1:nrow(list_data$FAO)) {
    prefix = list_data$FAO[i, 1]
    conceptScheme = list_data$FAO[i, 2]
    data_FAO[[i]] = dataStructure(prefix, conceptScheme, endpoint)
  }
  names(data_FAO) = list_data$FAO[, 1]
}
The aim of this function is to provide a table showing the different levels of hierarchy for each classification and the length of each level.
lengthsFile(endpoint, prefix, conceptScheme, correction = TRUE)
endpoint |
SPARQL endpoints provide a standardized way to access data sets,
making it easier to retrieve specific information or perform complex queries on linked data.
The valid values are |
prefix |
Prefixes are typically defined at the beginning of a SPARQL query and are used throughout the query to make it more concise and easier to read. Multiple prefixes can be defined in a single query to cover different namespaces used in the dataset. The function 'classEndpoint()' can be used to generate the prefixes for the selected correspondence table. |
conceptScheme |
Refers to a unique identifier associated with a specific classification table. The conceptScheme can be obtained by using the "classEndpoint()" function. |
correction |
The valid values are |
lengthsFile()
returns a table containing the lengths for each hierarchical level of the classification.
charb: contains the length for each code for each hierarchical level
chare: contains the concatenated length of charb for each code for each hierarchical level
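As an illustration of how such a table can be used, the sketch below splits a code into its per-level segments. It assumes, following the description of the lengthsFile argument of classificationQC(), that charb and chare hold the first and last character position of each level's segment; the table values, the code "A012" and the helper split_code() are hypothetical.

```r
# Hypothetical lengths table for a three-level classification:
# charb = first character position of the level's segment, chare = last
lengths_tbl <- data.frame(charb = c(1, 2, 4), chare = c(1, 3, 4))

# The number of hierarchical levels is implied by the number of rows
n_levels <- nrow(lengths_tbl)

# Split a code into one segment per hierarchical level
split_code <- function(code, lengths_tbl) {
  substring(code, lengths_tbl$charb, lengths_tbl$chare)
}

segments <- split_code("A012", lengths_tbl)
print(segments)
```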
{
  endpoint = "CELLAR"
  prefix = "nace2"
  conceptScheme = "nace2"
  lengthsTable = lengthsFile(endpoint, prefix, conceptScheme, correction = TRUE)
  # View lengthsTable
  View(lengthsTable)
}
Creation of a candidate correspondence table between two classifications, A and B, when there are correspondence tables leading from the first classification to the second one via $k$ intermediate pivot classifications $C_1, \ldots, C_k$.
The correspondence tables leading from A to B are A:$C_1$, {$C_i$:$C_{i+1}$ : $1 \le i \le k-1$}, B:$C_k$.
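The chaining idea can be sketched for a single pivot classification with a full outer join in base R. This is only an illustration under simplifying assumptions, not the package's algorithm; the codes and object names are hypothetical.

```r
# Toy correspondence tables A:C1 and B:C1 (hypothetical codes)
A_C1 <- data.frame(A = c("a1", "a2", "a3"), C1 = c("c1", "c1", "c2"),
                   stringsAsFactors = FALSE)
B_C1 <- data.frame(B = c("b1", "b2", "b3"), C1 = c("c1", "c2", "c9"),
                   stringsAsFactors = FALSE)

# Full outer join on the pivot: codes without a partner get NA
candidate <- merge(A_C1, B_C1, by = "C1", all = TRUE)
candidate <- candidate[, c("A", "C1", "B")]

print(candidate)
```

Pivot code "c9" has no counterpart in A:C1, so its row carries a code of B but no code of A.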
newCorrespondenceTable( Tables, CSVout = NULL, Reference = "none", MismatchTolerance = 0.2, Redundancy_trim = TRUE )
Tables |
A string of type character containing the name of a csv file which contains the names of the files that contain the classifications and the intermediate correspondence tables (see "Details" below). |
CSVout |
The preferred name for the output csv files that will contain the candidate correspondence table
and information about the classifications involved. The valid values are |
Reference |
The reference classification among A and B. If a classification is the reference to the other, and hence
hierarchically superior to it, each code of the other classification is expected to be mapped to at most one code
of the reference classification. The valid values are |
MismatchTolerance |
The maximum acceptable proportion of rows in the candidate correspondence table which contain
no code for classification A or no code for classification B. The default value is |
Redundancy_trim |
An argument in the function containing the logical values |
File and file name requirements:
The file that corresponds to argument Tables and the files to which the contents of Tables lead, must be in csv format with comma as delimiter. If full paths are not provided, then these files must be available in the working directory. No two filenames provided must be identical.
The file that corresponds to argument Tables must contain filenames, and nothing else, in a $(k+2) \times (k+2)$ table, where $k$, a positive integer, is the number of "pivot" classifications. The cells in the main diagonal of the table provide the filenames of the files which contain, with this order, the classifications A, $C_1$, $\ldots$, $C_k$ and B. The off-diagonal directly above the main diagonal contains the filenames of the files that contain, with this order, the correspondence tables A:$C_1$, {$C_i$:$C_{i+1}$, $1 \le i \le k-1$} and B:$C_k$. All other cells of the table must be empty.
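A minimal sketch of building such a table for one pivot classification is given below. The file names are hypothetical placeholders; only the layout (main diagonal and first off-diagonal above it) follows the requirements stated above.

```r
k <- 1  # number of pivot classifications

# Empty (k + 2) x (k + 2) table of file names
tables <- matrix("", nrow = k + 2, ncol = k + 2)

# Main diagonal: classifications A, C1 and B (hypothetical file names)
diag(tables) <- c("A.csv", "C1.csv", "B.csv")

# Off-diagonal directly above the main diagonal: A:C1 and B:C1
tables[cbind(1:(k + 1), 2:(k + 2))] <- c("A_C1.csv", "B_C1.csv")

# Write the table without headers, comma-delimited
tables_path <- file.path(tempdir(), "tables.csv")
write.table(tables, file = tables_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

print(tables)
```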
If any of the two files where the output will be stored is read protected (for instance because it is open elsewhere) an error message will be reported and execution will be halted.
Classification table requirements:
Each of the files that contain classifications must contain at least one column and at least two rows. The first column contains the codes of the respective classification. The first row contains column headers. The header of the first column is the name of the respective classification (e.g., "CN 2021").
The classification codes contained in a classification file (expected in its first column as mentioned above) must be unique. No two identical codes are allowed in the column.
If any of the files that contain classifications has additional columns the first one of them is assumed to contain the labels of the respective classification codes.
Correspondence table requirements:
The files that contain correspondence tables must contain at least two columns and at least two rows.
The first column of the file that contains A:$C_1$ contains the codes of classification A. The second column contains the codes of classification $C_1$. Similar requirements apply to the files that contain $C_i$:$C_{i+1}$, $1 \le i \le k-1$, and B:$C_k$. The first row of each of the files that contain correspondence tables contains column headers. The names of the first two columns are the names of the respective classifications.
The pairs of classification codes contained in a correspondence table file (expected in its first two columns as mentioned above) must be unique. No two identical pairs of codes are allowed in the first two columns.
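The uniqueness requirements for classification codes and for pairs of codes can be checked before calling the function. The following base R sketch uses hypothetical toy data and is not part of the package.

```r
# Toy classification file content: first column holds the codes
classification_A <- data.frame(
  "CN 2021" = c("0101", "0102", "0102"),
  Label     = c("Horses", "Bovine animals", "Bovine animals"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Toy correspondence table: first two columns hold the code pairs
corr_A_C1 <- data.frame(
  A  = c("a1", "a1", "a2"),
  C1 = c("c1", "c2", "c1"),
  stringsAsFactors = FALSE
)

# Codes in the first column must be unique
codes_unique <- !any(duplicated(classification_A[[1]]))

# Pairs of codes in the first two columns must be unique
pairs_unique <- !any(duplicated(corr_A_C1[, 1:2]))

print(c(codes_unique = codes_unique, pairs_unique = pairs_unique))
```

In this toy input the code "0102" appears twice, so the classification would not satisfy the requirement, while the correspondence table would.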
Interdependency requirements:
At least one code of classification A must appear in both the file of classification A and the file of correspondence table A:$C_1$.
At least one code of classification B must appear in both the file of classification B and the file of correspondence table B:$C_k$, where $k$, $k \ge 1$, is the number of pivot classifications.
If there is only one pivot classification, $C_1$, at least one code of it must appear in both the file of correspondence table A:$C_1$ and the file of correspondence table B:$C_1$.
If the pivot classifications are $k$ with $k \ge 2$ then at least one code of $C_1$ must appear in both the file of correspondence table A:$C_1$ and the file of correspondence table $C_1$:$C_2$, at least one code of each of the $C_i$, $i = 2, \ldots, k-1$ (if $k \ge 3$) must appear in both the file of correspondence table $C_{i-1}$:$C_i$ and the file of correspondence table $C_i$:$C_{i+1}$, and at least one code of $C_k$ must appear in both the file of correspondence table $C_{k-1}$:$C_k$ and the file of correspondence table B:$C_k$.
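For a single pivot classification, the interdependency requirements amount to checking that certain sets of codes overlap. A hedged base R sketch with hypothetical data and a hypothetical helper shares_code():

```r
# Toy inputs (hypothetical codes)
class_A   <- c("a1", "a2", "a3")
corr_A_C1 <- data.frame(A = c("a2", "a4"), C1 = c("c1", "c2"),
                        stringsAsFactors = FALSE)
corr_B_C1 <- data.frame(B = c("b1"), C1 = c("c2"),
                        stringsAsFactors = FALSE)

# TRUE when the two code vectors have at least one code in common
shares_code <- function(x, y) length(intersect(x, y)) > 0

# Classification A versus correspondence table A:C1
ok_A <- shares_code(class_A, corr_A_C1$A)

# Pivot C1 must link A:C1 with B:C1
ok_pivot <- shares_code(corr_A_C1$C1, corr_B_C1$C1)

print(c(ok_A = ok_A, ok_pivot = ok_pivot))
```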
Mismatch tolerance:
The ratio that is compared with MismatchTolerance has as numerator the number of rows in the candidate correspondence table which contain no code for classification A or no code for classification B and as denominator the total number of rows of this table. If the ratio exceeds MismatchTolerance the execution of the function is halted.
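The ratio can be reproduced on a toy candidate table as follows. This is a sketch only; treating both NA and empty strings as "no code" is an assumption made for the illustration.

```r
# Toy candidate table (hypothetical codes)
candidate <- data.frame(
  A = c("a1", "a2", NA,   "a3", ""),
  B = c("b1", NA,   "b2", "b3", "b4"),
  stringsAsFactors = FALSE
)
MismatchTolerance <- 0.2

# Assumption: a missing code is either NA or an empty string
is_missing <- function(x) is.na(x) | x == ""

# Rows with no code for A or no code for B
unmatched <- is_missing(candidate$A) | is_missing(candidate$B)

# Numerator: unmatched rows; denominator: all rows
ratio <- sum(unmatched) / nrow(candidate)
exceeds <- ratio > MismatchTolerance

print(c(ratio = ratio, exceeds = exceeds))
```

With three unmatched rows out of five the ratio is 0.6, which exceeds the tolerance of 0.2.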
If any of the conditions required from the arguments is violated an error message is produced and execution is stopped.
newCorrespondenceTable()
returns a list with two elements, both of which are data frames.
The first element is the candidate correspondence table A:B, including the codes of all "pivot" classifications, augmented with flags "Review" (if applicable), "Redundancy", "Unmatched", "NoMatchFromA", "NoMatchFromB" and with all the additional columns of the classification and intermediate correspondence table files.
The second element contains the names of classification A, the "pivot" classifications and classification B as read from the top left-hand side cell of the respective input files.
If the value of argument CSVout is a string of type character, the elements of the list are exported into files of csv format. The name of the file for the first element is the value of argument CSVout and the name of the file for the second element is classificationNames_CSVout. For example, if CSVout = "newCorrespondenceTable.csv", the elements of the list are exported into "newCorrespondenceTable.csv" and "classificationNames_newCorrespondenceTable.csv" respectively.
The "Review" flag is produced only if argument Reference has been set equal to "A" or "B". For each row of the candidate correspondence table, if Reference = "A" the value of "Review" is equal to 1 if the code of B maps to more than one code of A, and 0 otherwise. If Reference = "B" the value of "Review" is equal to 1 if the code of A maps to more than one code of B, and 0 otherwise. The value of the flag is empty if the row does not contain a code of A or a code of B.
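For Reference = "A", the rule can be sketched in base R as follows. The toy table is hypothetical, and NA stands in for an empty flag; the package may compute the flag differently.

```r
# Toy candidate table (hypothetical codes); Reference = "A"
candidate <- data.frame(
  A = c("a1", "a2", "a3", NA),
  B = c("b1", "b1", "b2", "b3"),
  stringsAsFactors = FALSE
)

# Number of distinct codes of A that each code of B maps to
n_A_per_B <- tapply(candidate$A, candidate$B,
                    function(x) length(unique(x[!is.na(x)])))

# Review = 1 when the code of B maps to more than one code of A
review <- as.integer(n_A_per_B[candidate$B] > 1)

# The flag is left empty (NA here) when A or B is missing in the row
review[is.na(candidate$A) | is.na(candidate$B)] <- NA

print(cbind(candidate, Review = review))
```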
For each row of the candidate correspondence table, the value of "Redundancy" is equal to 1 if the row contains a combination of codes of A and B that also appears in at least one other row of the candidate correspondence table.
When "Redundancy_trim" is equal to FALSE the "Redundancy_keep" flag is created to identify with value 1 the records that will be kept if trimming is performed.
For each row of the candidate correspondence table, the value of "Unmatched" is equal to 1 if the row contains a code of A but no code of B or if it contains a code of B but no code of A. The value of the flag is 0 if the row contains codes for both A and B.
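The "Redundancy" and "Unmatched" rules can be illustrated on a toy table. The sketch below is hypothetical and uses NA for a missing code; it is not taken from the package source.

```r
# Toy candidate table with one pivot classification (hypothetical codes)
candidate <- data.frame(
  A  = c("a1", "a1", "a2", "a3", NA),
  C1 = c("c1", "c2", "c3", NA,   "c4"),
  B  = c("b1", "b1", "b2", NA,   "b3"),
  stringsAsFactors = FALSE
)

ab <- candidate[, c("A", "B")]
has_both <- !is.na(ab$A) & !is.na(ab$B)

# Redundancy: the same (A, B) combination appears in another row
dup <- duplicated(ab) | duplicated(ab, fromLast = TRUE)
candidate$Redundancy <- as.integer(has_both & dup)

# Unmatched: exactly one of the two codes is missing
candidate$Unmatched <- as.integer(xor(is.na(ab$A), is.na(ab$B)))

print(candidate)
```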
For each row of the candidate correspondence table, the value of "NoMatchFromA" is equal to 1 if the row contains a code of A that appears in the table of classification A but not in correspondence table A:$C_1$. The value of the flag is 0 if the row contains a code of A that appears in both the table of classification A and correspondence table A:$C_1$. Finally, the value of the flag is empty if the row contains no code of A or if it contains a code of A that appears in correspondence table A:$C_1$ but not in the table of classification A.
For each row of the candidate correspondence table, the value of "NoMatchFromB" is equal to 1 if the row contains a code of B that appears in the table of classification B but not in correspondence table B:$C_k$. The value of the flag is 0 if the row contains a code of B that appears in both the table of classification B and correspondence table B:$C_k$. Finally, the value of the flag is empty if the row contains no code of B or if it contains a code of B that appears in correspondence table B:$C_k$ but not in the table of classification B.
The argument "Redundancy_trim" is used to delete all the redundancies which are mapping correctly. The valid logical values for this argument are TRUE or FALSE.
If the selected value is TRUE, all redundant records are removed and exactly one record is kept for each unique combination. For this retained record, the codes, the labels and the supplementary information of the pivot classifications are replaced with 'multiple'. If the multiple pieces of information of the pivot classifications are the same, their value will not be replaced.
If the selected value is FALSE, no trimming is executed, so redundant records are shown together with the redundancy flag.
If the logical value is missing, the execution of the function will stop.
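The trimming rule for Redundancy_trim = TRUE can be sketched for one pivot classification as follows. The data and object names are hypothetical and the code only mirrors the description above, not the package's implementation.

```r
# Toy candidate table with redundant (A, B) combinations
candidate <- data.frame(
  A  = c("a1", "a1", "a2", "a2"),
  C1 = c("c1", "c2", "c3", "c3"),
  B  = c("b1", "b1", "b2", "b2"),
  stringsAsFactors = FALSE
)

# One group of rows per unique (A, B) combination
groups <- split(candidate, paste(candidate$A, candidate$B))

# Keep one record per group; mark the pivot as 'multiple'
# only when the pivot codes within the group differ
trimmed <- do.call(rbind, lapply(groups, function(g) {
  out <- g[1, ]
  if (length(unique(g$C1)) > 1) out$C1 <- "multiple"
  out
}))
rownames(trimmed) <- NULL

print(trimmed)
```

The pair (a1, b1) is reached through two different pivot codes, so its pivot is replaced with 'multiple'; the pair (a2, b2) is reached through the same pivot code twice, so "c3" is kept.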
Running browseVignettes("correspondenceTables") in the console opens an html page in the user's default browser. Selecting HTML from the menu, users can read information about the use of the sample datasets that are included in the package.
If they wish to access the csv files with the sample data, users have two options:
Option 1: Unpack into any folder of their choice the tar.gz file into which the package has arrived. All sample datasets may be found in the "inst/extdata" subfolder of this folder.
Option 2: Go to the "extdata" subfolder of the folder in which the package has been installed in their PC's R library. All sample datasets may be found there.
{
  ## Application of function newCorrespondenceTable() with "example.csv" being the file
  ## that includes the names of the files and the intermediate tables in a sparse square
  ## matrix containing the 100 rows of the classifications (from ISIC v4 to CPA v2.1 through
  ## CPC v2.1). The desired name for the csv file that will contain the candidate
  ## correspondence table is "newCorrespondenceTable.csv", the reference classification is
  ## ISIC v4 ("A") and the maximum acceptable proportion of unmatched codes between
  ## ISIC v4 and CPC v2.1 is 0.56 (this is the minimum mismatch tolerance for the first 100 rows
  ## as 55.5% of the codes of ISIC v4 are unmatched).
  tmp_dir <- tempdir()
  A <- read.csv(system.file("extdata", "example.csv", package = "correspondenceTables"), header = FALSE, sep = ",")
  ## Replace each file name with its full path inside the installed package
  for (i in 1:nrow(A)) {
    for (j in 1:ncol(A)) {
      if (A[i, j] != "") {
        A[i, j] <- system.file("extdata", A[i, j], package = "correspondenceTables")
      }
    }
  }
  write.table(x = A, file = file.path(tmp_dir, "example.csv"), row.names = FALSE, col.names = FALSE, sep = ",")
  NCT <- newCorrespondenceTable(file.path(tmp_dir, "example.csv"), file.path(tmp_dir, "newCorrespondenceTable.csv"), "A", 0.56, FALSE)
  summary(NCT)
  head(NCT$newCorrespondenceTable)
  NCT$classificationNames
  ## full.names = TRUE so that unlink() removes the files in tmp_dir
  csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
  unlink(csv_files)
}
Create a list of prefixes to be used when defining the SPARQL query to retrieve the tables.
prefixList(endpoint)
endpoint |
A string of type character containing the endpoint where the table is stored.
The valid values are |
prefixList()
returns a list of prefixes to be used when defining the SPARQL query.
{
  endpoint = "CELLAR"
  prefix_list = prefixList(endpoint)
}
Retrieve a classification table from CELLAR or FAO.
retrieveClassificationTable( prefix, endpoint, conceptScheme, level = "ALL", language = "en", CSVout = FALSE )
prefix |
The SPARQL instruction for a declaration of a namespace prefix. It can be found using the classEndpoint() function. |
endpoint |
The SPARQL Endpoint, the valid values are |
conceptScheme |
Taxonomy of the SKOS object to be retrieved. It can be found using the classEndpoint() function. |
level |
The levels of the objects in the collection to be retrieved, it can be found using the structureData() function.
By default is set to |
language |
Language of the table. By default is set to |
CSVout |
The valid values are |
retrieveClassificationTable()
returns a classification table from CELLAR or FAO. The table includes the following variables:
Classification name (e.g. nace2): the code of each object
NAME: the corresponding name of each object
Include: details on each object
Include_Also: details on each object
Exclude: details on each object
URL: the URL from which the SPARQL query was retrieved
{
  prefix = "nace2"
  endpoint = "CELLAR"
  conceptScheme = "nace2"
  # By default, all levels are retrieved, in English only
  dt = retrieveClassificationTable(prefix, endpoint, conceptScheme)
  head(dt)
}
Retrieve a correspondence table from CELLAR or FAO.
retrieveCorrespondenceTable( prefix, endpoint, ID_table, language = "en", CSVout = FALSE )
prefix |
The SPARQL instruction for a declaration of a namespace prefix. It can be found using the classEndpoint() function. |
endpoint |
The SPARQL Endpoint, the valid values are |
ID_table |
The ID of the correspondence table which can be found using the correspondenceList() function. |
language |
Language of the table. By default is set to "en". This is an optional argument. |
CSVout |
The valid values are |
retrieveCorrespondenceTable() returns a correspondence table from CELLAR or FAO. The table includes the following variables:
Source Classification name (e.g. cn2019): the code of each object in the source classification
Source Classification label: the corresponding label of each object
Target Classification name (e.g. cn2021): the code of each object in the target classification
Target Classification label: the corresponding label of each object
Comment: details on each object, if available
URL: the URL from which the SPARQL query was retrieved
{
endpoint = "CELLAR"
prefix = "nace2"
ID_table = "NACE2_PRODCOM2021"
language = "fr"
dt = retrieveCorrespondenceTable(prefix, endpoint, ID_table, language)
head(dt)
}
Obtain the structure of the classification tables from CELLAR and FAO.
structureData(prefix, conceptScheme, endpoint, language = "en")
prefix |
The SPARQL instruction for a declaration of a namespace prefix. It can be found using the classEndpoint() function. |
conceptScheme |
Taxonomy of the SKOS object to be retrieved. It can be found using the classEndpoint() function. |
endpoint |
The SPARQL Endpoint |
language |
Language of the table. By default is set to |
structureData() returns the structure of a classification table from CELLAR or FAO in the form of a table with the following columns:
Concept_Scheme: taxonomy of the SKOS object to be retrieved
Level: the levels of the objects in the collection
Depth: the position of each level in the hierarchy
Count: the number of objects retrieved in each level
{
endpoint = "CELLAR"
prefix = "nace2"
conceptScheme = "nace2"
language = "en"
structure_dt = structureData(prefix, conceptScheme, endpoint, language)
}
Update the correspondence table between statistical classifications A and B when A has been updated to version A*.
updateCorrespondenceTable( A, B, AStar, AB, AAStar, CSVout = NULL, Reference = "none", MismatchToleranceB = 0.2, MismatchToleranceAStar = 0.2, Redundancy_trim = TRUE )
A |
A string of the type |
B |
A string of the type |
AStar |
A string of the type |
AB |
A string of the type |
AAStar |
A string of the type character containing the name of a csv file that contains the concordance table A:A*, which contains the mapping between the codes of the two versions of the classification. |
CSVout |
The preferred name for the output csv files that will contain the updated correspondence table and
information about the classifications involved. The valid values are |
Reference |
The reference classification among A and B. If a classification is the reference to the other, and
hence hierarchically superior to it, each code of the other classification is expected to be mapped to at most one
code of the reference classification. The valid values are |
MismatchToleranceB |
The maximum acceptable proportion of rows in the updated correspondence table which contain no
code of the target classification B, among those which contain a code of A, of A*, or of both. The default value
is |
MismatchToleranceAStar |
The maximum acceptable proportion of rows in the updated correspondence table which contain
no code of the updated classification A*, among those which contain a code of A, of B, or of both. The default value
is |
Redundancy_trim |
An argument used to facilitate the trimming of the redundant records. The valid logical values are |
File and file name requirements:
The files that correspond to arguments A, B, AStar, AB and AAStar must be in csv format with comma as delimiter. If full paths are not provided, then these files must be available in the working directory. No two of the provided filenames may be identical.
If either of the two files where the output will be stored is read protected (for instance because it is open elsewhere), an error message is reported and execution is halted.
Classification table requirements:
The files that correspond to arguments A, B and AStar must contain at least one column and at least two rows. The first column contains the codes of the respective classification. The first row contains column headers. The name of the first column is the name of the respective classification (e.g., "CN 2021").
The classification codes contained in a classification file (expected in its first column as mentioned above) must be unique. No two identical codes are allowed in the column.
If any of the files that correspond to arguments A, B and AStar has additional columns, the first one of them is considered as containing the labels of the respective classification codes.
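The classification table requirements above can be checked before the function is called. The following is a minimal sketch in base R, not the package's own validation code; the helper name check_classification() and the sample file contents are hypothetical.

```r
# Hypothetical helper: checks the documented requirements for a
# classification file (arguments A, B and AStar).
check_classification <- function(path) {
  # Read everything as character so that codes such as "0101" keep leading zeros
  tab <- read.csv(path, colClasses = "character", check.names = FALSE)
  # read.csv() consumes the header row, so "at least two rows" in the file
  # means at least one data row here
  stopifnot(ncol(tab) >= 1, nrow(tab) >= 1)
  codes <- tab[[1]]
  dup <- unique(codes[duplicated(codes)])
  if (length(dup) > 0) {
    stop("Duplicated codes: ", paste(dup, collapse = ", "))
  }
  list(
    name = names(tab)[1],                            # classification name
    codes = codes,                                   # first column
    labels = if (ncol(tab) >= 2) tab[[2]] else NULL  # first additional column
  )
}

# Usage with a small temporary file (made-up content)
f <- tempfile(fileext = ".csv")
writeLines(c('"CN 2021","Label"', '"0101","Horses"', '"0102","Cattle"'), f)
cl <- check_classification(f)
cl$name
```

Reading the codes as character is a design choice of this sketch: numeric parsing would silently turn "0101" into 101.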
Correspondence and concordance table requirements:
The files that correspond to arguments AB and AAStar must contain at least two columns and at least two rows. The first column of the file that corresponds to AB contains the codes of classification A. The second column contains the codes of classification B. Similar requirements apply to the file that corresponds to AAStar.
The first row of each of these files contains column headers. The names of the first two columns are the names of the respective classifications.
The pairs of classification codes contained in the concordance and the correspondence table files (expected in their first two columns as mentioned above) must be unique. No two identical pairs of codes are allowed in the first two columns.
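The uniqueness requirement on code pairs can be inspected with duplicated(). This is a minimal sketch on a made-up table; the column names and codes are illustrative and do not come from the package's sample data.

```r
# Illustrative correspondence table A:B; the first two columns carry the codes
AB <- data.frame(
  A = c("a1", "a1", "a2", "a1"),
  B = c("b1", "b2", "b3", "b1"),
  stringsAsFactors = FALSE
)

# duplicated() on the first two columns marks every repeat of a code pair
dup_rows <- duplicated(AB[, 1:2])
AB[dup_rows, ]

# Dropping the repeats leaves a table that satisfies the requirement
AB_unique <- AB[!dup_rows, ]
nrow(AB_unique)
```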
Interdependency requirements:
At least one code of classification A must appear in both the file of concordance table A:A* and the file of correspondence table A:B.
At least one code of classification A* must appear in both the file of classification A* and the file of concordance table A:A*.
At least one code of classification B must appear in both the file of classification B and the file of correspondence table A:B.
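Each interdependency requirement amounts to a non-empty intersection between two sets of codes. A minimal sketch with made-up code vectors (all object names are hypothetical):

```r
# Illustrative code vectors (not real classification codes)
A_in_AAStar     <- c("a1", "a2", "a3")  # column 1 of concordance table A:A*
AStar_in_AAStar <- c("s1", "s2", "s3")  # column 2 of concordance table A:A*
A_in_AB         <- c("a2", "a3", "a4")  # column 1 of correspondence table A:B
B_in_AB         <- c("b1", "b2", "b2")  # column 2 of correspondence table A:B
AStar_codes     <- c("s1", "s2", "s9")  # column 1 of classification A*
B_codes         <- c("b1", "b2", "b3")  # column 1 of classification B

# One logical per requirement, in the order listed above
checks <- c(
  A_shared     = length(intersect(A_in_AAStar, A_in_AB)) > 0,
  AStar_shared = length(intersect(AStar_codes, AStar_in_AAStar)) > 0,
  B_shared     = length(intersect(B_codes, B_in_AB)) > 0
)
checks
```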
Mismatch tolerance:
The ratio that is compared with MismatchToleranceB has as numerator the number of rows of the updated correspondence table which contain a code for A, for A*, or for both, but no code for B and as denominator the number of rows which contain a code for A, for A*, or for both (regardless of whether there is a code for B or not). If the ratio exceeds MismatchToleranceB the execution of the function is halted.
The ratio that is compared with MismatchToleranceAStar has as numerator the number of rows of the updated correspondence table which contain a code for A, for B, or for both, but no code for A* and as denominator the number of rows which contain a code for A, for B, or for both (regardless of whether there is a code for A* or not). If the ratio exceeds MismatchToleranceAStar the execution of the function is halted.
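The two ratios can be reproduced by hand on a small table. The sketch below assumes that a missing code is stored as NA, which may differ from the package's internal representation; the helper mismatch_ratio() and the data are hypothetical.

```r
# Illustrative updated correspondence table; NA marks a missing code
upd <- data.frame(
  A     = c("a1", "a2", "a3", NA,   NA),
  AStar = c("s1", "s2", NA,   "s4", NA),
  B     = c("b1", NA,   "b3", NA,   "b5"),
  stringsAsFactors = FALSE
)

# Share of rows with a source code that lack the target code
mismatch_ratio <- function(has_source, has_target) {
  sum(has_source & !has_target) / sum(has_source)
}

has_A     <- !is.na(upd$A)
has_AStar <- !is.na(upd$AStar)
has_B     <- !is.na(upd$B)

ratio_B     <- mismatch_ratio(has_A | has_AStar, has_B)  # vs MismatchToleranceB
ratio_AStar <- mismatch_ratio(has_A | has_B, has_AStar)  # vs MismatchToleranceAStar

c(ratio_B = ratio_B, ratio_AStar = ratio_AStar)  # both 0.5 for this table
```

With the default tolerances of 0.2 both ratios would exceed the limit, which is the situation in which the function halts.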
updateCorrespondenceTable() returns a list with two elements, both of which are data frames.
The first element is the updated correspondence table A*:B augmented with flags "CodeChange", "Review" (if applicable), "Redundancy", "NoMatchToAStar", "NoMatchToB", "NoMatchFromAStar", "NoMatchFromB", "LabelChange", and with all the additional columns of the A, B, AStar, AB and AAStar files.
The second element contains the names of the original classification A, the target classification B, and the updated version A*, as read from the top left-hand side cell of the respective input files.
If the value of argument CSVout is a string of type character, the elements of the list are exported into files of csv format. The name of the file for the first element is the value of argument CSVout and the name of the file for the second element is classificationNames_CSVout. For example, if CSVout = "updateCorrespondenceTable.csv", the elements of the list are exported into "updateCorrespondenceTable.csv" and "classificationNames_updateCorrespondenceTable.csv", respectively.
For each row of the updated correspondence table, the value of "CodeChange" is equal to 1 if the code of A (or A*) contained in this row maps (in this or any other row of the table) to a different code of A* (or A), otherwise the "CodeChange" is equal to 0. The value of "CodeChange" is empty if either the code of A, or the code of A*, or both are missing.
The "Review" flag is produced only if argument Reference has been set equal to "A" or "B".
For each row of the updated correspondence table, if Reference = "A" the value of "Review" is equal to 1 if the code of B maps to more than one code of A*, and 0 otherwise. If Reference = "B" the value of "Review" is equal to 1 if the code of A* maps to more than one code of B, and 0 otherwise. The value of the flag is empty if either the code of A*, or the code of B, or both are missing.
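The "Review" rule counts, for each code of the non-reference classification, how many distinct codes of the reference classification it maps to. The sketch below is one way to reproduce the rule in base R, not the package's implementation; the helpers n_partners() and review_flag() are hypothetical and missing codes are assumed to be NA.

```r
# Illustrative A*:B pairs; NA marks a missing code
upd <- data.frame(
  AStar = c("s1", "s1", "s2", "s3", NA),
  B     = c("b1", "b2", "b2", "b3", "b4"),
  stringsAsFactors = FALSE
)

# Number of distinct partner codes for the key code of each row
n_partners <- function(key, partner) {
  ok <- !is.na(key) & !is.na(partner)
  counts <- vapply(split(partner[ok], key[ok]),
                   function(x) length(unique(x)), integer(1))
  unname(counts[key])  # NA where the key is missing or has no partner
}

review_flag <- function(upd, Reference) {
  n <- if (Reference == "B") {
    n_partners(upd$AStar, upd$B)   # does a code of A* map to several codes of B?
  } else {
    n_partners(upd$B, upd$AStar)   # does a code of B map to several codes of A*?
  }
  flag <- as.integer(n > 1)
  flag[is.na(upd$AStar) | is.na(upd$B)] <- NA  # empty when either code is missing
  flag
}

review_flag(upd, "B")  # 1 1 0 0 NA
review_flag(upd, "A")  # 0 1 1 0 NA
```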
For each row of the updated correspondence table, the value of "Redundancy" is equal to 1 if the row contains a combination of codes of A* and B that also appears in at least one other row of the updated correspondence table. The value of the flag is empty if both the code of A* and the code of B are missing.
When "Redundancy_trim" is equal to FALSE, the "Redundancy_keep" flag is created to identify with value 1 the records that will be kept if trimming is performed.
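A repeated A*:B combination can be detected by looking for duplicates from both ends of the table. This is a minimal sketch on made-up data, assuming that missing codes are NA and that the flag is 0 for combinations that appear only once.

```r
upd <- data.frame(
  A     = c("a1", "a2", "a3", "a4", "a5"),
  AStar = c("s1", "s1", "s2", NA,   NA),
  B     = c("b1", "b1", "b2", "b3", NA),
  stringsAsFactors = FALSE
)

pairs <- upd[c("AStar", "B")]
# A row is redundant if its A*:B pair also occurs in at least one other row
redundancy <- as.integer(duplicated(pairs) | duplicated(pairs, fromLast = TRUE))
# The flag is empty when both codes are missing
redundancy[is.na(upd$AStar) & is.na(upd$B)] <- NA

redundancy  # 1 1 0 0 NA
```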
For each row of the updated correspondence table, the value of "NoMatchToAStar" is equal to 1 if there is a code for A, for B, or for both, but no code for A*. The value of the flag is 0 if there are codes for both A and A* (regardless of whether there is a code for B or not). Finally, the value of "NoMatchToAStar" is empty if neither A nor B have a code in this row.
For each row of the updated correspondence table, the value of "NoMatchToB" is equal to 1 if there is a code for A, for A*, or for both, but no code for B. The value of the flag is 0 if there are codes for both A and B (regardless of whether there is a code for A* or not). Finally, the value of "NoMatchToB" is empty if neither A nor A* have a code in this row.
For each row of the updated correspondence table, the value of "NoMatchFromAStar" is equal to 1 if the row contains a code of A* that appears in the table of classification A* but not in the concordance table A:A*. The value of the flag is 0 if the row contains a code of A* that appears in both the table of classification A* and the concordance table A:A*. Finally, the value of the flag is empty if the row contains no code of A* or if it contains a code of A* that appears in the concordance table A:A* but not in the table of classification A*.
For each row of the updated correspondence table, the value of "NoMatchFromB" is equal to 1 if the row contains a code of B that appears in the table of classification B but not in the correspondence table A:B. The value of the flag is 0 if the row contains a code of B that appears in both the table of classification B and the correspondence table A:B. Finally, the value of the flag is empty if the row contains no code of B or if it contains a code of B that appears in the correspondence table A:B but not in the table of classification B.
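"NoMatchFromAStar" and "NoMatchFromB" follow the same three-way rule, so one helper can illustrate both. The sketch below is not the package's code; no_match_from() and the code vectors are hypothetical, and missing codes are assumed to be NA.

```r
# 1: code is in the classification file but not in the linking table
# 0: code is in both
# NA: no code in the row, or code only in the linking table
no_match_from <- function(row_codes, classification_codes, table_codes) {
  in_class <- row_codes %in% classification_codes
  in_table <- row_codes %in% table_codes
  ifelse(in_class & !in_table, 1L,
         ifelse(in_class & in_table, 0L, NA_integer_))
}

B_codes <- c("b1", "b2", "b3")    # codes in the file of classification B
B_in_AB <- c("b1", "b2", "b9")    # codes of B in correspondence table A:B
row_B   <- c("b1", "b3", "b9", NA)  # codes of B in the updated table

no_match_from(row_B, B_codes, B_in_AB)  # 0 1 NA NA
```

For "NoMatchFromAStar" the same call would take the codes of A* in the updated table, the file of classification A* and the concordance table A:A*.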
For each row of the updated correspondence table, the value of "LabelChange" is equal to 1 if the labels of the codes of A and A* are different, and 0 if they are the same. Finally, the value of "LabelChange" is empty if either of the labels, or both labels, are missing. Lower and upper case are considered the same, and punctuation characters are ignored when comparing code labels.
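The label comparison can be approximated by normalising both labels before testing for equality. This is a minimal sketch, assuming that missing labels are NA or empty strings; the helpers normalise_label() and label_change() and the labels are hypothetical, and the package may normalise labels differently (for example with respect to whitespace).

```r
# Lower-case the label and strip punctuation characters
normalise_label <- function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

label_change <- function(label_A, label_AStar) {
  missing <- is.na(label_A) | is.na(label_AStar) |
    label_A == "" | label_AStar == ""
  flag <- as.integer(normalise_label(label_A) != normalise_label(label_AStar))
  flag[missing] <- NA  # empty when either label is missing
  flag
}

label_change(
  c("Wheat, durum", "Rice",         NA),
  c("wheat durum",  "Rice (paddy)", "Maize")
)  # 0 1 NA
```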
The argument "Redundancy_trim" is used to delete all the redundancies which are mapping correctly. If the analysis concludes that the A*code / Bcode mapping is correct for all cases involving redundancies, then an action is needed to remove the redundancies. If the selected value is TRUE, all redundant records are removed and only one record is kept for each unique combination. For the retained record, the Acodes, the Alabel and the Asupp information are replaced with 'multiple'. If the multiple A records are the same, their value will not be replaced. If the selected value is FALSE, no trimming is executed, so redundant records are shown together with the redundancy flag.
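The effect of trimming can be illustrated on a small table. The sketch below only handles the A code column and skips rows with a missing A* or B code; trim_redundancy() is a hypothetical helper and not the package's implementation.

```r
upd <- data.frame(
  A     = c("a1", "a2", "a3", "a3"),
  AStar = c("s1", "s1", "s2", "s2"),
  B     = c("b1", "b1", "b2", "b2"),
  stringsAsFactors = FALSE
)

trim_redundancy <- function(upd) {
  # One group per unique A*:B combination
  groups <- split(upd, list(upd$AStar, upd$B), drop = TRUE)
  trimmed <- lapply(groups, function(g) {
    keep <- g[1, , drop = FALSE]
    # Replace the A code only when the group mixes different A codes
    if (length(unique(g$A)) > 1) keep$A <- "multiple"
    keep
  })
  out <- do.call(rbind, trimmed)
  rownames(out) <- NULL
  out
}

trimmed <- trim_redundancy(upd)
trimmed
```

Here the pair s1:b1 is kept once with A set to 'multiple', while s2:b2 keeps its A code a3 because both redundant records carry the same value.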
Running browseVignettes("correspondenceTables") in the console opens an html page in the user's default browser. Selecting HTML from the menu, users can read information about the use of the sample datasets that are included in the package.
If they wish to access the csv files with the sample data, users have two options:
Option 1: Unpack the tar.gz file in which the package was delivered into any folder of their choice. All sample datasets may be found in the "inst/extdata" subfolder of this folder.
Option 2: Go to the "extdata" subfolder of the folder in which the package has been installed in their PC's R library. All sample datasets may be found there.
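For Option 2, the installation folder does not have to be located by hand: base R's system.file() resolves it. A minimal sketch, which only assumes that the sample data are csv files in "extdata":

```r
# system.file() returns "" when the package or the folder cannot be found
ext_dir <- system.file("extdata", package = "correspondenceTables")

if (nzchar(ext_dir)) {
  sample_files <- list.files(ext_dir, pattern = "\\.csv$", full.names = TRUE)
  print(basename(sample_files))
} else {
  message("Package 'correspondenceTables' was not found in the R library.")
}
```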
{
## Application of function updateCorrespondenceTable() with NAICS 2017 being the
## original classification A, NACE being the target classification B, NAICS 2022
## being the updated version A*, NAICS 2017:NACE being the previous correspondence
## table A:B, and NAICS 2017:NAICS 2022 being the A:A* concordance table. The desired
## name for the csv file that will contain the updated correspondence table is
## "updateCorrespondenceTable.csv", there is no reference classification, and the
## maximum acceptable proportions of unmatched codes between the original
## classification A and the target classification B, and between the original
## classification A and the updated classification A* are 0.5 and 0.3, respectively.
tmp_dir <- tempdir()
A <- system.file("extdata", "NAICS2017.csv", package = "correspondenceTables")
AStar <- system.file("extdata", "NAICS2022.csv", package = "correspondenceTables")
B <- system.file("extdata", "NACE.csv", package = "correspondenceTables")
AB <- system.file("extdata", "NAICS2017_NACE.csv", package = "correspondenceTables")
AAStar <- system.file("extdata", "NAICS2017_NAICS2022.csv", package = "correspondenceTables")
UPC <- updateCorrespondenceTable(A, B, AStar, AB, AAStar, file.path(tmp_dir, "updateCorrespondenceTable.csv"), "none", 0.5, 0.3, FALSE)
summary(UPC)
head(UPC$updateCorrespondenceTable)
UPC$classificationNames
## Remove the csv files written to the temporary folder; full.names = TRUE is
## needed so that unlink() receives paths inside tmp_dir
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
if (length(csv_files) > 0) unlink(csv_files)
}