ArcticStat-Related Social Indicator Database

Overview

The Arctic Observation Network Social Indicators Project (AON-SIP, found at www.search-hd.net) is funded by a grant from the U.S. National Science Foundation (NSF OPP0638408). AON-SIP is intended to contribute to the development of the Arctic Observation Network and to the science goals of SEARCH in two ways: (1) develop and make available to the science community relevant datasets; and (2) identify gaps in the existing observation system and recommend appropriate actions to fill those gaps. Social outcomes resulting from the interaction of humans with climate change and other forces for change, referred to as social indicators, constitute one of five components of the suite of AON-SIP databases. The other four components are subsistence, tourism, mining, and commercial fishing.

Under the leadership of Gérard Duhaime and the direction of Andrée Caron, both at Laval University in Quebec, ArcticStat was developed to make social outcome data more accessible. ArcticStat (http://www.arcticstat.org/index.htm) is described as follows:

"ArcticStat is a permanent, public and independent statistical database dealing with the countries, regions and populations of the Circumpolar Arctic. ArcticStat was born out of the desire to facilitate comparative research on the socioeconomic conditions of the peoples of the Arctic by bringing together already existing data which are dispersed and often hard to find." (http://www.arcticstat.org/about.aspx)

One objective of AON-SIP has been to integrate the social outcome data accessible through ArcticStat. Each national statistics agency contributing to ArcticStat chooses its own social outcome variables and designs its own presentation format. ArcticStat indexes these tables according to indicator domains: dwellings, education, health/social services, households/families, labor force, personal/household income, population, regional accounts, and vital statistics. Within each of these domains, the indexed tables often differ in variables and table organization. Originally, the tables were available principally in PDF format; ArcticStat now offers a choice of PDF and Excel formats. Even in the Excel format, however, the tables developed by the contributing national statistics agencies are designed principally for printing.

Database Integration Process

To integrate the ArcticStat social outcome data, the AON-SIP team first downloaded comprehensive indexes of ArcticStat tables by domain. Based on a review of these indexes, the team downloaded 2,239 PDF files and used data translation software (ABBYY) to convert them to Excel files. This process, together with Excel files downloaded directly from ArcticStat, yielded a total of 2,410 files. In cases where the ABBYY process proved particularly complicated, the team also saved the ABBYY files; there are 940 saved ABBYY files in total.

The 2,410 Excel files extracted from ArcticStat retained the original print format. The next step was to restructure the Excel data so that each row is a unique combination of geography (region, state or province, nation) and year, and each column is a variable defined in common across all data sources. The team identified the subset of “raw” Excel files that contain fundamental social outcome variables for multiple years. This process resulted in 814 Excel SPSS-ready files.
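The restructuring step can be illustrated with a minimal Python sketch. It is not the team's actual procedure, which was carried out in Excel; the table layout, the region names, the values, and the variable name `pop_total` are all hypothetical and chosen only to show the target shape of one record per location/year combination.

```python
# Hypothetical print-style table: one row per region, one column per year.
# All names and values are invented for illustration.
print_table = {
    "header": ["Region", "2000", "2001", "2002"],
    "rows": [
        ["Region A", "1,200", "1,250", "1,300"],
        ["Region B", "800", "790", ""],  # blank cell: no value published
    ],
}


def to_location_year_records(table, variable):
    """Turn a wide region-by-year table into one record per (location, year)."""
    years = [int(label) for label in table["header"][1:]]
    records = {}
    for row in table["rows"]:
        location, cells = row[0], row[1:]
        for year, cell in zip(years, cells):
            text = cell.replace(",", "").strip()
            value = float(text) if text else None  # None marks missing data
            records[(location, year)] = {variable: value}
    return records


records = to_location_year_records(print_table, "pop_total")
for key in sorted(records):
    print(key, records[key])
```

Keying each record by (location, year) means that tables from different sources can later be combined on that key, with each source contributing its own variables.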

The AON-SIP team then was able to shift work from Excel to the statistical program, Statistical Package for the Social Sciences (SPSS). A key advantage of SPSS is that all data manipulations can be written as syntax that can be edited and replicated. The team used the following social outcome domain definitions: population, housing, employment, income, education, health, accounts. The 814 SPSS-ready files yielded 1,013 SPSS work files across the seven domains. The number of files increased because many “raw” Excel tables contained variables from two or more domains. The 1,013 SPSS files were integrated in the final step into seven files, one for each domain. Finally, the seven domain-specific files were integrated into one social outcomes SPSS file. Table 1 summarizes the number of lines of SPSS syntax, the number of SPSS variables, and the number of unique location/year record combinations.
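As a rough illustration of the final integration step, the sketch below combines per-domain records into a single set keyed by location and year. It is written in Python rather than SPSS syntax, and the function name, variable names, and values are hypothetical. It assumes that a location/year combination present in only some domains is still kept, which would be consistent with the integrated file in Table 1 holding more records than any single domain file.

```python
def merge_domain_files(*domain_files):
    """Combine per-domain records into one record per (location, year).

    Each input maps (location, year) -> {variable: value}. A key missing
    from a domain contributes no variables, as in an outer join.
    """
    merged = {}
    for domain in domain_files:
        for key, variables in domain.items():
            target = merged.setdefault(key, {})
            overlap = target.keys() & variables.keys()
            if overlap:
                # The same variable arriving from two domains signals a naming clash.
                raise ValueError(f"duplicate variables for {key}: {sorted(overlap)}")
            target.update(variables)
    return merged


# Hypothetical records; variable names and values are invented.
population = {
    ("Region A", 2000): {"pop_total": 1200.0},
    ("Region B", 2000): {"pop_total": 800.0},
}
housing = {
    ("Region A", 2000): {"dwellings": 410.0},
    ("Region C", 2000): {"dwellings": 95.0},
}

integrated = merge_domain_files(population, housing)
for key in sorted(integrated):
    print(key, integrated[key])
```

Raising an error on overlapping variable names is a design choice of this sketch; it surfaces codebook conflicts early instead of silently overwriting values.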

 

Table 1: Summary of SPSS Data Processing

Domain                           SPSS Work Files  Lines of Syntax  Variables  SPSS Final Files  Location/Year Records
Population                                   246           10,325      2,782                 1                  3,579
Housing                                       84            2,612        391                 1                  1,046
Employment                                   231            8,788      2,213                 1                  1,638
Income                                       174            4,749        862                 1                  1,736
Education                                    110            6,887      2,285                 1                  1,865
Health                                       108            4,915      1,240                 1                  1,735
Accounts                                      60            1,982        286                 1                  1,023
Social Outcomes Integrated File                0               17     10,059                 1                  4,439

The large number of variables created by the integration process highlights the challenge of using social outcome data. Nations frequently differ among themselves and over time in the grouping of variables such as age, income, and education. Data tables also differ in the breakdowns they provide, as for example, education by age and gender. The team sought to preserve the highest level of detail available (e.g. single years of age). This approach provides the research community with the maximum flexibility in creating common variables.

The web archive includes the following:     

  1. Seven final SPSS files plus a single SPSS file integrating the seven domain files.

  2. Seven SPSS syntax files used for all data processing of the Excel SPSS-ready files plus the single SPSS file integrating the seven domain files.

  3. 2,239 PDF files downloaded from ArcticStat

  4. 2,410 “raw” Excel files

  5. 814 Excel SPSS-ready files

  6. 1,013 SPSS work files

  7. Seven Excel files with the following domain-specific information

     a. Variable names and labels

     b. Count of the number of location/year record combinations by variable by major reporting unit (Canada, Greenland, Norway, Sweden, Finland, Russia, Alaska, Iceland, Faroe Islands)

  8. A single Excel file “Excel Master Table List” identifying the source ArcticStat table number for each Excel SPSS-ready file along with the following attributes: domain, country, first year of data included in table, last year of data included in table, measures included, breakdowns, geography, table title, flag for whether the table was used to create an SPSS-ready Excel file, and whether the table was extracted from the ArcticStat website.

  9. A single Excel file “aon_region_hybrid” that assigns code numbers to nations, subnations (e.g. Greenland, provinces in Canada, Alaska), and regions. Note that since there are multiple types of geographic records (i.e. nation, subnation, region), the variable “subreg” assigns a unique code to every record and the variable “rectype” identifies the level of geography associated with each record.
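The role of “subreg” and “rectype” in the geographic code file can be sketched in Python as follows. Only the two variable names come from the description above. The code numbers, the `name` field, the place names, and the way “rectype” values are written are invented for illustration; the actual file should be consulted for the real codes.

```python
# Hypothetical rows in the spirit of "aon_region_hybrid".
# Code numbers, names, and rectype spellings are invented.
REGION_ROWS = [
    {"subreg": 1000, "rectype": "nation", "name": "Example nation"},
    {"subreg": 1100, "rectype": "subnation", "name": "Example province"},
    {"subreg": 1110, "rectype": "region", "name": "Example region"},
]


def index_by_subreg(rows):
    """Build a lookup by subreg, checking that every code is unique."""
    index = {}
    for row in rows:
        code = row["subreg"]
        if code in index:
            raise ValueError(f"subreg {code} is not unique")
        index[code] = row
    return index


def codes_for_level(rows, rectype):
    """Return the subreg codes for one level of geography."""
    return [row["subreg"] for row in rows if row["rectype"] == rectype]


index = index_by_subreg(REGION_ROWS)
print(index[1100]["rectype"])
print(codes_for_level(REGION_ROWS, "region"))
```

Because every record has its own code, records at different geographic levels can share one file, and “rectype” lets a user select a single level before comparing values.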

Use of ArcticStat AON-SIP Integrated Database

The ArcticStat AON-SIP Integrated Database is a step toward an end-user database. By "end-user" we mean, for example, a policy analyst who wishes to compare education data among arctic countries. End users will find that the current database contains few variables that can be directly compared. The database does, however, allow researchers to identify the variables that are immediately comparable and to compute comparable variables from other variables in the database. It is our hope that an institution or group of researchers will take the next important step of computing the maximum number of comparable variables. Completion of this next step will not only expand our ability to compare living conditions; it will also serve as the basis for specific recommendations for closing gaps in comparability.

The research team did not attempt to turn every Excel Raw File (i.e. an Excel file created from a PDF file using ABBYY) into an Excel SPSS-ready file. The team chose a subset of Excel files based on the relevance of the data and the number of years covered. Potential data users who do not find a particular variable in the SPSS files may be able to locate an Excel Raw File containing it by using the "Excel Master Table List".

ArcticStat is continually updated by the ArcticStat team under the leadership of Gérard Duhaime and the direction of Andrée Caron at Laval University in Quebec City. Researchers using data from the ArcticStat AON-SIP Integrated Database should credit the contributing national statistical agencies. These are:

  1. Statistics Canada (www.canada.gc.ca)

  2. Statistics Greenland (http://www.statgreen.gl)

  3. Statistics Norway (http://statbank.ssb.no)

  4. Statistics Sweden (www.scb.se)

  5. Statistics Iceland (http://www.statice.is)

  6. Statistics Finland (http://www.stat.fi/index_en.html)

  7. Rosstat (Statistics Russia) (http://www.gks.ru)

  8. U.S. Bureau of the Census (http://www.census.gov/)

  9. Statistics Faroe Islands (http://www.hagstova.fo)

National statistical agencies routinely expect data users to credit their work by individual data table. This form of citation is impractical when each cell in an integrated database may come from a different source table. The ArcticStat AON-SIP Integrated Database is, however, designed so that the source data can be identified by searching the SPSS syntax file for a particular variable name and the relevant country. For publication of data from multiple countries, we suggest the following:

"Source: ArcticStat via AON-SIP Integrated Database based on data contributed by the following national statistical agencies: <names of contributing agencies>"

Researchers using the ArcticStat AON-SIP Integrated Database can add new tables from ArcticStat (or other sources) to the database by translating the new Excel tables into SPSS using the variable naming codebooks in the domain-specific Excel files (see number seven(a) above), the appropriate geographic codes (see number nine above), and the SPSS MATCH FILES command.

Table 2 is an extract from the count of education variables by location/year record combination by major reporting unit (see number seven(b) above). It shows that, for this small subset of variables, data were directly captured only for location/year records in Canada, Alaska, Finland, and the Faroe Islands. Much more detailed data were captured for Greenland, Iceland, Norway, and Sweden. The more detailed data, however, were captured with different definitions, so each requires different grouping rules to arrive at a common set of education variables. Now that every data point is defined by a variable name drawn from a common codebook, the grouping can be easily accomplished with SPSS syntax and the new variables saved along with the original variables. This is not to say that creating new variables from variables already captured will produce the desired comparison variables across all arctic regions. The process will, however, enable the research community to identify gaps that can be filled by targeted revisions in national reporting.

Table 2: Example of Subset of Education Variables: Count of Number of Location/Year Records for Which Data Are Available by Country
(A dash marks a cell with no records.)

Code   Label                                                           Canada  Greenland  Alaska  Norway  Russia  Sweden  Finland  Iceland  Faroe Islands
Edu01  PERSONS 25 YEARS AND OLDER WITH LESS THAN HIGH SCHOOL                -          -     108       -       -       -        -        -              -
Edu02  PERSONS 25 YEARS AND OLDER WITH HIGH SCHOOL EDUCATION                -          -     108       -       -       -        -        -              -
Edu03  PERSONS 25 YEARS AND OLDER WITH MORE THAN HIGH SCHOOL                -          -     108       -       -       -        -        -              -
Edu04  NATIVE PERSONS 25 YEARS AND OLDER WITH LESS THAN HIGH SCHOOL         4          -      54       -       -       -        -        -              -
Edu05  NATIVE PERSONS 25 YEARS AND OLDER WITH HIGH SCHOOL EDUCATION         4          -      54       -       -       -        -        -              -
Edu06  NATIVE PERSONS 25 YEARS AND OLDER WITH MORE THAN HIGH SCHOOL         -          -      54       -       -       -        -        -              -
Edu07  Persons 15 years and over without high school certificate            4          -       -       -       -       -       21        -              2
Edu08  Persons 15 years and over w a high school certificate                4          -       -       -       -       -       21        -              2
Edu11  Persons 15 years and over who have completed university              4          -       -       -       -       -       21        -              2
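The grouping rules discussed before Table 2 would in practice be written in SPSS syntax. The Python sketch below only illustrates the idea of deriving a common variable from more detailed ones while keeping the originals. The detailed variable names, the new variable name, and the values are hypothetical, and the rule of leaving the result missing whenever a component is missing is an assumption of this sketch.

```python
def sum_components(record, components):
    """Sum detailed variables; return None if any component is missing."""
    values = [record.get(name) for name in components]
    if any(value is None for value in values):
        return None
    return sum(values)


# Hypothetical grouping rule: new common variable -> detailed source variables.
GROUPING_RULES = {
    "edu_less_than_hs": ["edu_no_schooling", "edu_primary_only"],
}


def add_common_variables(record, rules):
    """Return a copy of the record with grouped variables added alongside the originals."""
    result = dict(record)
    for new_name, components in rules.items():
        result[new_name] = sum_components(record, components)
    return result


complete = {"edu_no_schooling": 120.0, "edu_primary_only": 340.0}
partial = {"edu_no_schooling": 120.0}  # one component not reported

print(add_common_variables(complete, GROUPING_RULES))
print(add_common_variables(partial, GROUPING_RULES))
```

Returning a missing value instead of a partial sum avoids understating a category when a source table omits one of its components.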

 

The ArcticStat AON-SIP Integrated Database contains over 44 million data cells. Of this total, just over three percent (1,385,040 cells) contain data. Researchers working with the existing data will be able to increase the percentage of cells with data by calculating variables from sums of more detailed variables. The addition of new tables will also increase the data saturation percentage.
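The saturation figure can be checked against Table 1 with a short calculation. The sketch assumes that the total cell count is the number of variables multiplied by the number of location/year records in the integrated file, and that the count of filled cells is 1,385,040.

```python
variables = 10_059            # variables in the integrated file (Table 1)
records = 4_439               # location/year records in the integrated file (Table 1)
cells_with_data = 1_385_040   # filled cells, as reported in the text

total_cells = variables * records
saturation = cells_with_data / total_cells

print(f"total cells: {total_cells:,}")        # total cells: 44,651,901
print(f"share with data: {saturation:.1%}")   # share with data: 3.1%
```

Under these assumptions the product agrees with "over 44 million" cells and the share agrees with "just over three percent".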

The original tables containing these data occasionally had footnotes embedded in data cells or a cell structure (e.g. rows and columns) that changed within a single table. ABBYY translation software proved to be a powerful tool in extracting data. The software recognizes all languages included in the tables. It offers the user the ability to review and modify each step in the data extraction process. SPSS also provides valuable checks on the integrity of the data, flagging variables that are read as text due to embedded characters. All data are stored as numbers. It is likely, however, that there are instances in which data in the original tables came through the process with errors. The database is designed with the ability to conduct data integrity checks.
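The kind of check described here, in which values containing embedded characters are flagged, can be illustrated outside SPSS with a small Python sketch. The cell contents are hypothetical, and treating commas as thousands separators is an assumption that would not hold for tables using decimal commas.

```python
import re

# A plain number: optional minus sign, digits, optional decimal part.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def check_cell(raw):
    """Return (value, problem) for one table cell.

    Commas are treated as thousands separators (an assumption).
    Anything else that is not a plain number is reported as a problem.
    """
    text = raw.strip().replace(",", "")
    if text == "":
        return None, None  # empty cell: missing, but not an error
    if NUMERIC.match(text):
        return float(text), None
    return None, f"non-numeric content: {raw!r}"


# Hypothetical cells, including an embedded footnote marker.
cells = ["1,204", "1,204*", "..", "", "-3.5"]
for cell in cells:
    print(repr(cell), check_cell(cell))
```

A cell flagged in this way can then be compared with the corresponding “raw” Excel file or original PDF table to decide on the correct value.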

There are two key references for data integrity checks. The domain-specific SPSS syntax files each contain listings of the variable names saved from each Excel input workbook. Each Excel input workbook contains a copy of each of the “raw” Excel files used as a source for the Excel SPSS-ready files. Thus, for example, the Excel SPSS-ready spreadsheet “Population CA 4” is contained within the workbook “SI_Pop_CA_4” and this workbook also contains an Excel spreadsheet “Table 2005-10-04-04”, which is an ArcticStat table number. The user can use the same table number to view the original PDF file as it has the same name.

Users will notice that there are major differences between countries. In large part these differences are the result of differences in the organization of statistical agencies and the data sources included in the database. All data came from ArcticStat with one major exception: Alaska. Members of the AON-SIP team had been previously involved with others including Eric Larson at the Institute of Social and Economic Research at the University of Alaska Anchorage in the development of a social outcomes database derived from U.S. Census data. The AON-SIP team built on this earlier work to produce comparable data at the census area level for four decennial census counts: 1970, 1980, 1990, and 2000. Data were organized by 2000 census area boundaries. Organizing the data by 2000 census area boundaries required estimations based on place data since census area boundaries changed over time.

To avoid problems with file names containing spaces, each data folder contains one or more compressed (zip) files that hold the data files.

Researchers interested in working with the ArcticStat AON-SIP Integrated Database are encouraged to do so. To help develop a community of data users, please send an email to Jack Kruse if you download data. Users finding errors or wishing to add data to the database are encouraged to contact Jack Kruse at afjak@uaa.alaska.edu. He would also be eager for someone to take on the further development and management of these databases.

Directory to Data