
OPEN DATA HANDBOOK

New York State Open Data Initiative

Guidelines for Participating Agencies

Executive Order No. 95 provides a specific definition of “Publishable State Data” to guide covered State agencies. Publishing data on data.ny.gov involves a collaborative multi-step agency process (see Figure 1: Guidance Summary). In identifying Publishable State Data, agencies should include analyses from their executive and program staff, data coordinators, FOIL officers, data stewards/IT, public information officers, security and privacy officers, and legal counsel.

Covered State entities (and entities not covered by Executive Order 95) vary widely in terms of size, personnel, functions, responsibilities, mission, and data collected and maintained. As such, the identification and prioritization processes may vary across agencies and entities. These guidelines serve to provide assistance across a broad spectrum of agencies, with the stipulation that agencies look to their governing laws, rules, regulations, and policies in identifying and publishing “publishable state data.”

Figure 1: Guidance Summary

Data Set Identification

In creating a data catalogue, agencies should identify those datasets that are high value, high quality, complete, and in accordance with the definition of “Publishable State Data” within Executive Order 95. “High value” data, as defined within Executive Order 95, is that which can be used to increase the agency’s accountability and responsiveness, improve public knowledge of the agency and its operations, further the mission of the agency, create economic opportunity, or respond to a need or demand identified after public consultation.

The questions in Figure 2, and below, are not exhaustive and may not apply to all agencies, but they provide a framework for identifying potential data for publication on data.ny.gov. For each question, agencies must assess whether the data falls within the definition of “Publishable State Data” and must weigh the disclosure considerations that follow.

Figure 2: Identifying Publishable Data Sets

General Questions

Do the datasets represent discrete, usable information?

In identifying datasets, government entities may be concerned that users of data.ny.gov will not understand their raw data, or that the data might lose utility if distilled to its rawest form. For example, state and local rules might differ, such that publishing the two as raw, separate datasets may forgo the value gained by combining the raw data into a single dataset.

There are no hard and fast rules about what level of detail is sufficiently granular to add value to a government dataset. Whenever possible, government entities should resist the temptation to limit datasets to only those the agency believes might be understood or useful. Entities should be wary of underestimating the users of data.ny.gov. Those users may come from a variety of fields and specialties, including academic and other government users who can envision a use for the raw data not anticipated by the originating entity. A better practice is for the agency to ensure that the metadata describing the dataset is complete, including comprehensive overview documents describing the data, data collection, data fields, and presentation of research questions, to maximize the usefulness of the data.
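
As a rough illustration of the metadata practice described above, the sketch below checks a dataset's metadata record for missing documentation before publication. The field names (`overview`, `collection_method`, `data_dictionary`, `research_questions`) and the function `missing_metadata` are hypothetical, chosen to mirror the items listed in this section; they are not a data.ny.gov schema.

```python
# Hypothetical required fields, mirroring the documentation items named above.
REQUIRED_FIELDS = ("overview", "collection_method", "data_dictionary", "research_questions")


def is_blank(value):
    """Return True for None, empty or whitespace-only strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_metadata(record):
    """List the required fields that are absent or blank in a metadata record."""
    return [field for field in REQUIRED_FIELDS if is_blank(record.get(field))]


# Illustrative record (assumed content, not real agency metadata).
example_record = {
    "overview": "Inspection results by establishment.",
    "collection_method": "",
    "data_dictionary": {"score": "Inspection score assigned at the visit"},
}
gaps = missing_metadata(example_record)
print(gaps)
```

A non-empty result would suggest the dataset's documentation needs more work before the raw data is published, rather than that the data should be withheld.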

Release Prioritization

Executive Order 95 states: “Prioritization of publication of data based on the extent to which the data can be used to increase the covered State entity’s accountability and responsiveness, improve public knowledge of the entity and its operations, further the mission of the entity, create economic opportunity, or respond to a need or demand identified after public consultation…”

Executive Order 95 further states: “Data shall not be Publishable State Data if making such data available on the Open Data Website [data.ny.gov] would…impose an undue financial, operational or administrative burden on the covered State entity or State.”

When creating a schedule for publication of a particular dataset, agencies must make an assessment based upon a number of different factors. Agencies may use the guidance below to determine the priority for each data set. Prioritizing initial and ongoing publication will entail balancing high value with data quality, data availability, and data readiness. Each covered State entity shall create schedules and prioritize data publication in accordance with guidelines set forth herein, and in a timely manner, recognizing that it may take time for agencies to prepare high quality data (noting that datasets vary in complexity and, as such, can significantly vary in preparation time).

In prioritizing data for release, therefore, agencies must account for time to: identify data, assess the data (i.e., ensure consistency, timeliness, relevance, completeness, and accuracy of the data), ensure completeness of the metadata and data dictionary, review and obtain all necessary approvals to publish the data, and prepare data, metadata and requisite accompanying documentation for publication (Figure 3).

Figure 3: Prioritization

Below are suggested questions, the answers to which can assist agencies in prioritizing publication of high value “publishable state data” consistent with Executive Order 95:

  1. Does the data highlight agency performance, or might publication of the data benefit the public by setting higher standards? The agency might be in the forefront of standards for government performance, where exposing the data might cause other agencies to raise their performance.
  2. Has the data ever been published or made available in a machine-readable format so that it can be processed, analyzed, or re-used? Procedures may already be in place that can be leveraged to publish the data, such as exports for periodic department reviews or routine exchanges of data with other agencies.
  3. Is the data “high value”? While “high value” can be subjective, your agency best understands the needs of the constituency that it serves. Publishing relevant data can ultimately support those needs.
  4. Does availability of the data align with new State and/or Agency initiatives? Ordering the publication of any relevant datasets accordingly might be of great value.
  5. Does availability of the data align with federal initiatives or exposures of federal data? There may be higher value in the agency’s data if synergies can be created.
  6. Can publication of the data address regulatory or grant requirements? While some data required by regulations or grants may be inappropriate for public release, publishing the data may be an acceptable way to meet those requirements and make the data accessible to the public simultaneously.
  7. Does the data support decision making at the state, local, internal agency, or external agency level, or contain information that informs public policy? Publishing such a dataset can be a powerful platform for fostering productive civic engagement and policy debate.
  8. Is the data timely? What is the dataset refresh and maintenance cycle? Systems that support the ongoing operations of an agency are often kept up-to-date on a daily basis. Publishing raw or aggregate data drawn from these systems can provide tremendous value.
  9. Does availability of the data align with legal requirements for data publication? For example, there might be statutorily-required reporting which can be satisfied by publishing datasets, without necessarily needing an extensive narrative report. If the data is collected and compiled by the agency to fulfill statutory reporting requirements, then the agency’s governing laws have already determined that the data is of high value for that agency.
  10. Would availability of the data improve agency-to-agency communication? Certain government functions may involve multiple agencies requiring access to similar data.
  11. Could availability of the data create specific economic opportunity? In many cases, this will be unknown to the agency in advance. Some of the greatest successes of the open data movement have involved government data, such as weather data, being put to useful commercial purposes. To the extent the agency can anticipate significant commercial use of the data, the agency may wish to rank publication of such data more highly as it creates its schedule.
  12. Could the data be useful for the creation of novel and useful third-party applications, mobile applications, and services? Software applications often leverage data from multiple sources to provide value to their customers. Making agency data sets available can support the delivery of greater value (and impact) through those applications.
  13. Does the data further the core mission or strategic direction of the agency or multiple government entities? Publishing aggregated data (statistics, metrics, performance indicators) as well as raw data can often help an agency advance its strategic mission. In addition, data.ny.gov can serve as a conduit for efficiently sharing information with other agencies.
  14. Does the data have depth and breadth of years of coverage? Release of data with high information content and quality can improve accountability and responsiveness and/or improve public knowledge of the agency and its operations.
  15. Does the data have accompanying metadata and a data dictionary? Metadata and all accompanying documents should be comprehensive so as to provide a full understanding of the data and data elements to an end-user. This ensures version control, availability of contact information, and descriptive information sufficient for end-users to be able to use and interpret the data. In addition, where applicable, agencies should append disclaimers to highlight limitations of the data and/or prevent use of the data in misleading ways.
  16. Is the data accurate/complete? The dataset must be sufficiently final or complete, such that it is currently publishable. Agencies should work to transform any data sets or partial data sets which are not complete or high quality so that they can eventually be published. If there is a trigger allowing the agency to publish the data at some time in the future, then publication of the data should be scheduled accordingly.
  17. Is the dataset in a format that is machine-readable or can be easily transformed? The data should be organized or formatted in a manner which is machine-readable and that can be re-used, and capable of being digitally transmitted or processed. It should be in tabular or geo-spatial form. Agencies should consider the level of effort required to transform the data to a machine-readable format and maintain it in such a format.
  18. Is the data frequently requested? As demand is known and quantifiable, this should raise the value of this data for publication. If the dataset is the type that is requested through FOIL on a recurring basis, then the agency may reduce duplication and obtain efficiencies by posting data on data.ny.gov.
  19. Is the data needed by the public after-hours? Such demand may be known and quantifiable. Generally, when there is this type of demand for the data, such datasets should be ranked, where applicable, as higher value.
  20. Does the data have a direct impact on the public? The data is likely of higher value if it is already apparent there is a deep impact and interest by the public (e.g., hospital infection rates, food establishment inspection results, etc.).
  21. Is the data in strong demand from constituencies? The data might be of higher value to specific, narrow interest groups which may be the agency’s core constituency for those issues.
  22. Is the data of timely interest? Announcements of progress or success, or reactions to public criticism, can be strongly supported by publishing related data, should it exist.
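
One way an agency might turn the questions above into a draft publication order is a simple tally of "yes" answers per dataset. The sketch below is an assumption, not a method prescribed by Executive Order 95 or by these guidelines: the equal weighting of questions, the question keys, and the dataset names are all hypothetical.

```python
def priority_score(answers):
    """Count the prioritization questions answered "yes" (True) for one dataset."""
    return sum(1 for answered_yes in answers.values() if answered_yes)


def rank_datasets(catalogue):
    """Order dataset names from highest to lowest score.

    Python's sort is stable, so datasets with equal scores keep their catalogue order.
    """
    return sorted(catalogue, key=lambda name: priority_score(catalogue[name]), reverse=True)


# Hypothetical catalogue: each dataset maps question keys to yes/no answers.
catalogue = {
    "inspection_results": {
        "frequently_requested": True,
        "machine_readable": True,
        "has_data_dictionary": True,
    },
    "internal_memos": {
        "frequently_requested": False,
        "machine_readable": False,
        "has_data_dictionary": False,
    },
    "permit_counts": {
        "frequently_requested": False,
        "machine_readable": True,
        "has_data_dictionary": True,
    },
}
ranking = rank_datasets(catalogue)
print(ranking)
```

In practice an agency would likely weight some questions more heavily than others, and a high tally does not by itself establish that a dataset is Publishable State Data.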

Disclosure Considerations

As agencies classify data sets and catalogue Publishable State Data, they should be mindful of legal and policy restrictions on publication of certain kinds of data. The following guidelines regarding disclosure provide additional factors for consideration as agencies begin to identify and review datasets.

  1. Security, Privacy, Regulatory, & Aggregate Data. The public release of some agency data might result in the violation of laws, rules, or regulations. Some data may not be appropriate to release because it can compromise internal agency processes, such as procurement. Other data may contain personally identifiable information. Finally, even if detailed data appears innocuous, it may be possible to easily combine it with other public information to reveal sensitive details (commonly known as the mosaic effect).

Even if there are no legal impediments to publishing the data, releasing the data may have unintended or undesirable effects. For example, posting anonymized arrest records on a weekly basis might inadvertently reveal where police are concentrating enforcement efforts.

  1. Thresholds. Various statutes and regulations, such as the Health Insurance Portability and Accountability Act (“HIPAA”) and its privacy regulations, have very exacting requirements for determining whether data have been sufficiently de-identified so as not to compromise individual privacy. For example, the presence of medical conditions per geographic location might constitute high-value, useful, and sought-after data; however, exposing it might identify individuals and their medical conditions.

Another example is the Family Educational Rights and Privacy Act of 1974 (FERPA). Under FERPA, the Federal Government has established guidelines for data privacy to prevent individuals from being identified indirectly from aggregation of data. Agencies that deal with student educational data should be aware of guidelines that restrict publication of some data.

Even in the absence of specific legal prohibitions, government entities should watch for outlier publication conditions. For example, identifying a single arrestee who is a minor of a certain age in a certain county, without providing any other information, might nonetheless serve to identify that particular individual.

For particular datasets that pose such issues, agencies may consider providing aggregated data based upon their laws, rules, regulations, and policies. Alternatively, agencies may set disclosure thresholds for the dataset (many agencies already adhere to such standards). For example, if a cell in a particular dataset field goes below a certain number of individuals, the value in that particular cell should be hidden. In specific situations, government entities will need to balance their desire to publish accurate, complete, and valuable tabulations against the need to guard against unwarranted invasions of personal privacy.
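
The disclosure-threshold approach described above can be sketched as follows. The threshold of 5, the function name `suppress_small_cells`, and the sample counts are illustrative assumptions; the appropriate threshold depends on each agency's governing laws, rules, regulations, and policies.

```python
def suppress_small_cells(counts, threshold=5, marker=None):
    """Return a copy of `counts` with every value below `threshold` replaced by `marker`."""
    return {
        key: (value if value >= threshold else marker)
        for key, value in counts.items()
    }


# Hypothetical counts of individuals per county (not real data).
county_counts = {"County A": 120, "County B": 3, "County C": 5}
published = suppress_small_cells(county_counts)
# County B falls below the assumed threshold of 5, so its value is hidden.
print(published)
```

Hiding a single cell may not be enough when row or column totals are published alongside it, since the hidden value could then be recovered by subtraction; handling that case is outside this sketch.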

  1. FOIL Applicability. Under the NYS Public Officers Law, Article 6 (the NYS Freedom of Information Law, or “FOIL”), the presumption is that government records shall be open to the public, unless excludable under a narrow set of specific exemptions including such concerns as invasion of personal privacy, impairment of contractual or collective bargaining negotiations, exposure of protected trade secrets, interference with law enforcement or judicial proceedings, endangering life or safety, and others. Government entities should confer with their FOIL officers before publishing data on data.ny.gov, and should exclude any datasets whose publication would cause the harms described in FOIL and which therefore would not constitute “Publishable State Data.”

  2. Ownership Rights. In some circumstances, an agency may not possess all the necessary rights to be able to publish a specific data set. For example, if the data was collected or compiled by a third party, there may be a contractual or intellectual property limitation which prevents it from being made public. In these cases, the appropriate permission must be secured from the sourcing entity, and additional disclaimers may be required.

Narrative Content

Narrative content on its own is not appropriate for publishing on data.ny.gov. However, such content may have been developed based upon existing agency data which has already been, or will be, published.

If an agency develops extensive narrative reports about published data, then those reports should be published on the agency’s website, with a link to the published data set on data.ny.gov. It is important to keep this link current.
