Query the Data Delivery Network
Query the DDNThe easiest way to query any data on Splitgraph is via the "Data Delivery Network" (DDN). The DDN is a single endpoint that speaks the PostgreSQL wire protocol. Any Splitgraph user can connect to it at data.splitgraph.com:5432
and query any version of over 40,000 datasets that are hosted or proxied by Splitgraph.
For example, you can query the emissions_inventory_system_eis_emissions_2017
table in this repository, by referencing it like:
"pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest"."emissions_inventory_system_eis_emissions_2017"
or in a full query, like:
SELECT
":id", -- Socrata column ID
"emis_unit_proc_id", -- A designator used to uniquely identify an emissions process.
"unit_effectivedate", -- The date on which the identifier became effective.
"reductionefficiency", -- The percent reduction achieved for the pollutant when all control measures are operating as designed.
"controlmeasure", -- Identifies the piece of equipment or practice that is used to reduce one or more pollutants. 11 Single Cyclone 13 Multiple Cyclone 14 Centrifugal Separator 21 Spray Chamber 22 Cyclone Scrubber 23 Orfice Scrubber 24 Mechanical Scrubber 25 Venturi Scrubber 26 Packed Tower 27 Wet Filters 28 Water Curtain 29 Water Spray 31 Baghouse - Manual Cleaning 32 Baghouse - Mechanical Shakers 33 Baghouse - Pneumatic Shakers 34 Baghouse - Bag Collapse 35 Baghouse - Sonic Cleaning 36 Baghouse - Reverse Air Flow 37 Baghouse - Reverse Air Jets 39 Baghouse - Unknown Cleaning Mechanism 41 Single Stage Electrostatic Precipitator 42 Two Stage Electrostatic Precipitator 51 Settling Chamber 52 Impingement Separator 53 Panel Filters 54 Enclosure 55 Gravel Bed Filter 56 Annular Ring Filter 57 Demister 58 Continuous Medium Filter 61 Direct Flame Incin. w/o heat exchange 62 Catalytic Incinerator w/o heat exchange 63 Direct Flame Incin. with heat exchange 64 Catalytic Incinerator with heat exchange 71 Fixed Bed Adsorber 72 Continuous Adsorber 81 Surface Condenser 82 Contact Condenser 91 Packed Tower Adsorber 92 Plate or Tray Tower Adsorber 93 Spray Tower Adsorber 94 Spray Chamber Adsorber 95 Venturi Chamber Adsorber 96 Impingement Scrubber 97 Adsorption Column 101 Selective Catalytic Reduction 102 Selective Non-Catalytic Reduction 103 Non-Selective Catalytic Reduction 104 Oxidation Catalyst 105 Low NOx Burners 106 Water Injection 107 Steam Injection 108 Flue Gas Recirculation (FGR) 111 Heat Recovery Steam Generator (HRSG)
"controlapproachdescription", -- Identifies the overall control system or approach, including capture effectiveness, where applied at an emissions unit to reduce the amount of pollutants released into the environment.
"unitdesigncapacity", -- The measure of the size of the unit based on the maximum continuous throughput capacity of the unit. Measured in units of “E6BTU/HR” (million Btu/hr).
"emis_unit_id", -- Identifiers by which the emissions unit is known or has been known.
"effectivedate", -- Date that the Air Quality primary facility was created in eFACTS.
"longitudemeasure", -- The measure of the angular distance on a meridian east or est of the prime meridian.
"ems_unt_prc_rpt_pr_sp_cl", -- An identifier identifying the group of the next 5 columns.
"latitudemeasure", -- The measure of the angular distance on a meridian north or south of the equator.
"locationaddresspostalcode", -- The code the represents a U.S. Zip code or International postal code.
"locationcountycode", -- State's County Code. For Pennsylvania the county codes are from 01 - 67 in Alphabetical order.
"locationcountyname", -- Pennsylvania County Name
"processcomment", -- Any comments regarding the emissions process.
"geocoded_column", -- Georeferenced point column with Latitude and Longitude for use in creating map visuals. A generic point is given for each county to aid in the creation of placing metrics on map by county.
"averagehoursperday", -- The average number of hours per day that the emissions process is active within the reporting period.
"facilitysiteid", -- Identifiers by which the facility site is known or has been know, and the system associated with the identifier
"sp_cl_parameternumeratorumc", -- The numerator unit of measure for the parameter. When this is null, then Parameter Value is a percentage as described in the Parameter Comment. Code values are as follows: E6FT3S: MILLION STANDARD CUBIC FEET, E3GAL: 1000 GALLONS, TON:TONS, E6BTU:MILLION BTUS
"percentspringactivity", -- The percentage of the annual activity that occurred during the Spring months (March, April, May).
"sp_cl_parametercomment", -- Any comments regarding the parameter. Null - TON, Code values are as follows: E6FT3S: MILLION STANDARD CUBIC FEET, E3GAL: 1000 GALLONS, TON:TONS, E6BTU:MILLION BTUS
"pollutant", -- The pollutant which is controlled by the control measure. The number prefixes are part of the chemical name describing the isomer ( ions or molecules with identical formulas but distinct structures ) part of the name. For example - 1,2,3,4,5,6 refer to the positions on the cyclohexane molecule where a chlorine atom is substituted for a hydrogen. The carbon atoms on the cyclohexane ring can be numbered 1 through 6. Due to the symmetry of this particular molecule, it doesn’t really matter where we start counting on the ring. What does matter is that each number is only mentioned once, which means there is only 1 chlorine atom attached to each carbon. One can envision other forms of Hexachlorocyclohexane in which multiple chlorines are attached to a single carbon atom. For example: 1,1,2,2,3,3 Hexachlorocyclohexane would have two chlorine atoms attached to each of 3 sequential carbon atoms, with none on the other three… Now 1,2,3,4,5,6 Hexachlorocyclohexane can refer to a mixture of isomers of that molecule. This is because the ring of cyclohexane’s carbon atoms locks the attached hydrogens and chlorines onto one side of the ring, and this naming convention is not specific enough to tell which chlorine atoms are locked “above” the ring and which are “below it”. So in order to know which specific isomer we are talked about, a more specific name is required. This is where the other numbers can help. “58-89-9” is the Chemical Abstracts Service (registration) number, sometimes abbreviated CAS # or CASRN. This is really not part of the name of the molecule, but rather a unique serial number assigned to the molecule. The number itself is pretty much arbitrarily assigned by the American Chemical Society, but it is important in that it is unique, like a bar code for the molecule. The number is typically written in 3 parts separated by dashes, with two to seven digits in the first part, two in the second, and one in the third. If we google CAS#58-89-9 we find out that we are working with the “gamma” isomer of 1,2,3,4,5,6 Hexachlorocyclohexane where the chlorines are arranged around the ring in an up-up-down-up-up-down configuration. This is also known as “Lindane” a chemical which has been used as an insecticide and a pharmaceutical.
"unitcomment", -- Any comments regarding the emissions unit activity.
"locationaddressstatecode", -- The alphabetic codes that represent the name of the principal administrative subdivision of the United States, Canada, or Mexico. Standard 2-digit state code.
"localityname", -- The name of the city, town, village, or other locality.
"proc_identifier", -- An identifier by which an element is referred to in another system.
"proc_controlapproachdesc", -- Description of the overall control system or approach applied to an emissions unit or process.
"proc_approachcomment", -- Comments regarding the control approach.
"emis_unit_proc_rpt_prd_id", -- An identifier identifying the group of the next 13 columns.
"actualhoursperperiod", -- Actual number of hours the process is active or operating during for the reporting period.
"calculationmaterial", -- Name of the material or fuel processed.
"sp_cl_parametertype", -- Name of the parameter that describes the type of activity, throughput or input used in the calculation.
"averageweeksperperiod", -- The average number of weeks that the emissions process is active within the reporting period.
"percentwinteractivity", -- The percentage of the annual activity that occurred during the Winter months (December, January, February).
"denominatorunitofmeasurecode", -- The denominator for the unit of measure of the reported emission factor (in TONS).
"percentfallactivity", -- The percentage of the annual activity that occurred during the Fall months (September, October, November).
"reporting_pollutant", -- Code identifying the pollutant for which emissions are reported.
"emissioncalculationmethod", -- Defines the method used to calculate emissions.
"totalemissions", -- Total calculated or estimated amount of the pollutant (in TONS).
"emissionfactor", -- The emission factor used for the emissions value if a calculated value was provided (in TONS). The EMISSION_FACTOR = ROUND(source emission amount/throughput amount).
"emis_unit_proc_rpt_prd_emis", -- An identifier identifying the group of the previous 5 columns.
"emissionsyear", -- Year of the emission (discharge of criteria air pollutants (CAPs) and hazardous air pollutants (HAPs))
":@computed_region_rayf_jjgk",
":@computed_region_r6rf_p9et",
":@computed_region_amqz_jbr4",
":@computed_region_d3gw_znnf",
"unitoperationdate", -- The date in which the unit commenced operational activities
"period_startdate", -- The date on which the reporting period began. Applies to the reporting of episodic or event emissions only.
"period_enddate", -- The date on which the reporting period ended.
"release_point_desc", -- Identifiers by which the emissions release point is known or has been known.
"averagepercentemissions", -- The average annual percent of an emissions process that is vented through a release point.
"sp_cl_parameterdenominatorumc", -- The denominator unit of measure for the parameter. When this is null, then Parameter Value is a percentage as described in the Parameter Comment. Code values are as follows: E6FT3S: MILLION STANDARD CUBIC FEET, E3GAL: 1000 GALLONS, TON:TONS, E6BTU:MILLION BTUS
"sp_cl_parametervalue", -- The value of the parameter.
"unitdescription", -- Text description of the emissions unit.
"calculationparametervalue", -- Activity or throughput of the process for a given time period.
"parameterunitofmeasure", -- Code and description for the unit of measure for calculation parameter value.
"averagedaysperweek", -- The average number of days per week that the emissions process is active within the reporting period.
"percentsummeractivity", -- The percentage of the annual activity that occurred during the Summer months (June, July, August).
"facilitysiteidentifier", -- A state's designated identifier by which the facility site is referred to by a system. The name of the facility is in the Facility Site Name column.
":@computed_region_nmsq_hqvv",
"stateandcountyfipscode", -- The list is from FIPS Counties codes used for the identification of the Counties and County equivalents of the United States. The FIPS county code is a five-digit Federal Information Processing Standard (FIPS) code which uniquely identifies counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states. The first 2-digits are the State code and the last 3-digits are the county codes. 42001 Adams 42003 Allegheny 42005 Armstrong 42007 Beaver 42009 Bedford 42011 Berks 42013 Blair 42015 Bradford 42017 Bucks 42019 Butler 42021 Cambria 42023 Cameron 42025 Carbon 42027 Centre 42029 Chester 42031 Clarion 42033 Clearfield 42035 Clinton 42037 Columbia 42039 Crawford 42041 Cumberland 42043 Dauphin 42045 Delaware 42047 Elk 42049 Erie 42051 Fayette 42053 Forest 42055 Franklin 42057 Fulton 42059 Greene 42061 Huntingdon 42063 Indiana 42065 Jefferson 42067 Juniata 42069 Lackawanna 42071 Lancaster 42073 Lawrence 42075 Lebanon 42077 Lehigh 42079 Luzerne 42081 Lycoming 42083 McKean 42085 Mercer 42087 Mifflin 42089 Monroe 42091 Montgomery 42093 Montour 42095 Northampton 42097 Northumberland 42099 Perry 42101 Philadelphia 42103 Pike 42105 Potter 42107 Schuylkill 42109 Snyder 42111 Somerset 42113 Sullivan 42115 Susquehanna 42117 Tioga 42119 Union 42121 Venango 42123 Warren 42125 Washington 42127 Wayne 42129 Westmoreland 42131 Wyoming 42133 York
"unittype", -- Identifies the type of emissions unit activity. 100 Boiler. 270 Incinerator. 180 Process Heater 200 Furnace 220 Calciner 280 Flare 350 Process Equipment Fugitive Leaks 400 Storage Tank 430 Degreaser 450 Spray Booth or Coating Line 450 Spray Booth or Coating Line 470 Printing Line 470 Printing Line 470 Printing Line 480 Gasoline Loading Rack or Arm 600 Chemical Reactor 610 Oxidation Unit 620 Distillation Columns/Stripper 640 Mixer 680 Cooling Tower 720 Crusher 730 Grinder 740 Screen 760 Conveyor 770 Transfer Point 780 Silo 785 Open Storage Pile 255 Dryer
"sourceclassificationcode", -- EPA Source Classification code that identifies an emissions process. The description is given in the next field, processdescription.
"processdescription" -- A text description of the emissions process.
FROM
"pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest"."emissions_inventory_system_eis_emissions_2017"
LIMIT 100;
Connecting to the DDN is easy. All you need is an existing SQL client that can connect to Postgres. As long as you have a SQL client ready, you'll be able to query pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5
with SQL in under 60 seconds.
Query Your Local Engine
bash -c "$(curl -sL https://github.com/splitgraph/splitgraph/releases/latest/download/install.sh)"
Read the installation docs.
Splitgraph Cloud is built around Splitgraph Core (GitHub), which includes a local Splitgraph Engine packaged as a Docker image. Splitgraph Cloud is basically a scaled-up version of that local Engine. When you query the Data Delivery Network or the REST API, we mount the relevant datasets in an Engine on our servers and execute your query on it.
It's possible to run this engine locally. You'll need a Mac, Windows or Linux system to install sgr
, and a Docker installation to run the engine. You don't need to know how to actually use Docker; sgr
can manage the image, container and volume for you.
There are a few ways to ingest data into the local engine.
For external repositories, the Splitgraph Engine can "mount" upstream data sources by using sgr mount
. This feature is built around Postgres Foreign Data Wrappers (FDW). You can write custom "mount handlers" for any upstream data source. For an example, we blogged about making a custom mount handler for HackerNews stories.
For hosted datasets (like this repository), where the author has pushed Splitgraph Images to the repository, you can "clone" and/or "checkout" the data using sgr clone
and sgr checkout
.
Cloning Data
Because pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest
is a Splitgraph Image, you can clone the data from Spltgraph Cloud to your local engine, where you can query it like any other Postgres database, using any of your existing tools.
First, install Splitgraph if you haven't already.
Clone the metadata with sgr clone
This will be quick, and does not download the actual data.
sgr clone pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5
Checkout the data
Once you've cloned the data, you need to "checkout" the tag that you want. For example, to checkout the latest
tag:
sgr checkout pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest
This will download all the objects for the latest
tag of pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5
and load them into the Splitgraph Engine. Depending on your connection speed and the size of the data, you will need to wait for the checkout to complete. Once it's complete, you will be able to query the data like you would any other Postgres database.
Alternatively, use "layered checkout" to avoid downloading all the data
The data in pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest
is 0 bytes. If this is too big to download all at once, or perhaps you only need to query a subset of it, you can use a layered checkout.:
sgr checkout --layered pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5:latest
This will not download all the data, but it will create a schema comprised of foreign tables, that you can query as you would any other data. Splitgraph will lazily download the required objects as you query the data. In some cases, this might be faster or more efficient than a regular checkout.
Read the layered querying documentation to learn about when and why you might want to use layered queries.
Query the data with your existing tools
Once you've loaded the data into your local Splitgraph Engine, you can query it with any of your existing tools. As far as they're concerned, pa-gov/emissions-inventory-system-eis-emissions-2017-p7sf-rbb5
is just another Postgres schema.