splitgraph.ingestion.csv package
Submodules
splitgraph.ingestion.csv.common module
- class splitgraph.ingestion.csv.common.CSVOptions(autodetect_header, autodetect_dialect, autodetect_encoding, autodetect_sample_size, schema_inference_rows, delimiter, quotechar, header, encoding, ignore_decode_errors)
Bases: tuple
- autodetect_dialect: bool
Alias for field number 1
- autodetect_encoding: bool
Alias for field number 2
- autodetect_header: bool
Alias for field number 0
- autodetect_sample_size: int
Alias for field number 3
- delimiter: str
Alias for field number 5
- encoding: str
Alias for field number 8
- classmethod from_fdw_options(fdw_options)
- header: bool
Alias for field number 7
- ignore_decode_errors: bool
Alias for field number 9
- quotechar: str
Alias for field number 6
- schema_inference_rows: int
Alias for field number 4
- to_csv_kwargs()
- to_table_options()
Turn this into a dict of table options that can be plugged back into CSVDataSource.
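Since CSVOptions is a NamedTuple, instances are immutable and built from keyword arguments named after the fields above. A minimal sketch (all fields are passed explicitly rather than relying on defaults; feeding the output of to_table_options() back through from_fdw_options() is an assumption based on the two docstrings):

```python
from splitgraph.ingestion.csv.common import CSVOptions

# Build an options tuple explicitly; field names match the class definition above.
options = CSVOptions(
    autodetect_header=False,
    autodetect_dialect=False,
    autodetect_encoding=False,
    autodetect_sample_size=65536,
    schema_inference_rows=100000,
    delimiter=";",
    quotechar='"',
    header=True,
    encoding="utf-8",
    ignore_decode_errors=False,
)

# Serialize into a dict of table options for CSVDataSource...
table_options = options.to_table_options()
# ...and (assumption) rebuild an equivalent tuple from those FDW-style options.
roundtripped = CSVOptions.from_fdw_options(table_options)
```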
- splitgraph.ingestion.csv.common.autodetect_csv(stream: io.RawIOBase, csv_options: splitgraph.ingestion.csv.common.CSVOptions) → splitgraph.ingestion.csv.common.CSVOptions
Autodetect the CSV dialect, encoding, header etc.
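A sketch of running autodetection over a local file (opening with buffering=0 yields an io.RawIOBase stream as the signature expects; assuming CSVOptions provides defaults for the fields not set here):

```python
from splitgraph.ingestion.csv.common import CSVOptions, autodetect_csv

# buffering=0 gives a raw (io.RawIOBase) binary stream, matching the signature above.
with open("data.csv", "rb", buffering=0) as stream:
    detected = autodetect_csv(
        stream,
        CSVOptions(
            autodetect_header=True,
            autodetect_dialect=True,
            autodetect_encoding=True,
        ),  # assumption: the remaining fields have defaults mirroring params_schema
    )

print(detected.delimiter, detected.quotechar, detected.encoding, detected.header)
```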
- splitgraph.ingestion.csv.common.dump_options(options: Dict[str, Any]) → Dict[str, str]
- splitgraph.ingestion.csv.common.get_s3_params(fdw_options: Dict[str, Any]) → Tuple[minio.api.Minio, str, str]
- splitgraph.ingestion.csv.common.load_options(options: Dict[str, str]) → Dict[str, Any]
- splitgraph.ingestion.csv.common.log_to_postgres(*args, **kwargs)
- splitgraph.ingestion.csv.common.make_csv_reader(response: io.IOBase, csv_options: splitgraph.ingestion.csv.common.CSVOptions) → Tuple[splitgraph.ingestion.csv.common.CSVOptions, _csv._reader]
- splitgraph.ingestion.csv.common.pad_csv_row(row: List[str], num_cols: int, row_number: int) → List[str]
Preprocess a CSV file row to make the parser more robust.
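For example, pad_csv_row can be slotted into a manual read loop so that every parsed row is normalized to the expected column count before use (the exact padding/truncation behaviour is an assumption based on the docstring):

```python
import csv

from splitgraph.ingestion.csv.common import pad_csv_row

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row_number, row in enumerate(reader, start=1):
        # Normalize ragged rows to len(header) fields (assumed behaviour).
        row = pad_csv_row(row, num_cols=len(header), row_number=row_number)
        print(dict(zip(header, row)))
```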
splitgraph.ingestion.csv.fdw module
- class splitgraph.ingestion.csv.fdw.CSVForeignDataWrapper(fdw_options, fdw_columns)
Bases: object
Foreign data wrapper for CSV files stored in S3 buckets or served over HTTP.
- can_sort(sortkeys)
- execute(quals, columns, sortkeys=None)
Main Multicorn entry point.
- explain(quals, columns, sortkeys=None, verbose=False)
- get_rel_size(quals, columns)
- classmethod import_schema(schema, srv_options, options, restriction_type, restricts)
- splitgraph.ingestion.csv.fdw.log_to_postgres(*args, **kwargs)
- splitgraph.ingestion.csv.fdw.report_errors(table_name: str)
Context manager that ignores exceptions and serializes them to JSON using PG’s notice mechanism instead. The data source is meant to load these to report on partial failures (e.g. failed to load one table, but not others).
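A sketch of how a data source might wrap per-table loads in this context manager so that one bad table doesn't abort the rest (load_single_table is a hypothetical helper):

```python
from splitgraph.ingestion.csv.fdw import report_errors

for table_name in ["orders", "customers", "events"]:
    # Exceptions raised inside the block are swallowed and emitted as JSON
    # notices instead of propagating.
    with report_errors(table_name):
        load_single_table(table_name)  # hypothetical per-table ingestion step
```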
Module contents
- class splitgraph.ingestion.csv.CSVDataSource(engine: PostgresEngine, credentials: Credentials, params: Params, tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None)
Bases: splitgraph.hooks.data_source.fdw.ForeignDataWrapperDataSource
- commandline_help: str = 'Mount CSV files in S3/HTTP.\n\nIf passed an URL, this will live query a CSV file on an HTTP server. If passed\nS3 access credentials, this will scan a bucket for CSV files, infer their schema\nand make them available to query over SQL. \n\nFor example: \n\n\x08\n```\nsgr mount csv target_schema -o@- <<EOF\n {\n "s3_endpoint": "cdn.mycompany.com:9000",\n "s3_access_key": "ABCDEF",\n "s3_secret_key": "GHIJKL",\n "s3_bucket": "data",\n "s3_object_prefix": "csv_files/current/",\n "autodetect_header": true,\n "autodetect_dialect": true,\n "autodetect_encoding": true\n }\nEOF\n```\n'
- commandline_kwargs_help: str = "s3_access_key:\ns3_secret_key:\nconnection:\nautodetect_header: Detect whether the CSV file has a header automatically.\nautodetect_dialect: Detect the CSV file's dialect (separator, quoting characters etc) automatically.\nautodetect_encoding: Detect the CSV file's encoding automatically.\nautodetect_sample_size: Sample size, in bytes, for encoding/dialect/header detection.\nschema_inference_rows: Number of rows to use for schema inference.\nencoding: Encoding of the CSV file.\nignore_decode_errors: Ignore errors when decoding the file.\nheader: First line of the CSV file is its header.\ndelimiter: Character used to separate fields in the file.\nquotechar: Character used to quote fields."
- credentials_schema: Dict[str, Any] = {'properties': {'s3_access_key': {'type': 'string'}, 's3_secret_key': {'type': 'string'}}, 'type': 'object'}
- classmethod from_commandline(engine, commandline_kwargs) → splitgraph.ingestion.csv.CSVDataSource
Instantiate an FDW data source from commandline arguments.
- classmethod get_description() → str
- get_fdw_name()
- classmethod get_name() → str
- get_raw_url(tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None, expiry: int = 3600) → Dict[str, List[Tuple[str, str]]]
Get a list of public URLs for each table in this data source, e.g. to export the data as CSV. These may be temporary (e.g. pre-signed S3 URLs) but should be accessible without authentication.
Parameters:
tables – A TableInfo object overriding the table params of the source
expiry – The URL should be valid for at least this many seconds
Returns:
Dict of table_name -> list of (mimetype, raw URL)
- get_remote_schema_name() → str
Override this if the FDW supports IMPORT FOREIGN SCHEMA
- get_server_options()
- get_table_options(table_name: str, tables: Optional[Union[List[str], Dict[str, Tuple[List[splitgraph.core.types.TableColumn], TableParams]]]] = None) → Dict[str, str]
- classmethod migrate_params(params: Params) → Params
- params_schema: Dict[str, Any] = {'properties': {'autodetect_dialect': {'default': True, 'description': "Detect the CSV file's dialect (separator, quoting characters etc) automatically", 'type': 'boolean'}, 'autodetect_encoding': {'default': True, 'description': "Detect the CSV file's encoding automatically", 'type': 'boolean'}, 'autodetect_header': {'default': True, 'description': 'Detect whether the CSV file has a header automatically', 'type': 'boolean'}, 'autodetect_sample_size': {'default': 65536, 'description': 'Sample size, in bytes, for encoding/dialect/header detection', 'type': 'integer'}, 'connection': {'oneOf': [{'type': 'object', 'required': ['connection_type', 'url'], 'properties': {'connection_type': {'type': 'string', 'const': 'http'}, 'url': {'type': 'string', 'description': 'HTTP URL to the CSV file'}}}, {'type': 'object', 'required': ['connection_type', 's3_endpoint', 's3_bucket'], 'properties': {'connection_type': {'type': 'string', 'const': 's3'}, 's3_endpoint': {'type': 'string', 'description': 'S3 endpoint (including port if required)'}, 's3_region': {'type': 'string', 'description': 'Region of the S3 bucket'}, 's3_secure': {'type': 'boolean', 'description': 'Whether to use HTTPS for S3 access'}, 's3_bucket': {'type': 'string', 'description': 'Bucket the object is in'}, 's3_object': {'type': 'string', 'description': 'Limit the import to a single object'}, 's3_object_prefix': {'type': 'string', 'description': 'Prefix for object in S3 bucket'}}}], 'type': 'object'}, 'delimiter': {'default': ',', 'description': 'Character used to separate fields in the file', 'type': 'string'}, 'encoding': {'default': 'utf-8', 'description': 'Encoding of the CSV file', 'type': 'string'}, 'header': {'default': True, 'description': 'First line of the CSV file is its header', 'type': 'boolean'}, 'ignore_decode_errors': {'default': False, 'description': 'Ignore errors when decoding the file', 'type': 'boolean'}, 'quotechar': {'default': '"', 'description': 'Character used to quote fields', 'type': 'string'}, 'schema_inference_rows': {'default': 100000, 'description': 'Number of rows to use for schema inference', 'type': 'integer'}}, 'type': 'object'}
- supports_load = True
- supports_mount = True
- supports_sync = False
- table_params_schema: Dict[str, Any] = {'properties': {'autodetect_dialect': {'default': True, 'description': "Detect the CSV file's dialect (separator, quoting characters etc) automatically", 'type': 'boolean'}, 'autodetect_encoding': {'default': True, 'description': "Detect the CSV file's encoding automatically", 'type': 'boolean'}, 'autodetect_header': {'default': True, 'description': 'Detect whether the CSV file has a header automatically', 'type': 'boolean'}, 'autodetect_sample_size': {'default': 65536, 'description': 'Sample size, in bytes, for encoding/dialect/header detection', 'type': 'integer'}, 'delimiter': {'default': ',', 'description': 'Character used to separate fields in the file', 'type': 'string'}, 'encoding': {'default': 'utf-8', 'description': 'Encoding of the CSV file', 'type': 'string'}, 'header': {'default': True, 'description': 'First line of the CSV file is its header', 'type': 'boolean'}, 'ignore_decode_errors': {'default': False, 'description': 'Ignore errors when decoding the file', 'type': 'boolean'}, 'quotechar': {'default': '"', 'description': 'Character used to quote fields', 'type': 'string'}, 's3_object': {'description': 'S3 object of the CSV file', 'type': 'string'}, 'schema_inference_rows': {'default': 100000, 'description': 'Number of rows to use for schema inference', 'type': 'integer'}, 'url': {'description': 'HTTP URL to the CSV file', 'type': 'string'}}, 'type': 'object'}
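A sketch of constructing the data source directly with a credentials/params payload shaped after credentials_schema and params_schema above, then mounting it. Engine retrieval via splitgraph.engine.get_engine and the mount() call inherited from the FDW data source base class are assumptions; the endpoint, bucket and keys are placeholders:

```python
from splitgraph.engine import get_engine
from splitgraph.ingestion.csv import CSVDataSource

source = CSVDataSource(
    engine=get_engine(),
    credentials={"s3_access_key": "ABCDEF", "s3_secret_key": "GHIJKL"},
    params={
        "connection": {
            "connection_type": "s3",
            "s3_endpoint": "cdn.mycompany.com:9000",
            "s3_bucket": "data",
            "s3_object_prefix": "csv_files/current/",
        },
        "autodetect_header": True,
        "autodetect_dialect": True,
        "autodetect_encoding": True,
    },
)

# Mount all discovered CSV files as foreign tables in target_schema
# (assumed to be provided by the FDW data source base class).
source.mount("target_schema")

# Pre-signed URLs for each table, valid for at least an hour.
urls = source.get_raw_url(expiry=3600)
```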
- class splitgraph.ingestion.csv.CSVIngestionAdapter
Bases: splitgraph.ingestion.common.IngestionAdapter
- static create_ingestion_table(data, engine, schema: str, table: str, **kwargs)
- static data_to_new_table(data, engine: PostgresEngine, schema: str, table: str, no_header: bool = True, **kwargs)
- static query_to_data(engine, query: str, schema: Optional[str] = None, **kwargs)
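A heavily hedged sketch of driving the two documented static methods by hand: create the target table from a CSV buffer, then load its rows. The two-step sequence, the header handling and the engine retrieval are all assumptions about how the IngestionAdapter base class uses these hooks:

```python
import io

from splitgraph.engine import get_engine
from splitgraph.ingestion.csv import CSVIngestionAdapter

engine = get_engine()
buf = io.StringIO("id,name\n1,alice\n2,bob\n")

# Assumption: this infers a schema from the buffer and creates staging.users.
CSVIngestionAdapter.create_ingestion_table(buf, engine, schema="staging", table="users")

# Assumption: no_header=False tells the loader the buffer starts with a header row.
buf.seek(0)
CSVIngestionAdapter.data_to_new_table(buf, engine, schema="staging", table="users", no_header=False)
```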
- splitgraph.ingestion.csv.copy_csv_buffer(data, engine: PsycopgEngine, schema: str, table: str, no_header: bool = False, **kwargs)
Copy CSV data from a buffer into a given schema/table
- splitgraph.ingestion.csv.query_to_csv(engine: PsycopgEngine, query, buffer, schema: Optional[str] = None)
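These two helpers can be paired to round-trip data through CSV: dump a query's results to a buffer, then copy that buffer into an existing table. Engine retrieval, the schema/table names and the requirement that the target table already exists are assumptions:

```python
import io

from splitgraph.engine import get_engine
from splitgraph.ingestion.csv import copy_csv_buffer, query_to_csv

engine = get_engine()
buf = io.StringIO()

# Export: write the query's result set into the buffer as CSV.
query_to_csv(engine, "SELECT id, name FROM users", buf, schema="source_schema")

# Import: copy the buffer into a pre-existing table in another schema.
# no_header=False assumes the exported CSV starts with a header row.
buf.seek(0)
copy_csv_buffer(buf, engine, "target_schema", "users_copy", no_header=False)
```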