Python API
==========

The Python API allows users to query Sleeper from Python, and to trigger uploads of data in Parquet files. There is also the ability to upload records directly from the Python API, but this is only intended to be used for very small volumes of data.

## Requirements

* Python 3.7+

## Installation

From the `python` directory, run:

```bash
pip install .
```

## Known issues

* Python (pyarrow) uses INT64 in saved Parquet files, so the Sleeper schema must use LongType, not IntType, for integer columns.

## Functions

### SleeperClient(basename: str)

This is the initialiser for the SleeperClient class. This class is also how the user will run the ingest and query functions.

* `basename`: This is the Sleeper instance id.

#### Example:

```python
from sleeper.sleeper import SleeperClient

my_sleeper = SleeperClient('my_sleeper_instance')
```

### write_single_batch(table_name: str, records_to_write: list)

Write a batch of records into Sleeper. As noted above, this is not intended to be used for large volumes of data.

* `table_name`: This is the name of the table to ingest data to.
* `records_to_write`: This should be a list of dictionaries to write to Sleeper. Each dictionary should contain a single record.

#### Example:

```python
records = [{'key': 'my_key', 'value': 'my_value'},
           {'key': 'my_key2', 'value': 'my_value2'}]
my_sleeper.write_single_batch('my_table', records)
```

### ingest_parquet_files_from_s3(table_name: str, files: list)

Ingests data from Parquet files in S3 into the given table. Note that Sleeper must have been given permission to read files in that bucket. This can be done by specifying the `sleeper.ingest.source.bucket` parameter in the instance properties file.

* `table_name`: This is the name of the table to ingest data to.
* `files`: This should be a list of files or directories in S3, in the form `bucket/file`. If directories are specified then all Parquet files contained in them will be ingested.

#### Example:

```python
files = ["my_bucket/a_directory/", "my_bucket/a_file"]
my_sleeper.ingest_parquet_files_from_s3('my_table', files)
```

### bulk_import_parquet_files_from_s3(table_name: str, files: list, id: str = str(uuid.uuid4()), platform_spec: dict = None)

Bulk imports the data from Parquet files in S3 into the given table. Note that Sleeper must have been given permission to read files in that bucket. This can be done by specifying the `sleeper.ingest.source.bucket` parameter in the instance properties file.

* `table_name`: This is the name of the table to ingest data to.
* `files`: This should be a list of files or directories in S3, in the form `bucket/file`. If directories are specified then all Parquet files contained in them will be ingested.
* `id`: This is the id of the bulk import job. This id will appear in the name of the cluster that runs the job. If no id is provided a random one will be generated. Note that only lower case letters, numbers and dashes should be used.
* `platform_spec`: This optional parameter allows you to configure details of the EMR cluster that is created to run the bulk import job. This should be a dict containing parameters specifying details of the cluster (see the second example below). If this is not provided then sensible defaults are used.
#### Example:

```python
files = ["my_bucket/a_directory/", "my_bucket/a_file"]
my_sleeper.bulk_import_parquet_files_from_s3('my_table', files, 'mybulkimportjob')

platform_spec = {
    "sleeper.table.bulk.import.emr.executor.initial.instances": "1",
    "sleeper.table.bulk.import.emr.executor.max.instances": "10",
    "sleeper.table.bulk.import.emr.release.label": "emr-6.10.0",
    "sleeper.table.bulk.import.emr.master.x86.instance.type": "m6i.xlarge",
    "sleeper.table.bulk.import.emr.executor.x86.instance.type": "m6i.4xlarge",
    "sleeper.table.bulk.import.emr.executor.market.type": "SPOT"  # Use "ON_DEMAND" for on-demand instances
}
my_sleeper.bulk_import_parquet_files_from_s3('my_table', files, 'my_bulk_import_job', platform_spec)
```

### exact_key_query(table_name: str, keys: list)

Queries Sleeper for records matching specific keys.

* `table_name`: This is the name of the table to query.
* `keys`: This should be the key or keys to query Sleeper for. It can be given either as a dict mapping the row key field name (as it appears in the Sleeper table schema) to a list of values, or as a list of dicts each mapping the row key field name to a single value (see the example below).

This function returns a list of the records that contain the queried keys. Each element of this list is another list containing two tuples: one contains the schema name of the key field followed by the key (the one that was queried), and the other contains the associated value.

#### Example:

```python
my_sleeper.exact_key_query('my_table', {"key": ["akey", "anotherkey", "yetanotherkey"]})

# An equivalent form
my_sleeper.exact_key_query('my_table', [{"key": "akey"}, {"key": "anotherkey"}, {"key": "yetanotherkey"}])
```

And this would return something along the lines of

```python
[[('key', 'akey'), ('value', 'my_value')]]
```

In this example, there was one record found with the key `akey`, which has the value `my_value`. If there were more records with this key then the returned list would be longer.

### range_key_query(table_name: str, regions: list)

Queries for all records where the key is in a given range, for example between 'a' and 'c'.

* `table_name`: This is the name of the table to query.
* `regions`: A list of regions, where each region is a dict whose key is the name of the row key field (as it appears in the Sleeper table schema) and whose value is a list describing the range. The list is either of the form `["a", "c"]`, giving the minimum and maximum of the range, or of the form `["a", True, "b", True]`, where the booleans specify whether each end is inclusive (True means inclusive; the default is that the minimum is inclusive and the maximum is exclusive).

#### Example function call

```python
my_sleeper.range_key_query('my_table', [{"key": ["a", True, "c", False]}])
```

And this would return something similar to

```python
[[('key', 'a'), ('value', 'first_key')],
 [('key', 'b1'), ('value', 'second_key')],
 [('key', 'b2'), ('value', 'third_key')]]
```

### create_batch_writer(table_name)

Creates a writer for writing batches of records to Sleeper.

* `table_name`: This is the name of the table to ingest data to.

The returned object is designed to be used within a Python context manager, as follows:

```python
with my_sleeper.create_batch_writer('my_table_name') as writer:
    records = [....]
    writer.write(records)
    # Write some more records
    more_records = [....]
    writer.write(more_records)
```

See the code examples for a working example.
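Putting the calls above together, the following is a minimal sketch of a write-then-read round trip. It assumes an instance id of `my_sleeper_instance` and a table `my_table` whose schema has a string row key field `key` and a string value field `value`; replace these names with the ones from your own instance.

```python
from sleeper.sleeper import SleeperClient

# Assumed instance id and table/field names - replace with your own.
my_sleeper = SleeperClient('my_sleeper_instance')

# Write a small batch of records (suitable for small volumes only, as noted above).
records = [{'key': 'akey', 'value': 'my_value'},
           {'key': 'anotherkey', 'value': 'another_value'}]
my_sleeper.write_single_batch('my_table', records)

# Query the records back by key.
results = my_sleeper.exact_key_query('my_table', {'key': ['akey', 'anotherkey']})
for record in results:
    print(record)  # e.g. [('key', 'akey'), ('value', 'my_value')]
```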
## Examples

More examples can be found in the `examples` directory.

* `simplest_example.py` - A simple demonstration of interacting with Sleeper.
* `large_writes.py` - Example showing how to write large batches of data to Sleeper.
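As a hedged illustration of the note under Known issues, the sketch below shows one way to prepare a Parquet file with pyarrow so that integer columns are stored as INT64, matching a Sleeper LongType field. The field names, file name and bucket are hypothetical, and the ingest call follows the usage described above.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data for a table whose schema has a string row key 'key'
# and a LongType field 'count'.
data = pa.table({
    'key': pa.array(['a', 'b', 'c'], type=pa.string()),
    'count': pa.array([1, 2, 3], type=pa.int64()),  # int64 matches Sleeper's LongType
})
pq.write_table(data, 'my_data.parquet')

# After uploading my_data.parquet to an S3 bucket that Sleeper can read
# (see the sleeper.ingest.source.bucket instance property), it could be ingested with:
# my_sleeper.ingest_parquet_files_from_s3('my_table', ['my_bucket/my_data.parquet'])
```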