WhizzML Reference Manual

4.12 BigML resources

The WhizzML standard library provides procedures to list, create, fetch, update and delete BigML resources. There are thus five generic functions that work for any resource type, as well as specialized versions of the listing and creation calls, where the resource type is implicit.

4.12.1 Resource types

In resources and the list and create families of calls (see below), the resource type can be any of the following supported types:

  • source

  • dataset

  • model

  • composite

  • fusion

  • optiml

  • ensemble

  • prediction

  • batchprediction

  • evaluation

  • anomaly

  • anomalyscore

  • batchanomalyscore

  • cluster

  • centroid

  • batchcentroid

  • association

  • associationset

  • linearregression

  • logisticregression

  • correlation

  • statisticaltest

  • topicmodel

  • topicdistribution

  • batchtopicdistribution

  • deepnet

  • timeseries

  • forecast

  • pca

  • projection

  • batchprojection

  • project

  • configuration

  • sample

  • library

  • script

  • execution

4.12.2 Resource identifiers

We provide a few methods for getting common information from resources and their identifiers:

(resource-id? obj) \(\rightarrow \) boolean
(parse-resource-id res) \(\rightarrow \) list
(resource-types) \(\rightarrow \) list
(resource-type res) \(\rightarrow \) string
(resource-id res) \(\rightarrow \) resource-id

To check whether obj is a well-formed resource identifier, use the resource-id? predicate. Resource identifiers are of the form type/bare-id, and parse-resource-id returns a list with the two components of the identifier if it is well-formed (or the empty list otherwise). One can ask for a list of all available resource types using the standard procedure resource-types, which takes no arguments and returns a sorted list of strings. Finally, given either a resource map or its identifier, the primitives resource-type and resource-id extract from it the corresponding resource type or resource identifier. In both cases, passing as argument a value that is neither a resource map nor a resource identifier produces the empty string as the result of the call.

(resource-id? 3) ;; => false
(resource-id? "source/12121232123123123123123") ;; => true
(parse-resource-id "source/12121232123123123123123")
  ;; => ["source" "12121232123123123123123"]
(parse-resource-id "not-a-resource-id") ;; => []
(resource-id "source/12121232123123123123123")
  ;; => "source/12121232123123123123123"
(resource-id (fetch "source/12121232123123123123123"))
  ;; => "source/12121232123123123123123"
(resource-id "nosource/12121232123123123123123123") ;; => ""
(resource-id 3) ;; => ""
(resource-id {"foo" 3}) ;; => ""
(resource-type "source/12121232123123123123123") ;; => "source"
(resource-type (fetch "source/12121232123123123123123")) ;; => "source"
(resource-type "nosource/12121232123123123123123123") ;; => ""
(resource-type 3) ;; => ""
(resource-types) ;; => ["anomaly" "anomalyscore" ... "topicmodel"]

4.12.3 Resource properties

In order to obtain a resource property given its identifier, one needs to fetch the resource first and then access the resulting map. This is such a common operation that the standard library provides a helper, optimized to fetch only the minimal information needed:

(resource-property res-or-id path [default]) \(\rightarrow \) any

This procedure takes either a resource identifier or a full resource map (such as the ones returned by fetch) and extracts from it the property indicated by the string or list of strings path. If the property is not found, resource-property will throw an error unless a default value (the last, optional argument) has been provided. So, when id is the identifier of a resource, the call

(resource-property id path default)

is loosely equivalent to

((fetch id) path default)

with the difference that the fetch performed by resource-property sets query string parameters that minimize bandwidth. The name of a resource is requested so often that we provide a trivial specialization:

(resource-name res-or-id [default]) \(\rightarrow \) string

Similarly, syntactic sugar is provided for checking whether a source is open for editing.

(source-open? res-or-id) \(\rightarrow \) boolean

See also Fields map retrieval for retrieval of the "fields" property.
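For instance, assuming src-id holds the identifier of an existing source (a hypothetical value), these helpers could be used as follows:

(resource-property src-id "name")
(resource-property src-id ["source_parser" "separator"] ",")
(resource-name src-id "unnamed")
(source-open? src-id)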

4.12.4 Resource children

In order to model, predict or evaluate, resources are created from other, previously existing resources. For instance, to build a model you first need a source containing your uploaded data and then a dataset summarizing the information in the source. The dataset becomes the origin for the final model. Therefore, many resources can stem from a given one in a tree-like generation chain. To obtain the structure of those children, we provide the resource-children procedure:

(resource-children res-id) \(\rightarrow \) list

which will produce a list of lists storing the parent-to-children relations. As an example, if you build an optiml from a dataset built from an existing source and ask for the source's children, the result will be:

(resource-children "source/12121232123123123123123") ;; =>

["source/12121232123123123123123"
 ["dataset/5f921b062275c111e903394d" ["optiml/5f921b08ff0f14111001083a"]]]

4.12.5 Error reporting

All resource-related procedures can raise exceptions when the BigML API services report an error in fulfilling the request. These errors are reported by WhizzML by raising, as usual, a map that always has the keys “message” and “code”, the latter always being the error code -50. In addition, full details of the error reported by the API, including its code as listed in the BigML API documentation and the associated HTTP status and extra information, are reported as a map under the key “cause”. Here’s an example of the error map for a malformed source creation request:

{"code" -50
 "message" "Error computing primitive operation 'create': Bad request: "
 "instruction" {"source" {"lines" [1 1]
                "columns" [0 34]}
                "instruction" "apply"}
 "cause" {"code" -1204
          "http_status" 400
          "extra" ["Data or remote arguments are missing"]}}

Errors during resource handling are treated uniformly using exceptions. That means that whenever you try to use a resource whose status is neither in-progress nor finished, the primitives will raise an error and, unless you are using an error handler (see section 2.10), execution of your program will stop.

4.12.6 Listing resources

(resources str-type [map-of-options]) \(\rightarrow \) list of map

resources asks the BigML service for a list of available resources belonging to the caller, returned in the form of a list of maps representing resource metadata.

The first argument of resources is a string naming the type of the resources to be listed, and the optional map-of-options is a map containing the key/values that you would use in BigML’s API query string to filter the returned list of the corresponding resource type (that is, essentially, the list that you obtain in JSON via the API as the value of the “objects” key).

For instance, you can paginate over all of your sources with a snippet of the form:

(define (process-source src) ....)

(loop (offset 0)
 (let (srcs (resources "source" {"offset" offset "limit" 10}))
   (map process-source srcs)
   (when (not (empty? srcs))
     (recur (+ offset (count srcs))))))

Although it is rather trivial to extract a list of identifiers from a list of resource maps, we define it as part of the standard library:

(resource-ids list-of-maps) \(\rightarrow \) list of string

resource-ids is implemented in pure WhizzML as a variation of

(define (resource-ids resources)
    (map resource-id resources))

that incorporates a bit more error checking.

For convenience, we define a list function for each resource type:

(list-sources [map-of-options]) \(\rightarrow \) list of map
(list-datasets [map-of-options]) \(\rightarrow \) list of map
(list-models [map-of-options]) \(\rightarrow \) list of map
(list-composites [map-of-options]) \(\rightarrow \) list of map
(list-fusions [map-of-options]) \(\rightarrow \) list of map
(list-optimls [map-of-options]) \(\rightarrow \) list of map
(list-ensembles [map-of-options]) \(\rightarrow \) list of map
(list-predictions [map-of-options]) \(\rightarrow \) list of map
(list-batchpredictions [map-of-options]) \(\rightarrow \) list of map
(list-evaluations [map-of-options]) \(\rightarrow \) list of map
(list-anomalies [map-of-options]) \(\rightarrow \) list of map
(list-anomalyscores [map-of-options]) \(\rightarrow \) list of map
(list-batchanomalyscores [map-of-options]) \(\rightarrow \) list of map
(list-clusters [map-of-options]) \(\rightarrow \) list of map
(list-centroids [map-of-options]) \(\rightarrow \) list of map
(list-batchcentroids [map-of-options]) \(\rightarrow \) list of map
(list-associations [map-of-options]) \(\rightarrow \) list of map
(list-associationsets [map-of-options]) \(\rightarrow \) list of map
(list-linearregressions [map-of-options]) \(\rightarrow \) list of map
(list-logisticregressions [map-of-options]) \(\rightarrow \) list of map
(list-correlations [map-of-options]) \(\rightarrow \) list of map
(list-statisticaltests [map-of-options]) \(\rightarrow \) list of map
(list-topicmodels [map-of-options]) \(\rightarrow \) list of map
(list-topicdistributions [map-of-options]) \(\rightarrow \) list of map
(list-batchtopicdistributions [map-of-options]) \(\rightarrow \) list of map
(list-deepnets [map-of-options]) \(\rightarrow \) list of map
(list-timeseriess [map-of-options]) \(\rightarrow \) list of map
(list-forecasts [map-of-options]) \(\rightarrow \) list of map
(list-pcas [map-of-options]) \(\rightarrow \) list of map
(list-projections [map-of-options]) \(\rightarrow \) list of map
(list-batchprojections [map-of-options]) \(\rightarrow \) list of map
(list-projects [map-of-options]) \(\rightarrow \) list of map
(list-configurations [map-of-options]) \(\rightarrow \) list of map
(list-samples [map-of-options]) \(\rightarrow \) list of map
(list-libraries [map-of-options]) \(\rightarrow \) list of map
(list-scripts [map-of-options]) \(\rightarrow \) list of map
(list-executions [map-of-options]) \(\rightarrow \) list of map
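For example, the identifiers of up to ten of your datasets can be collected by combining a specialized list call with resource-ids:

(resource-ids (list-datasets {"limit" 10}))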

4.12.7 Creating resources

(create str-type map-of-options) \(\rightarrow \) resource-id
(create str-type res-parent [map-of-options]) \(\rightarrow \) resource-id
(create str-type res-parent res-parent-2 [map-of-options]) \(\rightarrow \) resource-id

In create calls, str-type can be any of the supported resource types listed in subsection 4.12.1 and the options map accepts the same keys and values as the JSON body of an API call to create the respective resource. For instance, a call to create a remote source could be as simple as:

(create "source" {"remote" "https://static.bigml.com/csv/iris.csv"})

while for a source named “test source” with a brief description and explicit source parser we would write:

(create "source" {"remote" "https://static.bigml.com/csv/iris.csv"
                  "name" "test source"
                  "description" "powered by whizzml"
                  "source_parser" {"separator" ";"
                                   "header" false}})

create just launches resource creation and doesn’t wait for its completion. It returns the new resource identifier, as a string. Typically, you will bind that identifier to a variable for later use:

(define src-id (create "source" {"remote" "s3://bucket.com/data.csv"}))
(define ds-id (create "dataset" {"source" src-id}))

There are two other forms of create taking, in addition to an optional options map, either one or two resource identifiers which will be the parent or origin of the newly created resource. For instance, the parent of a model is a dataset, the parent of a dataset can be either a source or another dataset (when creating subsamplings or filtering), and the parents of an evaluation or batchprediction are a dataset and a model. Table 4.1 shows the full list of possible parent resources (resources not appearing in the table don’t have any valid parent). Some examples:

(create "dataset" "source/121212321231231231231233")
(create "dataset" "dataset/a212b2321231231231231233" {"sample_rate" 0.7})
(create "evaluation"
        "dataset/a212b2321231231231231233"
        "model/562fe2d8636e1c5ec500688c")
(create "batchcentroid"
        "dataset/a212b2321231231231231233"
        "cluster/562fe2d8636e1c5ec500688c"
        {"name" "test"})

Created resource         Parent resource                           Second parent resource
dataset                  dataset, source
sample                   dataset
model                    dataset
ensemble                 dataset
optiml                   dataset
linearregression         dataset
logisticregression       dataset
timeseries               dataset
forecast                 timeseries
deepnet                  dataset
pca                      dataset
batchprojection          pca                                       dataset
projection               pca
prediction               fusion, model, deepnet, timeseries,
                         ensemble, logisticregression,
                         linearregression
evaluation               fusion, model, deepnet, timeseries,       dataset
                         ensemble, logisticregression,
                         linearregression
batchprediction          fusion, model, deepnet, ensemble,         dataset
                         logisticregression, linearregression
cluster                  dataset
centroid                 cluster
batchcentroid            cluster                                   dataset
anomaly                  dataset
anomalyscore             anomaly
batchanomalyscore        anomaly                                   dataset
association              dataset
associationset           association
statisticaltest          dataset
correlation              dataset
topicmodel               dataset
topicdistribution        topicmodel
batchtopicdistribution   topicmodel                                dataset
execution                script

Table 4.1 Possible parent resources in calls to create

For convenience, the standard library offers a method, create*, which will create a list of resources in parallel, without waiting for completion:

(create* list-of-types list-of-options) \(\rightarrow \) list of resource-id

For example:

(create* ["source" "source"]
         [{"remote" "http://url/1"} {"remote" "http://url/2"}])

If all resources to be created are of the same type, you may pass a single string for the first parameter, which will be duplicated implicitly. Thus, the following is equivalent to the above call:

(create* "source" [{"remote" "http://url/1"} {"remote" "http://url/2"}])

Instead of providing a map of options, you can also use a parent resource identifier when it is unique, mixing parents and options maps as needed, as in the following example:

(create* ["source" "dataset" "model"]
         [{"remote" "http://static.bigml.com/csv/iris.csv"}
          "source/121212321231231231231233"
          "dataset/a212b2321231231231231233"])

The standard library also provides convenience procedures for creation of specific resource types, for each of the ways the basic create primitive can be invoked. Thus, there is a collection of procedures for creating resources given either a single options map, a parent resource identifier and an optional options map, or (for those resources created from two parents, see Table 4.1) two parent resource identifiers plus an optional options map:

(create-source [res-id res-id-2 map-of-options]) \(\rightarrow \) source-id
(create-dataset [res-id res-id-2 map-of-options]) \(\rightarrow \) dataset-id
(create-model [res-id res-id-2 map-of-options]) \(\rightarrow \) model-id
(create-composite [res-id res-id-2 map-of-options]) \(\rightarrow \) composite-id
(create-fusion [res-id res-id-2 map-of-options]) \(\rightarrow \) fusion-id
(create-optiml [res-id res-id-2 map-of-options]) \(\rightarrow \) optiml-id
(create-ensemble [res-id res-id-2 map-of-options]) \(\rightarrow \) ensemble-id
(create-prediction [res-id res-id-2 map-of-options]) \(\rightarrow \) prediction-id
(create-batchprediction [res-id res-id-2 map-of-options]) \(\rightarrow \) batchprediction-id
(create-evaluation [res-id res-id-2 map-of-options]) \(\rightarrow \) evaluation-id
(create-anomaly [res-id res-id-2 map-of-options]) \(\rightarrow \) anomaly-id
(create-anomalyscore [res-id res-id-2 map-of-options]) \(\rightarrow \) anomalyscore-id
(create-batchanomalyscore [res-id res-id-2 map-of-options]) \(\rightarrow \) batchanomalyscore-id
(create-cluster [res-id res-id-2 map-of-options]) \(\rightarrow \) cluster-id
(create-centroid [res-id res-id-2 map-of-options]) \(\rightarrow \) centroid-id
(create-batchcentroid [res-id res-id-2 map-of-options]) \(\rightarrow \) batchcentroid-id
(create-association [res-id res-id-2 map-of-options]) \(\rightarrow \) association-id
(create-associationset [res-id res-id-2 map-of-options]) \(\rightarrow \) associationset-id
(create-linearregression [res-id res-id-2 map-of-options]) \(\rightarrow \) linearregression-id
(create-logisticregression [res-id res-id-2 map-of-options]) \(\rightarrow \) logisticregression-id
(create-correlation [res-id res-id-2 map-of-options]) \(\rightarrow \) correlation-id
(create-statisticaltest [res-id res-id-2 map-of-options]) \(\rightarrow \) statisticaltest-id
(create-topicmodel [res-id res-id-2 map-of-options]) \(\rightarrow \) topicmodel-id
(create-topicdistribution [res-id res-id-2 map-of-options]) \(\rightarrow \) topicdistribution-id
(create-batchtopicdistribution [res-id res-id-2 map-of-options]) \(\rightarrow \) batchtopicdistribution-id
(create-deepnet [res-id res-id-2 map-of-options]) \(\rightarrow \) deepnet-id
(create-timeseries [res-id res-id-2 map-of-options]) \(\rightarrow \) timeseries-id
(create-forecast [res-id res-id-2 map-of-options]) \(\rightarrow \) forecast-id
(create-pca [res-id res-id-2 map-of-options]) \(\rightarrow \) pca-id
(create-projection [res-id res-id-2 map-of-options]) \(\rightarrow \) projection-id
(create-batchprojection [res-id res-id-2 map-of-options]) \(\rightarrow \) batchprojection-id
(create-project [res-id res-id-2 map-of-options]) \(\rightarrow \) project-id
(create-configuration [res-id res-id-2 map-of-options]) \(\rightarrow \) configuration-id
(create-sample [res-id res-id-2 map-of-options]) \(\rightarrow \) sample-id
(create-library [res-id res-id-2 map-of-options]) \(\rightarrow \) library-id
(create-script [res-id res-id-2 map-of-options]) \(\rightarrow \) script-id
(create-execution [res-id res-id-2 map-of-options]) \(\rightarrow \) execution-id

Using the resource-specific create procedures is thus just a matter of directly translating the corresponding calls to the basic create primitive:

(create-source {"remote" "s3://bucket.com/data.csv"})
(create-dataset "source/121212321231231231231233")
(create-ensemble "dataset/ababab32ab3ab312312f1233"
                 {"number_of_models" 13})
(create-prediction "model/562fe2d8636e1c5ec500688c"
                   {"input_data" {"age" 23}})
(create-batchanomalyscore "anomaly/fffe2d8636e1c5ec50069ac"
                          "dataset/a212b2321231231231231233")

All create procedures implicitly wait for their parent resources to finish, without the need for explicit calls to wait (see subsection 4.12.8 below) in your code. Thus, although when you call, say,

(create-source {"remote" "s3://bucket.com/data.csv"})

the given source is just queued for creation when create-source returns its identifier, that identifier can be immediately used in other create calls, and the WhizzML runtime will make sure that all parent resources are finished before starting to create their children. Thus, the following "one-click" ensemble from a source identifier is safe:

(let (src-id (create-source {"remote" "s3://bucket.com/data.csv"})
      ds-id (create-dataset src-id))
  (create-ensemble ds-id {"number_of_models" 20}))

and could even be rewritten without intermediate variables as:

(create-ensemble (create-dataset (create-source {"remote"
                                                 "s3://bucket.com/data.csv"}))
                 {"number_of_models" 20})

It’s also possible to list the identifiers of created (and not deleted) resources at any point during the execution of a WhizzML program:

(created-resources) \(\rightarrow \) list of resource-id

So, for instance, you could delete all the resources created during a script execution with the following expression in your source:

(for (id (created-resources)) (delete id))

Some batch resources can create an additional dataset, whose identifier is always found in the output_dataset_resource property. To obtain a list of created resources that also includes those datasets you can use:

(created-resources*) \(\rightarrow \) list of resource-id

To identify resources containing an associated dataset, use:

(batch-resource-types) \(\rightarrow \) list of resource-type

This standard procedure returns a list of resource types, all of which have an optional dataset associated with them. Typical examples are “batchprediction” or “batchprojection”.
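Combining these procedures, one could delete only the batch resources created so far. This is just a sketch, assuming the standard member? list predicate:

(let (batch-types (batch-resource-types))
  (for (id (created-resources))
    (when (member? (resource-type id) batch-types)
      (delete id))))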

4.12.8 Waiting for resource completion

Resources created by the create family of functions evolve from state 1 (queued) to state 5 (finished) or -1 (faulty). The wait and wait* procedures block until the resource’s status is 5, returning its identifier, or signal an error if it reaches -1.

(wait res [int-timeout]) \(\rightarrow \) resource-id
(wait* list-of-res-id) \(\rightarrow \) list of resource-id

wait returns res as soon as the resource reaches its finished status or the (optional) timeout expires. If you want to wait forever, don’t pass any timeout to wait.

The standard procedure wait* just waits in turn for each of the resources in list-of-res-id, and returns the list of resources upon completion. It can be defined in pure WhizzML simply as:

(define (wait* ids) (map wait ids))

If any resource enters a failed state while waiting (or is failed right away), the waiting functions signal error code -50.

Note that, frequently, you will not need to explicitly call wait on resources that are going to be used to create other resources, since, as explained in subsection 4.12.7, the creation primitives implicitly wait for parent resource completion. The most common use cases for explicit wait or wait* calls are just before calling fetch to access the metadata of a resource (for instance, when you want to use the histogram of a dataset’s field and therefore need to make sure the dataset creation is finished) and when assigning a script’s outputs (to ensure they are usable immediately after the script execution finishes).
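Since wait returns the resource identifier, it can be chained directly with fetch. For example, assuming ds-id holds a dataset identifier:

(let (ds (fetch (wait ds-id)))
  ;; ds is the full dataset map, guaranteed to be finished
  (ds ["status" "code"]))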

4.12.9 Creating and waiting for resource completion in one call

We offer convenience procedures that create a resource and wait until it is either finished or in error. The generic procedure is called create-and-wait, and takes the resource type and a map of creation parameters as arguments:

(create-and-wait str-type map-of-options) \(\rightarrow \) resource-id

where the arguments have the same meaning as for create (see subsection 4.12.7 ).

As with create* above, there is an equivalent method, create-and-wait*, to create a list of resources in parallel and wait for them all to complete.

(create-and-wait* list-of-types list-of-options) \(\rightarrow \) list of resource-id

If the creation of all resources completes successfully, the procedure returns a list of resource ids. If not, the procedure attempts to delete all resources in the list, completed or not, and raises an error with code -60 and the id of the first failed resource.
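For instance, a dataset can be created and waited for in one step, or two sources created in parallel and waited for together (src-id is assumed to hold an existing source identifier):

(create-and-wait "dataset" {"source" src-id})
(create-and-wait* ["source" "source"]
                  [{"remote" "http://url/1"} {"remote" "http://url/2"}])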

There are also specific versions of create and wait for each resource type, each taking as their single argument a map specifying the creation parameters:

(create-and-wait-source map-of-options) \(\rightarrow \) source-id
(create-and-wait-dataset map-of-options) \(\rightarrow \) dataset-id
(create-and-wait-model map-of-options) \(\rightarrow \) model-id
(create-and-wait-composite map-of-options) \(\rightarrow \) composite-id
(create-and-wait-fusion map-of-options) \(\rightarrow \) fusion-id
(create-and-wait-optiml map-of-options) \(\rightarrow \) optiml-id
(create-and-wait-ensemble map-of-options) \(\rightarrow \) ensemble-id
(create-and-wait-prediction map-of-options) \(\rightarrow \) prediction-id
(create-and-wait-batchprediction map-of-options) \(\rightarrow \) batchprediction-id
(create-and-wait-evaluation map-of-options) \(\rightarrow \) evaluation-id
(create-and-wait-anomaly map-of-options) \(\rightarrow \) anomaly-id
(create-and-wait-anomalyscore map-of-options) \(\rightarrow \) anomalyscore-id
(create-and-wait-batchanomalyscore map-of-options) \(\rightarrow \) batchanomalyscore-id
(create-and-wait-cluster map-of-options) \(\rightarrow \) cluster-id
(create-and-wait-centroid map-of-options) \(\rightarrow \) centroid-id
(create-and-wait-batchcentroid map-of-options) \(\rightarrow \) batchcentroid-id
(create-and-wait-association map-of-options) \(\rightarrow \) association-id
(create-and-wait-associationset map-of-options) \(\rightarrow \) associationset-id
(create-and-wait-linearregression map-of-options) \(\rightarrow \) linearregression-id
(create-and-wait-logisticregression map-of-options) \(\rightarrow \) logisticregression-id
(create-and-wait-correlation map-of-options) \(\rightarrow \) correlation-id
(create-and-wait-statisticaltest map-of-options) \(\rightarrow \) statisticaltest-id
(create-and-wait-topicmodel map-of-options) \(\rightarrow \) topicmodel-id
(create-and-wait-topicdistribution map-of-options) \(\rightarrow \) topicdistribution-id
(create-and-wait-batchtopicdistribution map-of-options) \(\rightarrow \) batchtopicdistribution-id
(create-and-wait-deepnet map-of-options) \(\rightarrow \) deepnet-id
(create-and-wait-timeseries map-of-options) \(\rightarrow \) timeseries-id
(create-and-wait-forecast map-of-options) \(\rightarrow \) forecast-id
(create-and-wait-pca map-of-options) \(\rightarrow \) pca-id
(create-and-wait-projection map-of-options) \(\rightarrow \) projection-id
(create-and-wait-batchprojection map-of-options) \(\rightarrow \) batchprojection-id
(create-and-wait-project map-of-options) \(\rightarrow \) project-id
(create-and-wait-configuration map-of-options) \(\rightarrow \) configuration-id
(create-and-wait-sample map-of-options) \(\rightarrow \) sample-id
(create-and-wait-library map-of-options) \(\rightarrow \) library-id
(create-and-wait-script map-of-options) \(\rightarrow \) script-id
(create-and-wait-execution map-of-options) \(\rightarrow \) execution-id

4.12.10 Fetching resources

The fetch call, which takes a resource identifier, retrieves the full resource metadata in its current status.

(fetch res [map-of-options]) \(\rightarrow \) resource map

The optional map-of-options argument is a map with any desired key/values to use in the HTTP GET requests used to fetch the resource. Typical parameters are fields filters, as in the following example:

(fetch "source/1212222343556aa343433"
       {"fields" "000000,00000a" "offset" 10})

4.12.11 Updating resources

To update an existing resource given its id and a map describing the changes to apply (again, with the key/values that you would use in a regular API call), use:

(update res map) \(\rightarrow \) resource-id
(update-and-wait res map) \(\rightarrow \) resource-id

The update primitive makes sure the requested resource is finished (waiting for it if necessary) and requests from the server the given update, specified by means of map. The procedure immediately returns the resource identifier if the server has accepted the update request, signaling an error with code -50 if the server cannot be contacted or refuses the request.

Resource updates are generally an asynchronous operation in BigML, so you will sometimes want to wait on an updated resource (see subsection 4.12.8 ) in order to see the change you just requested in a fetch call: the built-in update-and-wait will do that in a single step, and it could be implemented in pure WhizzML as:

(define (update-and-wait id params)
    (wait (update id params)))

Note however that you do not need explicit calls to wait or update-and-wait in order to use an updated resource as the parent of another one (see also subsection 4.12.7 and subsection 4.12.8 ), since the corresponding create call will implicitly wait for you. Thus, for instance, in the following call:

(create-model (update ds-id {"objective_field" {"id" "000001"}}))

it is guaranteed that the model will be created using "000001" as its objective field (in other words, the update operation is started and completed before create-model starts, despite the fact that it is asynchronous).

4.12.12 Deleting Resources

(delete res [map]) \(\rightarrow \) boolean
(delete* list-of-res-id) \(\rightarrow \) list of boolean

The delete function deletes any resource type from your account. On success, delete returns true. There are a few cases where a delete request may be accompanied by options (which in the API appear in the request’s query string). For instance, when deleting executions, one can request the deletion of their child resources by setting delete_all to true. For those cases, delete accepts an optional map argument, map. delete* iterates over the given list of resource identifiers, deleting all of them and returning a list of success flags.

Examples:

(delete "sample/57a3c4da58a27e5803005880")
(delete "execution/57abf210eb3273117e000000" {"delete_all" true})

4.12.13 Field procedures

The standard library includes some helper procedures to aid in the manipulation of field maps and individual fields, as described below.

Field descriptors are present in many BigML resources, usually as a map under the key “fields”, and play an important role in most workflows. Each field is identified by a unique identifier (usually, a key in a fields map) and is described as a map with keys such as “name”, “optype” or “summary”.

Fields map retrieval

To extract the fields map from a resource we can use:

(resource-fields res-or-id) \(\rightarrow \) map

This procedure takes either a resource identifier or a full resource map (such as the ones returned by fetch) and extracts from it its map of fields. If the given argument is not of the correct type, an empty map is returned. For convenience, all the values in the returned map contain the key "id" with the corresponding field identifier.

Once we have at our disposal a fields map, a very common operation is to fetch from it the descriptor of a single field. If we know the field identifier, that operation is trivial (just a map lookup), but it’s often the case that we want to perform a lookup by field name.

(find-field map-of-fields str) \(\rightarrow \) map

The standard procedure find-field takes a fields map (as returned by, e.g., resource-fields) and looks up an individual field by either its identifier or its name (passed as str). The procedure returns false if the lookup fails.
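For example, to look up a hypothetical field named "age" in a dataset's fields and read its optype (ds-id is assumed to hold a dataset identifier):

(let (fields (resource-fields ds-id)
      age (find-field fields "age"))
  (when age
    (age "optype")))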

Field properties

An individual field descriptor is a map with the field’s properties. To make sure a map value is actually a field descriptor you can use the field? predicate:

(field? map) \(\rightarrow \) boolean

This procedure makes sure that the given map has “name” and “optype” keys with valid values, so that it contains the bare minimum information describing a field.

There is a collection of predicates to check the optype of a given field:

(categorical-field? map) \(\rightarrow \) boolean
(numeric-field? map) \(\rightarrow \) boolean
(text-field? map) \(\rightarrow \) boolean
(items-field? map) \(\rightarrow \) boolean
(regions-field? map) \(\rightarrow \) boolean
(image-field? map) \(\rightarrow \) boolean
(path-field? map) \(\rightarrow \) boolean
(datetime-field? map) \(\rightarrow \) boolean

An important bit of information contained in field summaries is the field’s distribution, i.e., how the field’s values are distributed across categories, bins, items or terms, depending on their specific optype. In all cases, the distribution is represented as a list of pairs. In each pair, the first component is the value that is being counted (category, bin center, item name, term), and the second component is its count (number of instances associated to the first value). The standard procedure field-distribution gives access to that information, regardless of the field’s optype:

(field-distribution map) \(\rightarrow \) list-of-pair

As mentioned, for categorical fields this procedure returns the “categories” in the field’s “summary”; for numeric fields it retrieves either “bins” or “counts”; for text fields the key inside the summary is “tag_cloud”; and, for items fields, “items”.
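As an illustration, here is a hedged sketch that sums the counts in a field’s distribution to obtain the total number of instances counted for that field (the field name is hypothetical):

(define dist (field-distribution (find-field fields "species")))
;; Each pair is [value count]; sum the second components
(define total (reduce + 0 (map (lambda (pair) (pair 1)) dist)))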

In the case of categorical, items and text fields, it is often useful to get a list of all the first elements in the distribution, which correspond, respectively, to the list of categories, items and terms for the field. For convenience, there are predefined procedures returning directly those lists:

(field-categories map) \(\rightarrow \) list-of-string
(field-items map) \(\rightarrow \) list-of-string
(field-terms map) \(\rightarrow \) list-of-string
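For example, to check whether a categorical field includes a given category (a sketch; the field and category names are hypothetical):

(define cats (field-categories (find-field fields "species")))
(define has-setosa?
  (not (empty? (filter (lambda (c) (= c "Iris-setosa")) cats))))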

Field maps

Some resources that take as inputs collections of other resources containing fields need a mapping from one set of fields to another. For instance, when creating multi–datasets one may need to specify a mapping between fields of different input datasets; or, when making a batch prediction, we sometimes need to specify in the request which fields of the dataset to be scored correspond to the model fields. In those and other cases, a fields map is specified as a WhizzML map that maps identifiers between two sets of fields, the default being the identity map.

It is not rare to find cases where the fields match by name instead of by identifier, and we need to construct an identifiers map associating fields with the same name. E.g., given the field maps:

{"000000" {"name" "field a" ...}
   "000001" {"name" "field b" ...}}

and

{"000000" {"name" "field b" ...}
   "000001" {"name" "field a" ...}}

we would like to specify the fields map:

{"000000" "000001"
   "000001" "000000"}

This happens often enough that the standard library provides a function to compute a fields map matching by the names of fields in two input collections (maps) of fields:

(match-fields-by-name from-fields to-fields) \(\rightarrow \) map

So, for instance, in simple cases we could construct a fields map for a batch prediction from the corresponding supervised model and dataset with code along the lines of:

(match-fields-by-name ((fetch model-id) ["model" "model_fields"])
                        (resource-fields dataset-id))

4.12.14 Dataset procedures

This section describes standard procedures specific to the creation and manipulation of datasets.

Objective field

(dataset-get-objective-id dataset-id) \(\rightarrow \) string

Explores the given dataset metadata map and extracts from it the preferred objective field identifier. Some datasets have it already precomputed, and the function is then rather trivial (basically, a get-in); otherwise, a valid objective is selected from the field information, following the same algorithm as BigML’s server side.
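A common pattern is to retrieve the objective field’s descriptor right after obtaining its identifier, e.g.:

(define objective-id (dataset-get-objective-id dataset-id))
(define objective-field
  (find-field (resource-fields dataset-id) objective-id))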

Row distance

BigML defines a positive-definite metric between instances of a dataset (used, for instance, in clustering algorithms), which depends only on the properties of the dataset’s fields. The primitives row-distance and row-distance-squared provide access to that metric.

(row-distance map-of-fields map-point [map-point2 map-scales]) \(\rightarrow \) number
(row-distance-squared map-of-fields map-point [map-point2 map-scales]) \(\rightarrow \) number

The squared version is provided for convenience, as it’s computationally more efficient, and squared distances are used directly in many cases. All arguments are maps with field identifiers as keys. The parameter map-of-fields gives, for each field identifier, its descriptor map (as found in any BigML resource under the “fields” key); map-point and map-point2 are maps from field identifier to field value, and each one therefore defines a dataset instance. Finally, map-scales associates to each field a numerical scale to be used by the metric during the computation (that allows weighting of the individual dimensions involved in the distance computation).
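A minimal sketch, with hypothetical field identifiers and values, that doubles the weight of the first dimension via map-scales:

(define fields (resource-fields dataset-id))
(define point1 {"000000" 5.1 "000001" "Iris-setosa"})
(define point2 {"000000" 4.9 "000001" "Iris-versicolor"})
(define scales {"000000" 2 "000001" 1})
(define d (row-distance fields point1 point2 scales))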

Dataset splits

Splitting a dataset into two disjoint parts is a common operation, used for instance to separate a testing subset of our input data for evaluation purposes.

The WhizzML standard library provides two procedures for creating dataset splits:

(create-dataset-split dataset-id rate seed [first-options second-options]) \(\rightarrow \) list-of-dataset-id
(create-random-dataset-split dataset-id rate [first-options second-options]) \(\rightarrow \) list-of-dataset-id

To create a split, create-dataset-split needs an input dataset, given by its identifier; a sampling rate (a number between 0.0 and 1.0), which indicates the portion of the dataset that is sampled into the first part (so it will be composed of \(N * \mathrm{rate}\) instances, where \(N\) is the total number of instances in the input dataset, while the second part will be composed of those instances not in the first, and therefore have \(N * (1 - \mathrm{rate})\) instances); and a seed used to initialize the random number generator that selects instances. If you pass the same seed to two calls to create-dataset-split you’ll obtain identical results. For convenience, the standard library includes create-random-dataset-split, which picks a random seed for you, and is simply defined as:

(define (create-random-dataset-split dataset-id rate)
    (create-dataset-split dataset-id rate (str (rand-int 100000))))

Both procedures return a list of two elements, namely the identifiers of the datasets containing, respectively, the first and second parts of the instances in the input dataset.
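For instance, a reproducible 80/20 train/test split (the seed string is arbitrary) could be created with:

(define ids (create-dataset-split dataset-id 0.8 "my-seed"))
(define train-id (ids 0))
(define test-id (ids 1))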

Both procedures also take two optional maps, first-options and second-options, with options that will be passed to the dataset creation calls for each half of the split. For instance, if you want the first dataset in a split to be called “First half” and the second “Second half”, you would use something like:

(create-random-dataset-split "dataset/a212b2321231231231231233"
                               0.8
                               {"name" "First half"}
                               {"name" "Second half"})

Dataset merges

Multiple datasets can be merged into one in BigML simply by passing an "origin_datasets" list to create-dataset, with the only limitation that there is a maximum number of accepted datasets 2 . To work around that limitation, and to perform the merge in as parallel a way as possible, we provide the merge-datasets primitive:

(merge-datasets list-of-datasets [map-params]) \(\rightarrow \) dataset-id

The list of datasets can contain either dataset identifiers, or maps specifying the id and additional properties such as field maps, and any additional arguments passed as map-params will be used in all internal dataset creation requests. Here’s an example of a script concatenating a dataset a hundred times, starting from an inline source:

(define data "a,b,c,1\na,b,c,2\nb,c,d,3\nb,b,c,3\na,a,a,4")
(define src-id (create-source {"data" data}))
(define ds-id (create-dataset src-id))

(define name "whizzml-test")
(define mds (merge-datasets (repeat 100 {"id" ds-id}) {"name" name}))

If we wanted to juxtapose instead of concatenate, we could write

(define mds (merge-datasets (repeat 100 {"id" ds-id})
                              {"name" name "juxtapose" true}))

instead.

4.12.15 Execution procedures

This section describes auxiliary procedures that can be helpful when using the results of an execution as inputs for other executions.

Executions can be used in scripts like any other resource. It’s a common practice to use the outputs of an execution as values in a script. To help with that, the following procedures might be handy:

(execution-inputs execution-id [list-of-names]) \(\rightarrow \) list
(execution-outputs execution-id [list-of-names]) \(\rightarrow \) list
(execution-output-resources execution-id [list-of-names]) \(\rightarrow \) list
(execution-sources execution-id) \(\rightarrow \) list
(execution-logs execution-id) \(\rightarrow \) list

The mandatory argument for all these procedures is the execution-id of the execution that stores the information. In addition, some of them accept as a second optional argument a list of names that filters the items included in the returned list:

  • execution-inputs returns the list of values used as input arguments in the execution (filtered by argument name, if the second argument is given).

  • execution-outputs returns the list of outputs (filtered by output name, if the second argument is given).

  • execution-output-resources returns the list of BigML resources created in the execution and the variables they were assigned to, if any; if the second argument is given, the list is filtered by variable name.

  • execution-sources returns a list of the scripts being executed and their dependencies, if any. The index of this list is used as a reference in the execution call stack and in the location information for errors.

  • execution-logs returns the lines logged to the console.

Using as example a script with this simple code:

(log-info "That's foo: " foo)
(log-info "Here's bar: " bar)
(define sources (for (i (range bar))
                  (create-source {"remote"
                                  "https://static.bigml.com/csv/iris.csv"})))
(define source1 (sources 0))
(define source2 (sources 1))
(define division (/ foo bar))

these would be the outputs of each of the procedures described above:

(execution-inputs "execution/5d5c4a0eeba31d6280001ee2")
    ;; => [["bar" 2] ["foo" 6]]
  (execution-inputs "execution/5d5c4a0eeba31d6280001ee2" ["bar"])
    ;; => [["bar" 2]]
  (execution-outputs "execution/5d5c4a0eeba31d6280001ee2")
    ;; => [["division" 3.0 "Number"]
    ;;     ["sources"
    ;;     ["source/5d5c4a0ec5f953036601bd0b" "source/5d5c4a0ec5f953036e0333fb"]
    ;;      "list"]]
  (execution-output-resources "execution/5d5c578042129f7dfc00339d" ["source2"])
    ;; => [{"code" 5
    ;;      "id" "source/5d5c5780c5f953036d00ae11"
    ;;      "last_update" 1566332800805
    ;;      "progress" 1.0
    ;;      "state" "finished"
    ;;      "task" "Done"
    ;;      "variable" "source2"}]
  (execution-sources "execution/5d5c4a0eeba31d6280001ee2")
    ;; => [["script/5d5c4a04eba31d6280001edf" ""]]
  (execution-logs "execution/5d5c4a0eeba31d6280001ee2")
    ;; => [["info" "2019-08-20T19:29:18.239Z" 0 1 "That's foo: 6"]
    ;;     ["info" "2019-08-20T19:29:18.239Z" 0 2 "Here's bar: 2"]]
1. In practical terms, the server won’t let you block indefinitely in these calls, because your script will have a maximum running time.

2. 32 as of this writing.