Organizations with the BigML Dashboard
Glossary
Admin a role assigned to a user in a given organization. If you are an admin, you can invite new users to the organization and assign roles to those users. You can also create new projects in the organization, and fully access public and private projects
Aggregation the process of gathering information and representing it as a summary
Artificial Intelligence the science that focuses on making intelligent machines, especially intelligent computer programs
AIC (Akaike Information Criterion) measures the trade-off between the model goodness-of-fit and the model complexity to avoid overfitting. The lower the AIC, the better
AICc (Corrected Akaike Information Criterion) measures the same trade-off as the AIC, but it is more accurate at avoiding overfitting for small datasets. For larger datasets it yields the same results as the AIC
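For reference, the standard textbook definitions (BigML's exact computation may differ), with k the number of model parameters, n the number of instances, and L̂ the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
```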
Algorithm the procedures or formulas used for solving a problem
Anomaly Score an anomaly detector assigns an anomaly score to each instance of the input dataset. Additionally, you can use an anomaly detector to calculate the anomaly score of new data instances. An anomaly score is a percentage between 0% and 100%, with higher scores indicating a higher degree of anomaly
Anomaly Detection an unsupervised Machine Learning task which identifies instances in a dataset that do not conform to a regular pattern
Antecedent the left-hand-side itemset of an association rule
Argument a configurable parameter for any BigML resource
Association Discovery an unsupervised Machine Learning task that finds relationships between values in high-dimensional datasets. It is commonly used for market basket analysis
Bagging an ensemble-based algorithm that uses a random subset of instances to generate each single tree
Batch Anomaly Score after you create an anomaly detector, you can use this option to score multiple instances at once
Batch Topic Distribution a batch topic distribution is created using a topic model and a dataset containing the instances (input data) for which you wish to obtain the topic probabilities
BIC (Schwarz Bayesian Information Criterion) measures the same trade-off as the AIC between model goodness-of-fit and model complexity, but it penalizes complexity more heavily to reduce the risk of overfitting. The lower the BIC, the better
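The corresponding textbook definition (again, BigML's exact computation may differ); the k ln n term grows with dataset size, which is why the complexity penalty is heavier than the AIC's constant factor of 2:

```latex
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```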
Centroids the center of a cluster found by a clustering algorithm. Centroids are computed using the mean for numeric fields and the mode for categorical fields. For text and items fields, the algorithm selects the value that minimizes the average cosine distance between the centroid and the points in its neighborhood
Cleansing the process of detecting and improving incorrect or incomplete parts of your raw data
Classification a modeling task whose objective field (i.e., the field being predicted) is categorical, so the model predicts classes
Clustering an unsupervised Machine Learning task in which dataset instances are grouped into geometrically related subsets
Confidence an indicator of the prediction’s certainty for classification models and ensembles. It takes into account the class distribution and the number of instances at a certain node. It is a value between 0% and 100%
Confidence (Associations) the ratio of the number of instances containing both the antecedent and the consequent to the number of instances containing only the antecedent
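In the usual association-rule notation, with A the antecedent itemset and C the consequent itemset:

```latex
\mathrm{confidence}(A \Rightarrow C) = \frac{\mathrm{support}(A \cup C)}{\mathrm{support}(A)}
```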
Consequent the right-hand-side itemset of an association rule
Correlation a BigML resource which computes advanced statistics for the fields in your dataset by applying various exploratory data analysis techniques to compare the distributions of the fields in your dataset against an objective field
Coverage the support of the antecedent of an association rule, i.e., the portion of instances in the dataset which contain the antecedent itemset
K-fold cross-validation k-fold cross-validation automatically splits your dataset into k complementary subsets. One of the subsets is used for evaluation while the remaining k - 1 subsets are used for training the model. This process is performed k times, each time using different parts of the data for training and testing, finally yielding k different models and k evaluations. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the potential error derived from randomly selecting an overly optimistic testing dataset
Monte Carlo cross-validation Monte Carlo cross-validation randomly splits the dataset into training and test subsets n times. For each split, the model is built using the training data and evaluated using the test data. The advantage of this method (over k-fold cross-validation) is that the proportion of the training and validation subsets does not depend on the number of iterations (folds). The disadvantage is that some subsets may overlap while some instances may not be selected in any subset. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the potential error derived from randomly selecting an overly optimistic testing dataset
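A minimal sketch contrasting the two splitting schemes, using scikit-learn's splitters for illustration only (this is not how BigML performs cross-validation internally):

```python
# Illustrative comparison of k-fold vs. Monte Carlo cross-validation splits.
from sklearn.model_selection import KFold, ShuffleSplit

data = list(range(20))  # stand-in for dataset row indices

# k-fold: k complementary, non-overlapping test subsets
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Monte Carlo: n independent random train/test splits; test sets may overlap
monte_carlo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for splitter in (kfold, monte_carlo):
    for train_idx, test_idx in splitter.split(data):
        pass  # train on data[train_idx], evaluate on data[test_idx]
```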
Damping parameter a configurable parameter of time series models that “dampens” the trend to a flat line at some point in the future so the trend does not grow indefinitely
Dashboard The BigML web-based interface that helps you privately navigate, visualize, and interact with your modeling resources
Dataset the structured version of a BigML source. It is used as input to build your predictive models. For each field in your dataset a number of basic statistics (min, max, mean, etc.) are parsed and produced as output
Decision Trees a class of Machine Learning algorithms used to solve regression and classification problems. Decision trees are composed of nodes and branches that create a model of decisions with a tree graph. Nodes represent the predictors or labels that have an influence on the predictive path, and the branches represent the rules followed by the algorithm to make a given prediction
Deepnets an optimized implementation of deep neural networks, a class of supervised learning algorithms that can be used to solve regression and classification problems. The input features are fed to one or several groups of “nodes”; each group of nodes forms a “layer”. Each node is essentially a function of the input that transforms the input features into another value or collection of values. This process continues layer by layer until we reach the final output (prediction): an array of per-class probabilities for classification problems, or a single real value for regression problems
Denormalizing (also denormalization) the strategy of increasing the performance of a dataset by grouping data
Development Mode a free operational mode (which we also call dev mode) that lets you perform any task with BigML at no cost, provided the task size does not exceed 16 MB
Discretization the process of transforming a numeric field into a categorical field
Early split when the dataset is bigger than 34 GB, BigML automatically takes a sample of your data and performs an early split so that model creation becomes significantly faster. It detects when an early split is safe by calculating the summary statistics collected at each node. Early splitting requires that the training data be shuffled beforehand to avoid generating inaccurate models caused by ordered fields in the input rows (as it will process the first x instances, then the next x ones, and so on)
Ensembles a class of Machine Learning algorithms in which multiple independent classifiers or regressors are trained, and the combination of these classifiers is used to predict an objective field. An ensemble of models built on samples of the data can become a powerful predictor by averaging away the errors of each individual model
Entity the object or subject of interest in your modeling task. A dataset is a collection of instances of the entity of interest
Error represents the unpredictable variations in the time series data, and how they influence observed values. The error is one of the three components of time series models along with the trend and the seasonality
Evaluating the process of computing and estimating the performance of your model. Different performance measures are computed depending on the type of model (classification vs. regression models)
Evaluation a resource representing an assessment of the performance of a predictive model
Expected error an indicator of the prediction’s certainty for regression models and ensembles. It is the average of the prediction errors at a certain node
Feature Engineering the process of generating new features for a dataset so that Machine Learning algorithms will be more effective on that data. The features can either be transformations of existing features or entirely new information
Field an attribute of each instance in your data. Also called "feature", "covariate", or "predictor". Each field is associated with a type (numeric, categorical, text, items, or date-time)
Flatline a domain-specific lisp-like language that allows you to perform an infinite number of operations to create new fields or filter your BigML datasets. Furthermore, with the Flatline Editor you will be able to validate your Flatline expressions and preview the results from your Dashboard
Forecast the prediction of a time series model for future data points
Fusions a supervised model that solves classification and regression problems by averaging the predictions of multiple models, ensembles, logistic regressions, and/or deepnets. Fusions are based on the same “wisdom of the crowds” principle as ensembles, under which the combination of multiple models often performs better than any of its individual members.
BigML Gallery a section of BigML to share, buy or sell datasets, models, and scripts.
Gaussian distribution a symmetric probability distribution in which the majority of the mass is clustered about a mean value and values increasingly far from the mean in either direction are increasingly unlikely. Also called the normal distribution or the bell curve
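For reference, the Gaussian density with mean μ and standard deviation σ:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```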
G-means a clustering algorithm that tries to learn the number of different clusters by iteratively taking existing clusters and testing whether each cluster’s neighborhood follows a Gaussian distribution
Histogram a bar chart-style visualization of a collection of values, in which the range of the values is broken up into a collection of ranges, and the height of a given bar increases as more points fall into the range associated with that bar
Isolation Forest the algorithm used to detect anomalies. This algorithm uses an ensemble of randomized trees to generate anomaly scores
Input the value used to parameterize WhizzML scripts, provided when the script is executed. It can be an atomic value such as a number, string, boolean or resource identifier (source id, model id, etc.), or a composite value such as a list or a map
Instances the data points that represent the entity you want to model, also known as observations or examples. They are usually the rows in your data with a value (potentially missing) for each field that describes the entity
Forecast interval a measure of the quality of time series forecasts. The interval of the forecast for a future data point sets the upper and lower bounds within which the forecast may lie with 95% confidence
Field importance a measure of each field’s importance for predicting the objective field relative to the other fields. It is computed by averaging the error each field helps to reduce at every tree split
K-means the canonical clustering algorithm, which attempts to fit a pre-specified number (k) of clusters to the dataset
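A minimal sketch of fitting k-means with a pre-specified k, using scikit-learn for illustration rather than BigML's own implementation:

```python
# Illustrative only: two well-separated groups of points, clustered with k = 2.
from sklearn.cluster import KMeans

points = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.cluster_centers_)  # one centroid per cluster
print(model.labels_)           # cluster assignment for each instance
```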
Labeled data the data including the objective field values that you are trying to predict. Labeled data is used for training supervised Machine Learning algorithms
Labs an experimental section that you can find in the top menu of your BigML account to quickly launch new features and prototypes, so users can test them before they are integrated in the Dashboard. The latest releases have the label “NEW”, and the features already integrated in the Dashboard that will disappear from the Labs section shortly have the label “LICENSED”
Learning the process of training your model with a given set of data. Usually, 80% of your dataset is used for training the model while the remaining 20% is set aside to validate the model
Level a weighted sum of all the preceding values in a time series model, with the weights being highest for the most recent values, and decreasing exponentially for past instances
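The standard simple-exponential-smoothing recursion makes this concrete (textbook formulation with smoothing parameter 0 < α < 1; BigML's exact parametrization may differ). Unrolling the recursion shows the exponentially decreasing weights on past observations y:

```latex
\ell_t = \alpha\, y_t + (1 - \alpha)\,\ell_{t-1}
       = \alpha \sum_{j=0}^{t-1} (1 - \alpha)^j\, y_{t-j} + (1 - \alpha)^t\, \ell_0
```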
Leverage the difference between the probability of the rule and the expected probability if the items were statistically independent
Lift how many times more often antecedent and consequent occur together than expected if they were statistically independent
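Expressed with the usual support notation (standard definitions), for a rule with antecedent A and consequent C, leverage and lift are:

```latex
\mathrm{leverage}(A \Rightarrow C) = \mathrm{support}(A \cup C) - \mathrm{support}(A)\,\mathrm{support}(C), \qquad
\mathrm{lift}(A \Rightarrow C) = \frac{\mathrm{support}(A \cup C)}{\mathrm{support}(A)\,\mathrm{support}(C)}
```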
Linear regression a popular technique from the field of statistics that has been borrowed by Machine Learning to solve regression problems. Linear regression assumes the output variable, or objective field, is a linear combination of the inputs
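In its standard form, the prediction is a weighted sum of the input fields x_i with learned coefficients β_i:

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```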
Local predictions predictions made in your local environment by downloading your model; they are faster and incur no cost
Logistic regression another technique from the field of statistics that has been borrowed by Machine Learning to solve classification problems. For each class of the objective field, logistic regression fits a logistic function to the training data. Logistic regression is a linear model, in the sense that it assumes the probability of a given class is a function of a weighted combination of the inputs
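In the standard binary form, the class probability is obtained by applying the logistic function to a weighted combination of the inputs:

```latex
P(y = c \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```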
Admin permissions a level of permission assigned to a user in a private project in an organization. Admin permissions allow you to fully access the project resources and also invite other users and manage their permissions in the project
Member a role assigned to a user in a given organization. If you are a member, you can create new projects in the organization, fully access public projects and the private projects where you have at least read permissions
Missing value the data points that represent the entity you want to model may have missing values, i.e., they may not provide a value for all the fields that compose the entity
Machine Learning a branch of Artificial Intelligence that explores data in order to find complex patterns that can be useful to make predictions. This is possible thanks to Machine Learning algorithms that can leverage the distributed computational power of present day cloud-based platforms
Model a single decision tree-like model when we refer to it in particular, and a predictive model when we refer to it in general
Cluster neighborhood the points nearest to the centroid, which may ultimately form a cluster
Node each split or rule in a decision tree. Each node tries to maximize the information gain in the case of classification models or minimize the mean squared error in the case of regression models
Node threshold the maximum number of nodes that a BigML model is allowed to grow
Non-preferred fields fields that, for a number of possible reasons, are by default not included in the modeling process. One example of this is fields that contain the same value for every instance; in general, constant fields add no information to the modeling process
Objective Field the field that a regression or classification model will predict (also known as target)
OptiML an automated optimization process for model selection and parametrization (or hyperparametrization) to solve classification and regression problems
Organization a collaborative workspace where all the users in the organization can access, work on, and visualize the same projects and resources in the BigML Dashboard. Furthermore, organizations enable you to define different roles and permissions for each user involved in your Machine Learning projects
Orthogonal the default “one-hot” coding is orthogonal since a single instance cannot belong to two categories at the same time, so after recoding we also need to ensure there are no co-dependent coefficients. Orthogonality is met when the dot product of the codings equals 0; e.g., the codings [1,1,-1,-1] and [-1,1,0,0] are orthogonal since (1)*(-1)+(1)*(1)+(-1)*(0)+(-1)*(0) = 0
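A quick numeric check of the example above (plain Python, for illustration):

```python
# The dot product of the two codings is 0, so they are orthogonal.
a = [1, 1, -1, -1]
b = [-1, 1, 0, 0]
print(sum(x * y for x, y in zip(a, b)))  # -> 0
```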
Output one of the values computed by a WhizzML execution. Any variable defined in the executed script can be selected as an output, either in the script definition or in the execution request. It can be any valid WhizzML value (number, string, boolean, list, map, etc.)
Overfitting the process of tailoring the model to fit the training data at the expense of generalization
Owner a role assigned to a user in a given organization. There can only be one owner per organization. By default, the owner is the creator of the organization. If you are the owner, you can manage the organization account (including the billing), invite new users to the organization, and assign roles to those users. You can also create new projects in the organization, and fully access public and private projects
Prediction Path the series of rules that lead to a certain node in a decision tree
PCA Principal Component Analysis is an unsupervised Machine Learning technique used to transform a dataset in order to yield uncorrelated features and reduce dimensionality to build other models
Permissions the permissions in an organization regulate the actions that each user can take in each project: basically, the editing of the project metadata, the ability to invite other users, and the access to the project resources. There are three levels of permissions: admin, write, and read
Pivoting the process of generating new columns (which we also call fields or features) based on the distinct values of a previous column and the metric of another column
Predicate a statement that can be either true or false depending on the values of its component variables. BigML predicates may use the boolean operators =, <=, >=, <, >, and in. Examples of predicates are balance < 1,000 and field x = "category"
Predicting the process of obtaining the objective field value for your new data using an existing model. The model returns the predicted value along with a performance measure (confidence for classification or expected error for regression)
Predictive Model a machine-learned model that has been created using statistical learning. It can help describe or infer some statistical properties of an entity using the instances provided by a dataset
Predictors the fields your model uses as inputs to generate the set of rules to make predictions
Principal components the set of uncorrelated variables that PCA calculates as linear transformations of the original dataset field values. Each component has an associated variance which indicates the amount of the variability in the data that it captures
Private projects a privacy setting for projects in organizations. Members and restricted members can only access a private project if they have at least read permissions. The owner and admins of the organization can access all private projects with admin permissions
Production Mode an operational mode (which we also call prod mode) in which you are charged when you create BigML resources bigger than 16 MB. The maximum task size and the number of parallel tasks that you can run is determined by your subscription level
Project an abstract resource that helps you group related BigML resources together
Projections PCA predictions are called projections in BigML. For PCA, only batch projections are offered in the Dashboard, i.e., projections for multiple instances simultaneously. A batch projection is computed as the inner product of each instance’s input vector and the components attribute of the PCA
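A minimal numpy sketch of a batch projection as described above; the component matrix here is made up for illustration and is not taken from a real BigML PCA:

```python
import numpy as np

# Input instances, one row per instance (real PCA pipelines typically
# center the inputs first).
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical components matrix: one column per principal component.
W = np.array([[0.8, -0.6],
              [0.6,  0.8]])

Z = X @ W  # inner product per instance -> one projection row each
print(Z)
```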
Public projects a privacy setting for projects in organizations. Anyone in a given organization can access a public project. All the users will have write permissions within public projects
Random Decision Forests an ensemble-based algorithm which selects a random subset of features at each split to generate each single tree
Read permissions a level of permission assigned to a user in a private project in an organization. Read permissions allow you to view the project resources
Regression a modeling task whose objective field (i.e., the field being predicted) is numeric
Resource any of the Machine Learning objects provided by BigML that can be used as a building block in the workflows needed to solve Machine Learning problems
Restricted member a role assigned to a user in a given organization. If you are a restricted member, you cannot create new projects in the organization. You can fully access public projects and the private projects where you have at least read permissions
Role the role in an organization regulates the actions that each user can take in the organization: basically, the management of the organization account, and the access to projects and the ability to create them. There are four different roles in an organization: owner, admin, member, and restricted member
Sample a portion of your dataset that provides fast access to the raw data on an on-demand basis
Sampling the process of partitioning your dataset to consider just a subset of your instances
Script compiled source code written in WhizzML for automating Machine Learning workflows and implementing high-level algorithms
Seasonality a pattern of variation that repeats over consecutive periods, i.e., fluctuations of fixed length. Seasonality is one of the three components of time series models, along with the error and the trend
Source the BigML resource that represents the data source to which you wish to apply Machine Learning. A data source stores an arbitrarily-large collection of instances. A BigML source helps you ensure that your data is parsed correctly. The BigML preferred format for data sources is tabular data in which each row is used to represent one of the instances, and each column is used to represent a field of each instance
Statistical test a resource that automatically runs some advanced statistical tests on the numeric fields of a dataset. The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns. Statistical tests are useful in tasks such as fraud, normality, or outlier detection
Standard Deviation in the OptiML result, each model has a value for the optimization metric and an associated standard deviation. The standard deviation indicates the potential variation of the optimization metric depending on the random split of the dataset used to train and evaluate the model. Therefore, the range of potential values for the optimized metric can be calculated as the metric value +/- the standard deviation: if a model has a 70% accuracy with a standard deviation of 5%, the accuracy may take any value between 65% and 75%. The standard deviation is calculated taking into account the different values of the optimization metric achieved by the model during cross-validation
Supervised learning a type of Machine Learning problem in which each instance of the data has a label. The label for each instance is provided in the training data, and a supervised Machine Learning algorithm learns a function or model that will predict the label given all other features in the data. The function can then be applied to data unseen during training to predict the label for unlabeled instances
Support the proportion of instances in the dataset which contain an itemset. The support of an association rule is the portion of instances which contain both the rule’s antecedent and the rule’s consequent over the total number of instances (N) in the dataset
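Written out, with N the total number of instances:

```latex
\mathrm{support}(A \Rightarrow C) = \frac{\lvert\{\text{instances containing both } A \text{ and } C\}\rvert}{N}
```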
Tag cloud a visualization of a text field in which each term is sized according to the number of instances in which it appeared in that field
Task the process of creating a BigML resource, such as creating a dataset or training a model. A given task can also create subtasks, as in the case of a WhizzML script that contains calls to create other resources
Time series a sequentially indexed representation of your historical data that can be used to forecast future values of numerical properties. BigML implements exponential smoothing, where the smoothing parameters assign exponentially increasing weights to the most recent instances. Exponential smoothing methods allow modeling data with trend and seasonal patterns
Tokenization the strategy to split the text into several unique values
Topic the output of a topic model. Each topic is a distribution over terms that are thematically related. Each term has a different probability within a topic: the higher the probability, the more relevant the term is for that topic
Topic Distribution a topic distribution is created using a topic model and the new text (input data) for which you wish to obtain the topic probabilities
Topic Model an unsupervised Machine Learning task which identifies the relevant topics in the dataset text fields. Topic models in BigML are an optimized implementation of the Latent Dirichlet Allocation algorithm, a probabilistic method to find topics in large archives of documents
Tree a data structure that can be described as a collection of nodes, starting at a Root node, where each node may recursively have a number of child nodes. Nodes that have no children are called leaves
Root the node from which a Tree originates
Leaf a terminal node in a Tree, i.e., a node that has no children
Trend represents the long term trajectory of the time-based data. The trend is one of the three components of time series models along with the error and the seasonality
Unlabeled data data without objective field information. This data is used for unsupervised learning, as results are obtained from the data patterns without needing any specific target values
Unsupervised learning a type of Machine Learning problem in which the objective is not to learn a predictor, and thus does not require each instance to be labeled. Typically, unsupervised learning algorithms infer some summarizing structure over the dataset, such as a clustering or a set of association rules
Variance (PCA) the variance of each component in PCA indicates the amount of the total variability in the data explained by that component. A higher variance for a given component makes it a better candidate to select as input for other models
WhizzML BigML’s domain-specific language for automating complex Machine Learning workflows and implementing high-level algorithms
Data Wrangling the process of converting or mapping data from one “raw” form into another format that allows a more convenient use of the data
Write permissions a level of permission assigned to a user in a private project in an organization. Write permissions allow you to fully access the project resources, i.e., create, update, move, and delete resources