Organizations with the BigML Dashboard
Glossary
Admin a role assigned to a user in a given organization. If you are an admin, you can invite new users to the organization and assign roles to those users. You can also create new projects in the organization, and fully access public and private projects
Aggregation the process of gathering information and representing it as a summary
Artificial Intelligence the science that focuses on making intelligent machines, especially intelligent computer programs
AIC (Akaike Information Criterion) measures the trade-off between the model goodness-of-fit and the model complexity to avoid overfitting. The lower the AIC, the better
AICc (Corrected Akaike Information Criterion) measures the same trade-off as the AIC, but it is more accurate at avoiding overfitting for small datasets. For larger datasets it yields the same results as the AIC
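For reference, the standard textbook definitions (BigML's exact computation may differ), with k the number of model parameters, n the number of instances, and L̂ the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
```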
Algorithm the procedures or formulas used for solving a problem
Anomaly Score an anomaly detector assigns an anomaly score to each instance of the input dataset. Additionally, you can use an anomaly detector to calculate the anomaly score of new data instances. An anomaly score is a percentage between 0% and 100%, with higher scores indicating a higher degree of anomaly
Anomaly Detection an unsupervised Machine Learning task which identifies instances in a dataset that do not conform to a regular pattern
Antecedent the left-hand-side itemset of an association rule
Argument a configurable parameter for any BigML resource
Association Discovery an unsupervised Machine Learning task that finds relationships between values in high-dimensional datasets. It is commonly used for market basket analysis
Bagging an ensemble-based algorithm that uses a random subset of instances to generate each single tree
Batch Anomaly Score after you create an anomaly detector, you can use this option to score multiple instances at once
Batch Topic Distribution a batch topic distribution is created using a topic model and a dataset containing the instances (input data) for which you wish to obtain the topic probabilities
BIC (Schwarz Bayesian Information Criterion) measures the same trade-off as the AIC between model goodness-of-fit and model complexity, but it penalizes complexity more heavily to reduce the risk of overfitting. The lower the BIC, the better
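The corresponding textbook definition (again, BigML's exact computation may differ); the k ln n term grows with dataset size, which is why the complexity penalty is heavier than the AIC's constant factor of 2:

```latex
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```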
Centroids the center of a cluster found by a clustering algorithm. Centroids are computed using the mean for numeric fields and the mode for categorical fields. For text and items fields, the algorithm selects the value that minimizes the average cosine distance between the centroid and the points in its neighborhood
Cleansing the process of detecting and improving incorrect or incomplete parts of your raw data
Classification a modeling task whose objective field (i.e., the field being predicted) is categorical, so the model predicts classes
Clustering an unsupervised Machine Learning task in which dataset instances are grouped into geometrically related subsets
Confidence an indicator of the prediction’s certainty for classification models and ensembles. It takes into account the class distribution and the number of instances at a certain node. It is a value between 0% and 100%
Confidence (Associations) the ratio of the number of instances containing both the antecedent and the consequent to the number of instances containing only the antecedent
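In the usual association-rule notation, with A the antecedent itemset and C the consequent itemset:

```latex
\mathrm{confidence}(A \Rightarrow C) = \frac{\mathrm{support}(A \cup C)}{\mathrm{support}(A)}
```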
Consequent the right-hand-side itemset of an association rule
Correlation a BigML resource which computes advanced statistics for the fields in your dataset by applying various exploratory data analysis techniques to compare the distributions of the fields in your dataset against an objective field
Coverage the support of the antecedent of an association rule, i.e., the portion of instances in the dataset which contain the antecedent itemset
K-fold cross-validation k-fold cross-validation automatically splits your dataset into k complementary subsets. One of the subsets is used for evaluation while the remaining k - 1 subsets are used for training the model. This process is performed k times, each time using different parts of the data for training and testing, finally yielding k different models and k evaluations. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the potential error derived from randomly selecting an overly optimistic testing dataset
Monte Carlo cross-validation Monte Carlo cross-validation randomly splits the dataset into training and test subsets n times. For each split, the model is built using the training data and evaluated using the test data. The advantage of this method (over k-fold cross-validation) is that the proportion of the training and validation subsets does not depend on the number of iterations (folds). The disadvantage is that some subsets may overlap while some instances may not be selected in any subset. Cross-validation evaluations usually yield more accurate results than single evaluations since they avoid the potential error derived from randomly selecting an overly optimistic testing dataset
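A minimal sketch contrasting the two splitting schemes, using scikit-learn's splitters for illustration only (this is not how BigML performs cross-validation internally):

```python
# Illustrative comparison of k-fold vs. Monte Carlo cross-validation splits.
from sklearn.model_selection import KFold, ShuffleSplit

data = list(range(20))  # stand-in for dataset row indices

# k-fold: k complementary, non-overlapping test subsets
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Monte Carlo: n independent random train/test splits; test sets may overlap
monte_carlo = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for splitter in (kfold, monte_carlo):
    for train_idx, test_idx in splitter.split(data):
        pass  # train on data[train_idx], evaluate on data[test_idx]
```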
Damping parameter a configurable parameter of time series models that “dampens” the trend to a flat line at some point in the future so the trend does not grow indefinitely
Dashboard The BigML web-based interface that helps you privately navigate, visualize, and interact with your modeling resources
Dataset the structured version of a BigML source. It is used as input to build your predictive models. For each field in your dataset a number of basic statistics (min, max, mean, etc.) are parsed and produced as output
Decision Trees a class of Machine Learning algorithms used to solve regression and classification problems. Decision trees are composed of nodes and branches that create a model of decisions with a tree graph. Nodes represent the predictors or labels that have an influence on the predictive path, and the branches represent the rules followed by the algorithm to make a given prediction
Deepnets an optimized implementation of deep neural networks, a class of supervised learning algorithms that can be used to solve regression and classification problems. The input features are fed to one or several groups of “nodes”; each group of nodes forms a “layer”. Each node is essentially a function of the input that transforms the input features into another value or collection of values. This process continues layer by layer until we reach the final output (prediction): an array of per-class probabilities for classification problems, or a single real value for regression problems
Denormalizing (also denormalization) the strategy of increasing the performance of a dataset by grouping data
Development Mode a free operational mode (which we also call dev mode) that lets you perform any task with BigML at no cost, provided the task size does not exceed 16 MB
Discretization the process of transforming a numeric field into a categorical field
Early split when the dataset is bigger than 34 GB, BigML automatically takes a sample of your data and performs an early split so that model creation becomes significantly faster. It detects when an early split is safe by calculating the summary statistics collected at each node. Early splitting requires that the training data be shuffled beforehand to avoid generating inaccurate models caused by ordered fields in the input rows (as it will process the first x instances, then the next x ones, and so on)
Ensembles a class of Machine Learning algorithms in which multiple independent classifiers or regressors are trained, and the combination of these classifiers is used to predict an objective field. An ensemble of models built on samples of the data can become a powerful predictor by averaging away the errors of each individual model
Entity the object or subject of interest in your modeling task. A dataset is a collection of instances of the entity of interest
Error represents the unpredictable variations in the time series data, and how they influence observed values. The error is one of the three components of time series models along with the trend and the seasonality
Evaluating the process of computing and estimating the performance of your model. Different performance measures are computed depending on the type of model (classification vs. regression models)
Evaluation a resource representing an assessment of the performance of a predictive model
Expected error an indicator of the prediction’s certainty for regression models and ensembles. It is the average of the prediction errors at a certain node
Feature Engineering the process of generating new features for a dataset so that Machine Learning algorithms will be more effective on that data. The features can either be transformations of existing features or entirely new information
Field an attribute of each instance in your data. Also called "feature", "covariate", or "predictor". Each field is associated with a type (numeric, categorical, text, items, or date-time)
Flatline a domain-specific lisp-like language that allows you to perform an infinite number of operations to create new fields or filter your BigML datasets. Furthermore, with the Flatline Editor you will be able to validate your Flatline expressions and preview the results from your Dashboard
Forecast the prediction of a time series model for future data points
Fusions a supervised model that solves classification and regression problems by averaging the predictions of multiple models, ensembles, logistic regressions, and/or deepnets. Fusions are based on the same “wisdom of the crowds” principle as ensembles, under which the combination of multiple models often performs better than any of its individual members.
BigML Gallery a section of BigML to share, buy or sell datasets, models, and scripts.
Gaussian distribution a symmetric probability distribution in which the majority of the mass is clustered about a mean value and values increasingly far from the mean in either direction are increasingly unlikely. Also called the normal distribution or the bell curve
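For reference, the Gaussian density with mean μ and standard deviation σ:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```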
G-means a clustering algorithm that tries to learn the number of different clusters by iteratively taking existing clusters and testing whether each cluster’s neighborhood follows a Gaussian distribution
Histogram a bar chart-style visualization of a collection of values, in which the range of the values is broken up into a collection of ranges, and the height of a given bar increases as more points fall into the range associated with that bar
Isolation Forest the algorithm used to detect anomalies. This algorithm uses an ensemble of randomized trees to generate anomaly scores
Input the value used to parameterize WhizzML scripts, provided when the script is executed. It can be an atomic value such as a number, string, boolean or resource identifier (source id, model id, etc.), or a composite value such as a list or a map
Instances the data points that represent the entity you want to model, also known as observations or examples. They are usually the rows in your data with a value (potentially missing) for each field that describes the entity
Forecast interval a measure of the quality of time series forecasts. The interval of the forecast for a future data point sets the upper and lower bounds within which the forecast may lie with 95% confidence
Field importance a measure of each field’s importance for predicting the objective field relative to the other fields. It is computed by averaging the error each field helps to reduce at every tree split
K-means the canonical clustering algorithm, which attempts to fit a pre-specified number (k) of clusters to the dataset
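A minimal sketch of fitting k-means with a pre-specified k, using scikit-learn for illustration rather than BigML's own implementation:

```python
# Illustrative only: two well-separated groups of points, clustered with k = 2.
from sklearn.cluster import KMeans

points = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.cluster_centers_)  # one centroid per cluster
print(model.labels_)           # cluster assignment for each instance
```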
Labeled data the data including the objective field values that you are trying to predict. Labeled data is used for training supervised Machine Learning algorithms
Labs an experimental section that you can find in the top menu of your BigML account to quickly launch new features and prototypes, so users can test them before they are integrated in the Dashboard. The latest releases have the label “NEW”, and the features already integrated in the Dashboard that will disappear from the Labs section shortly have the label “LICENSED”
Learning the process of training your model with a given set of data. Usually, 80% of your dataset is used for training the model while the remaining 20% is set aside to validate the model
Level a weighted sum of all the preceding values in a time series model, with the weights being highest for the most recent values, and decreasing exponentially for past instances
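The standard simple-exponential-smoothing recursion makes this concrete (textbook formulation with smoothing parameter 0 < α < 1; BigML's exact parametrization may differ). Unrolling the recursion shows the exponentially decreasing weights on past observations y:

```latex
\ell_t = \alpha\, y_t + (1 - \alpha)\,\ell_{t-1}
       = \alpha \sum_{j=0}^{t-1} (1 - \alpha)^j\, y_{t-j} + (1 - \alpha)^t\, \ell_0
```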
Leverage the difference between the probability of the rule and the expected probability if the items were statistically independent
Lift how many times more often antecedent and consequent occur together than expected if they were statistically independent
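Expressed with the usual support notation (standard definitions), for a rule with antecedent A and consequent C, leverage and lift are:

```latex
\mathrm{leverage}(A \Rightarrow C) = \mathrm{support}(A \cup C) - \mathrm{support}(A)\,\mathrm{support}(C), \qquad
\mathrm{lift}(A \Rightarrow C) = \frac{\mathrm{support}(A \cup C)}{\mathrm{support}(A)\,\mathrm{support}(C)}
```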
Linear regression a popular technique from the field of statistics that has been borrowed by Machine Learning to solve regression problems. Linear regression assumes the output variable, or objective field, is a linear combination of the inputs
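In its standard form, the prediction is a weighted sum of the input fields x_i with learned coefficients β_i:

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```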
Local predictions predictions made in your local environment by downloading your model; they are faster and incur no cost
Logistic regression another technique from the field of statistics that has been borrowed by Machine Learning to solve classification problems. For each class of the objective field, logistic regression fits a logistic function to the training data. Logistic regression is a linear model, in the sense that it assumes the probability of a given class is a function of a weighted combination of the inputs
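In the standard binary form, the class probability is obtained by applying the logistic function to a weighted combination of the inputs:

```latex
P(y = c \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
```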
Admin permissions a level of permission assigned to a user in a private project in an organization. Admin permissions allow you to fully access the project resources and also invite other users and manage their permissions in the project
Member a role assigned to a user in a given organization. If you are a member, you can create new projects in the organization, fully access public projects and the private projects where you have at least read permissions
Missing value the data points that represent the entity you want to model may have missing values, i.e., they may not provide a value for all the fields that compose the entity
Machine Learning a branch of Artificial Intelligence that explores data in order to find complex patterns that can be useful to make predictions. This is possible thanks to Machine Learning algorithms that can leverage the distributed computational power of present day cloud-based platforms
Model a single decision tree-like model when we refer to it in particular, and a predictive model when we refer to it in general
Cluster neighborhood the points nearest to the centroid, which may ultimately form a cluster
Node each split or rule in a decision tree. Each node tries to maximize the information gain in the case of classification models or minimize the mean squared error in the case of regression models
Node threshold the maximum number of nodes that a BigML model is allowed to grow
Non-preferred fields fields that, for a number of possible reasons, are by default not included in the modeling process. One example of this is fields that contain the same value for every instance; in general, constant fields add no information to the modeling process
Objective Field the field that a regression or classification model will predict (also known as target)
OptiML an automated optimization process for model selection and parametrization (or hyperparametrization) to solve classification and regression problems
Organization a collaborative workspace where all the users in the organization can access, work on, and visualize the same projects and resources in the BigML Dashboard. Furthermore, organizations enable you to define different roles and permissions for each user involved in your Machine Learning projects
Orthogonal the default “one-hot” coding is orthogonal since a single instance cannot belong to two categories at the same time, so after recoding we also need to ensure there are no co-dependent coefficients. Orthogonality is met when the dot product of the codings equals 0; e.g., the codings [1,1,-1,-1] and [-1,1,0,0] are orthogonal since (1)*(-1)+(1)*(1)+(-1)*(0)+(-1)*(0) = 0
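A quick numeric check of the example above (plain Python, for illustration):

```python
# The dot product of the two codings is 0, so they are orthogonal.
a = [1, 1, -1, -1]
b = [-1, 1, 0, 0]
print(sum(x * y for x, y in zip(a, b)))  # -> 0
```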
Output one of the values computed by a WhizzML execution. Any variable defined in the executed script can be selected as an output, either in the script definition or in the execution request. It can be any valid WhizzML value (number, string, boolean, list, map, etc.)
Overfitting the process of tailoring the model to fit the training data at the expense of generalization
Owner a role assigned to a user in a given organization. There can only be one owner per organization. By default, the owner is the creator of the organization. If you are the owner, you can manage the organization account (including the billing), invite new users to the organization, and assign roles to those users. You can also create new projects in the organization, and fully access public and private projects
Prediction Path the series of rules that lead to a certain node in a decision tree
PCA Principal Component Analysis is an unsupervised Machine Learning technique used to transform a dataset in order to yield uncorrelated features and reduce dimensionality to build other models
Permissions the permissions in an organization regulate the actions that each user can take in each project: basically, the editing of the project metadata, the ability to invite other users, and the access to the project resources. There are three levels of permissions: admin, write, and read
Pivoting the process of generating new columns (which we also call fields or features) based on the distinct values of a previous column and the metric of another column
Predicate a statement that can be either true or false depending on the values of its component variables. BigML predicates may use the boolean operators =, <=, >=, <, >, and in. Examples of predicates are balance < 1,000 and field x = "category"
Predicting the process of obtaining the objective field value for your new data using an existing model. The model returns the predicted value along with a performance measure (confidence for classification or expected error for regression)
Predictive Model a machine-learned model that has been created using statistical learning. It can help describe or infer some statistical properties of an entity using the instances provided by a dataset
Predictors the fields your model uses as inputs to generate the set of rules to make predictions
Principal components the set of uncorrelated variables that PCA calculates as linear transformations of the original dataset field values. Each component has an associated variance which indicates the amount of the variability in the data that it captures
Private projects a privacy setting for projects in organizations. Members and restricted members can only access a private project if they have at least read permissions. The owner and admins of the organization can access all private projects with admin permissions
Production Mode an operational mode (which we also call prod mode) in which you are charged when you create BigML resources bigger than 16 MB. The maximum task size and the number of parallel tasks that you can run is determined by your subscription level
Project an abstract resource that helps you group related BigML resources together
Projections PCA predictions are called projections in BigML. For PCA, only batch projections are offered in the Dashboard, i.e., projections for multiple instances simultaneously. A batch projection is computed as the inner product of each instance’s input vector and the components attribute of the PCA
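A minimal numpy sketch of a batch projection as described above; the component matrix here is made up for illustration and is not taken from a real BigML PCA:

```python
import numpy as np

# Input instances, one row per instance (real PCA pipelines typically
# center the inputs first).
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Hypothetical components matrix: one column per principal component.
W = np.array([[0.8, -0.6],
              [0.6,  0.8]])

Z = X @ W  # inner product per instance -> one projection row each
print(Z)
```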
Public projects a privacy setting for projects in organizations. Anyone in a given organization can access a public project. All the users will have write permissions within public projects
Random Decision Forests an ensemble-based algorithm which selects a random subset of features at each split to generate each single tree
Read permissions a level of permission assigned to a user in a private project in an organization. Read permissions allow you to view the project resources
Regression a modeling task whose objective field (i.e., the field being predicted) is numeric
Resource any of the Machine Learning objects provided by BigML that can be used as a building block in the workflows needed to solve Machine Learning problems
Restricted member a role assigned to a user in a given organization. If you are a restricted member, you cannot create new projects in the organization. You can fully access public projects and the private projects where you have at least read permissions
Role the role in an organization regulates the actions that each user can take in the organization: basically, the management of the organization account, and the access to projects and the ability to create them. There are four different roles in an organization: owner, admin, member, and restricted member
Sample a portion of your dataset that provides fast access to the raw data on an on-demand basis
Sampling the process of partitioning your dataset to consider just a subset of your instances
Script compiled source code written in WhizzML for automating Machine Learning workflows and implementing high-level algorithms
Seasonality a pattern of variation that repeats over consecutive periods, i.e., fluctuations of fixed length. Seasonality is one of the three components of time series models, along with the error and the trend
Source the BigML resource that represents the data source to which you wish to apply Machine Learning. A data source stores an arbitrarily-large collection of instances. A BigML source helps you ensure that your data is parsed correctly. The BigML preferred format for data sources is tabular data in which each row is used to represent one of the instances, and each column is used to represent a field of each instance
Statistical test a resource that automatically runs some advanced statistical tests on the numeric fields of a dataset. The goal of these tests is to check whether the values of individual fields conform to or differ from some distribution patterns. Statistical tests are useful in tasks such as fraud, normality, or outlier detection
Standard Deviation in the OptiML result, each model has a value for the optimization metric and an associated standard deviation. The standard deviation indicates the potential variation of the optimization metric depending on the random split of the dataset used to train and evaluate the model. Therefore, the range of potential values for the optimized metric can be calculated as the metric value +/- the standard deviation: if a model has a 70% accuracy with a standard deviation of 5%, the accuracy may take any value between 65% and 75%. The standard deviation is calculated taking into account the different values of the optimization metric achieved by the model during cross-validation
Supervised learning a type of Machine Learning problem in which each instance of the data has a label. The label for each instance is provided in the training data, and a supervised Machine Learning algorithm learns a function or model that will predict the label given all other features in the data. The function can then be applied to data unseen during training to predict the label for unlabeled instances
Support the proportion of instances in the dataset which contain an itemset. The support of an association rule is the portion of instances which contain both the rule’s antecedent and the rule’s consequent over the total number of instances (N) in the dataset
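Written out, with N the total number of instances:

```latex
\mathrm{support}(A \Rightarrow C) = \frac{\lvert\{\text{instances containing both } A \text{ and } C\}\rvert}{N}
```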
Tag cloud a visualization of a text field in which each term is sized according to the number of instances in which it appeared in that field
Task the process of creating a BigML resource, such as creating a dataset or training a model. A given task can also create subtasks, as in the case of a WhizzML script that contains calls to create other resources
Time series a sequentially indexed representation of your historical data that can be used to forecast future values of numerical properties. BigML implements exponential smoothing, where the smoothing parameters assign exponentially increasing weights to the most recent instances. Exponential smoothing methods allow modeling data with trend and seasonal patterns
Tokenization the strategy to split the text into several unique values
Topic the output of a topic model. Each topic is a distribution over terms that are thematically related. Each term has a different probability within a topic: the higher the probability, the more relevant the term is for that topic
Topic Distribution a topic distribution is created using a topic model and the new text (input data) for which you wish to obtain the topic probabilities
Topic Model an unsupervised Machine Learning task which identifies the relevant topics in the dataset text fields. Topic models in BigML are an optimized implementation of the Latent Dirichlet Allocation algorithm, a probabilistic method to find topics in large archives of documents
Tree a data structure that can be described as a collection of nodes, starting at a Root node, where each node may recursively have a number of child nodes. Nodes that have no children are called leaves
Root the node from which a Tree originates
Leaf a terminal node in a Tree, i.e., a node that has no children
Trend represents the long term trajectory of the time-based data. The trend is one of the three components of time series models along with the error and the seasonality
Unlabeled data data without objective field information. This data is used for unsupervised learning, as results are obtained from the data patterns without needing any specific target values
Unsupervised learning a type of Machine Learning problem in which the objective is not to learn a predictor, and thus does not require each instance to be labeled. Typically, unsupervised learning algorithms infer some summarizing structure over the dataset, such as a clustering or a set of association rules
Variance (PCA) the variance of each component in PCA indicates the amount of the total variability in the data explained by that component. A higher variance for a given component makes it a better candidate to select as input for other models
WhizzML BigML’s domain-specific language for automating complex Machine Learning workflows and implementing high-level algorithms
Data Wrangling the process of converting or mapping data from one “raw” form into another format that allows a more convenient use of the data
Write permissions a level of permission assigned to a user in a private project in an organization. Write permissions allow you to fully access the project resources, i.e., create, update, move, and delete resources