Recommended algorithm components
Recommended algorithm components include common general algorithms (such as data reading algorithms, SQL scripts, Python scripts) and LLM data processing algorithms (such as LLM data processing, LVM data processing), along with LLM training and inference algorithms. We recommend DLC-based algorithm components, which support heterogeneous resources and user-defined environments for more flexible usage.
Type | Component | Description
Custom components | You can create custom components in AI computing asset management, and then use them together with official components in Designer.
Data source/target | Reads files or directories from Object Storage Service (OSS) buckets.
Reads CSV files from OSS, HTTP, and HDFS.
Reads data from MaxCompute tables, by default, in the current project.
Writes upstream data to MaxCompute.
User defined script | A custom SQL component that allows you to write SQL statements in an editor and submit them to MaxCompute for execution.
Defines dependencies and runs custom Python functions.
Tools | Dataset Register | Registers datasets to AI asset management.
Model Register | Registers models to AI asset management.
Update EAS Service (Beta) | Calls eascmd to update the specified EAS service. The service to be updated must be in the running state. A new service version will be created each time.
Large model data preprocessing | Data conversion | Exports data from MaxCompute tables to OSS.
Imports data from OSS to MaxCompute tables.
LLM data processing (DLC) | Calculates the MD5 hash values of text and deduplicates text based on the hash values (a minimal deduplication sketch follows this group).
Normalizes Unicode text and converts traditional Chinese to simplified Chinese.
Removes URLs from text. It can also remove HTML format characters and parse HTML text.
Filters samples based on the proportion of special characters, keeping samples within the specified ratio range.
Deletes copyright information from text, often used to remove header copyright comments from code text.
Filters samples based on the ratio of numbers and alphabetic characters.
Filters samples based on text length, average length, maximum line length, etc.
Identifies the language of text and calculates scores, then filters samples based on the language and score.
Filters out samples containing sensitive words.
Masks sensitive information, such as replacing email addresses with [EMAIL], phone/telephone numbers with [TELEPHONE] or [MOBILEPHONE], and ID card numbers with [IDNUM].
Calculates similarity between texts using the SimHash algorithm to achieve text deduplication.
Keeps samples with character-level or word-level N-Gram repetition ratios within the specified range.
Used for TEX document format data. It performs inline expansion on all macros without parameters. If a macro consists of letters and numbers and has no parameters, the macro name is replaced with the macro value.
Used for TEX document format data. It deletes bibliographies at the end of LaTeX format text.
Used for TEX document format data. It deletes comment lines and inline comments in LaTeX format text.
Used for TEX document format data. It finds the first string matching the <section-type>[optional-args]{name} chapter format and deletes all content before it, keeping all content after the first matched chapter, including the chapter title.
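The hash-based deduplication described above can be approximated in plain Python. This is a minimal sketch, not the component's implementation; the function name and sample data are illustrative.

```python
import hashlib

def md5_deduplicate(texts):
    """Keep the first occurrence of each text, keyed by its MD5 digest."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

# Example: the second "hello world" is dropped.
print(md5_deduplicate(["hello world", "foo bar", "hello world"]))
```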
LLM data processing (MaxCompute) | Used for text data preprocessing for LLMs. It calculates MD5 hash values of text and deduplicates text based on hash values.
Used for text data preprocessing for LLMs. It normalizes Unicode text and converts traditional Chinese to simplified Chinese.
Used for text data preprocessing for LLMs. It removes special content from text, such as navigation information, author information, article source information, URL links, and invisible characters; it can also remove HTML format characters and parse HTML text.
Used for text data preprocessing for LLMs. It filters samples based on the special character ratio, keeping samples where the proportion of special characters to total text length is within the specified range.
Used for text data preprocessing for LLMs. It deletes copyright information from text, often used to remove header copyright comments from code text.
Used for text data preprocessing for LLMs. It filters samples based on the count of letters, numbers, and separators.
Used for text data preprocessing for LLMs. It filters samples based on text length, average length, maximum line length, etc. By default, average length and maximum line length filtering split the text by line before calculating statistics.
LLM-Text Quality Predict and Language Identification-FastText (MaxCompute) | Used for text data preprocessing for LLMs. It identifies the language of text and calculates scores, and can filter samples based on language and score.
Used for text data preprocessing for LLMs. It filters out samples containing sensitive words.
Used for text data preprocessing for LLMs. It masks sensitive information, such as replacing email addresses with [EMAIL], phone/telephone numbers with [TELEPHONE] or [MOBILEPHONE], and ID card numbers with [IDNUM] (a regex-based masking sketch follows this group).
Used for text data preprocessing for LLMs. It deduplicates sentences within an article.
Used for text data preprocessing for LLMs. It keeps samples with character-level or word-level N-Gram repetition ratios within the specified range.
Used for text data preprocessing for LLMs, suitable for TEX document format data. It performs inline expansion on all macros without parameters. If a macro consists of letters and numbers and has no parameters, the macro name is replaced with the macro value.
Used for text data preprocessing for LLMs, suitable for TEX document format data. It deletes bibliographies at the end of LaTeX format text.
Used for text data preprocessing for LLMs, suitable for TEX document format data. It deletes comment lines and inline comments in LaTeX format text.
Used for text data preprocessing for LLMs, suitable for TEX document format data. It finds the first string matching the <section-type>[optional-args]{name} chapter format and deletes all content before it, keeping all content after the first matched chapter, including the chapter title.
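A regex-based sketch of the masking behavior described above (email, phone, and ID-number placeholders). The patterns are simplified illustrations, not the rules the component actually applies.

```python
import re

# Simplified patterns for illustration only; production rules are more involved.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MOBILE = re.compile(r"\b1\d{10}\b")        # 11-digit mobile number
IDNUM = re.compile(r"\b\d{17}[\dXx]\b")    # 18-character ID card number

def mask_sensitive(text):
    text = EMAIL.sub("[EMAIL]", text)
    text = MOBILE.sub("[MOBILEPHONE]", text)
    text = IDNUM.sub("[IDNUM]", text)
    return text

print(mask_sensitive("Contact alice@example.com or 13800138000."))
```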
LVM data processing (DLC) | Video data preprocessing | Filters video data with excessive text. It is particularly suitable for video editing and content moderation scenarios, helping users automatically identify and process video segments containing too much text, thereby improving work efficiency.
Filters video data with too fast or too slow motion.
Filters video data with low aesthetic scores.
Filters video data with aspect ratios that are too large or too small.
Filters video data with durations that are too long or too short.
Filters video data with low similarity scores.
Filters video data with high NSFW scores.
Filters video data with resolutions that are too high or too low.
Filters video data with watermarks.
Filters video data that does not match specified tags.
Calculates tags for video frames.
Generates text for videos.
Generates text for videos.
Image data preprocessing | Filters image data with low aesthetic scores.
Filters image data with aspect ratios that are too large or too small (a filtering sketch follows this group).
Filters image data with face proportions that are too large or too small.
Filters image data with high NSFW scores.
Filters image data with resolutions that are too high or too low.
Filters image data that is too large or too small.
Filters image data with low text-image match scores.
Filters image data with low text-image similarity scores.
Filters image data with watermarks.
Generates natural language descriptions for input images.
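A minimal sketch of aspect-ratio filtering for images, assuming Pillow is installed; the thresholds and file path are illustrative, and the actual component exposes its own parameters.

```python
from PIL import Image  # pip install pillow

def keep_by_aspect_ratio(path, min_ratio=0.5, max_ratio=2.0):
    """Keep an image only if width/height falls inside [min_ratio, max_ratio]."""
    with Image.open(path) as img:
        width, height = img.size
    ratio = width / height
    return min_ratio <= ratio <= max_ratio

# Hypothetical file: images wider than 2:1 or taller than 1:2 would be dropped.
# keep_by_aspect_ratio("sample.jpg")
```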
Large model training and inference | Supports some LLMs from PAI-Model Gallery.
Supports some LLMs from PAI-Model Gallery, converting online inference to offline inference.
Used for BERT model offline inference, utilizing trained BERT classification models to classify text in the input table (a minimal offline-classification sketch follows this table).
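A rough sketch of offline text classification with a trained BERT classifier, using the Hugging Face transformers pipeline as a stand-in; the model identifier is a placeholder, and this is not the component's actual runtime.

```python
from transformers import pipeline  # pip install transformers

# "your-finetuned-bert-classifier" is a placeholder for a trained model path or hub ID.
classifier = pipeline("text-classification", model="your-finetuned-bert-classifier")

rows = ["The product arrived broken.", "Great service, will buy again."]
for row, result in zip(rows, classifier(rows)):
    print(row, "->", result["label"], round(result["score"], 3))
```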
Traditional algorithm components
Traditional algorithm components are algorithms that were developed early on and have not been updated for a long time, so we cannot guarantee their stability. If you must use them in a production environment, evaluate their applicability first. If they are already used in production, replace them with the recommended components above as soon as possible.
Type | Component | Description
Data preprocessing | Performs random independent sampling on the input according to a given proportion or number.
Generates sampling data based on the values of weighted columns.
Filters data based on expressions, and you can modify the output field names.
Given grouping columns, it divides the input data into different groups based on the values of these columns and performs random sampling separately within each group.
Merges two tables by associating the columns in the tables and determines the output fields. It works like the JOIN statement of SQL.
Merges two tables by column. The two tables must have the same number of rows, otherwise an error occurs. If only one of the two tables has partitions, the partitioned table must be connected to the second input port.
Merges two tables by row. The numbers and data types of the output fields selected from the left and right tables must be the same. This component integrates the features of UNION and UNION ALL.
Converts features of any data type to STRING, DOUBLE, and INT features, and supports missing value filling when conversion exceptions occur.
Appends an ID column to the first column of a data table.
Randomly splits data to generate training and test datasets.
Handles missing data in datasets. You can configure the parameters of this component in the console or in PAI commands.
Normalizes dense data or sparse data.
Generates standardized instances in the console or by running PAI commands.
Converts a table in KV (Key:Value) format into a standard table format.
Converts a standard table into a KV (Key:Value) format table, in the console or by running PAI commands (a minimal KV-parsing sketch follows this group).
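A minimal sketch of the KV-to-table conversion described above, assuming the common `key:value,key:value` encoding; the separators are configurable in the actual components and the function names are illustrative.

```python
def kv_to_row(kv_string, item_sep=",", kv_sep=":"):
    """Parse 'k1:v1,k2:v2' into a {key: float(value)} mapping."""
    row = {}
    for item in kv_string.split(item_sep):
        if not item:
            continue
        key, value = item.split(kv_sep, 1)
        row[key.strip()] = float(value)
    return row

def row_to_kv(row, item_sep=",", kv_sep=":"):
    """Inverse direction: serialize a mapping back into KV format."""
    return item_sep.join(f"{k}{kv_sep}{v}" for k, v in row.items())

print(kv_to_row("1:0.5,3:2.0"))          # {'1': 0.5, '3': 2.0}
print(row_to_kv({"1": 0.5, "3": 2.0}))   # 1:0.5,3:2.0
```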
Feature engineering | Provides filtering functionality for components such as linear feature importance, GBDT feature importance, and random forest feature importance, and supports filtering the top N features.
A multivariate statistical method that reveals the internal structure among multiple correlated variables through a small number of principal components.
Performs common scaling transformations on numeric features in dense or sparse format (a minimal scaling sketch follows this group).
Discretizes continuous features based on a specific rule.
Smooths anomalous data in input features to a specific interval, supporting both sparse and dense data formats.
An important matrix decomposition in linear algebra, which is a generalization of the diagonalization of normal matrices in matrix analysis.
Detects data with continuous and enumeration features. It helps you identify anomalous points in your data.
Includes linear regression and binary logistic regression, and supports both sparse and dense data formats.
Analyzes the distribution of discrete features.
Calculates feature importance.
Selects and filters the top N feature data from all sparse or dense format feature data based on the feature selection method you use.
Encodes nonlinear features into linear features through GBDT.
Converts data into sparse data, and the output result is also in a sparse key-value structure.
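A sketch of one common scaling transformation (min-max scaling) on a numeric feature column; the actual feature-scaling component supports several transformations and both dense and sparse formats.

```python
import numpy as np

def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Scale a numeric column linearly into the given range."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.full_like(values, feature_range[0])
    scaled = (values - lo) / (hi - lo)
    return scaled * (feature_range[1] - feature_range[0]) + feature_range[0]

print(min_max_scale([3.0, 6.0, 9.0]))  # [0.  0.5 1. ]
```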
Statistical analysis | Helps you visually understand the distribution of features and label columns along with the characteristics of features, which facilitates subsequent data analysis.
Measures the joint variability of two variables.
Uses empirical distribution and kernel distribution algorithms to estimate the probability density of sample data.
Collects statistics about the data in a table or only about selected columns.
Used in scenarios where variables are categorical variables. It aims to test whether the actual observed frequency and theoretical frequency are consistent across classifications of a single multinomial categorical variable. The null hypothesis is that there is no difference between the observed frequency and theoretical frequency.
A box plot is a statistical graph used to display the dispersion of a set of data. It mainly reflects the distribution characteristics of the original data and can also be used to compare the distribution characteristics of multiple sets of data.
In regression analysis, a scatter plot shows the distribution of data points in a Cartesian coordinate system.
The correlation coefficient algorithm calculates the correlation coefficient between each pair of columns in a matrix, with values in the range [-1, 1]. Each coefficient is computed only over the rows where both columns are non-empty, so the effective count may differ between column pairs (a pairwise-correlation sketch follows this group).
Based on statistical principles, it tests whether there is a significant difference between the means of two samples.
Tests whether there is a significant difference between the overall mean of a variable and a specified value. The sample being tested must follow a normal distribution overall.
Determines whether the population follows a normal distribution by using observations. It is an important special type of goodness-of-fit hypothesis test in statistical decision-making.
Helps you see the income distribution of a country or region.
A statistical term used to calculate the percentile of column data in a data table.
A linear correlation coefficient that measures the linear correlation between two variables.
Also known as a mass distribution chart, a statistical graph that uses a series of vertical bars or line segments of varying heights to represent data distribution.
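A sketch of the pairwise correlation behavior described above: each coefficient is computed over the rows where both columns are non-empty, so the effective count can differ between column pairs. This uses NumPy as an illustration, not the component's code.

```python
import numpy as np

def pairwise_corr(a, b):
    """Pearson correlation over the rows where both columns are non-NaN."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mask = ~np.isnan(a) & ~np.isnan(b)     # jointly non-empty rows only
    if mask.sum() < 2:
        return float("nan")
    return float(np.corrcoef(a[mask], b[mask])[0, 1])

x = [1.0, 2.0, np.nan, 4.0]
y = [2.0, 4.1, 6.0, 8.2]
print(round(pairwise_corr(x, y), 4))  # uses only the 3 rows where both values exist
```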
Machine learning | Uses the training model and prediction data as input and generates prediction results as output.
An extension and upgrade of the boosting algorithm, with better usability and robustness, widely used in various machine learning production systems and competitions. It currently supports classification and regression.
An extension and upgrade of the boosting algorithm, with better usability and robustness, widely used in various machine learning production systems and competitions. It currently supports classification and regression.
A machine learning method based on statistical learning theory. It improves the generalization ability of the learning machine by seeking structural risk minimization, thereby minimizing empirical risk and the confidence range.
A binary classification algorithm that supports both sparse and dense data formats.
This component works by setting a threshold. If the feature value is greater than the threshold, it is classified as a positive sample. Otherwise, it is classified as a negative sample.
The parameter server (PS) is dedicated to solving large-scale offline and online training tasks. SMART (Scalable Multiple Additive Regression Tree) is an iterative GBDT (Gradient Boosting Decision Tree) algorithm implemented on PS.
A classic binary classification algorithm widely used in advertising and search scenarios.
The parameter server (PS) is dedicated to solving large-scale offline and online training tasks. SMART is an iterative GBDT algorithm implemented on PS.
For each row of data in the prediction table, selects the K nearest records from the training table and uses the most frequent class among these K records as the class of that row.
A binary classification algorithm. The logistic regression provided by PAI supports multiclass classification and both sparse and dense data formats.
A classifier that consists of multiple decision trees. The classification result is determined by the mode of the output classes of the individual trees.
A probabilistic classification algorithm based on Bayes' theorem with independence assumptions.
Randomly selects K objects as the initial clustering centers, then calculates the distance between the remaining objects and each cluster center, assigns them to the nearest cluster, and recalculates the clustering center of each cluster.
Builds clustering models.
Implements model classification.
Predicts the cluster to which new point data belongs based on the DBSCAN training model.
Performs clustering prediction based on trained Gaussian mixture models.
An iterative decision tree algorithm, suitable for linear and nonlinear regression scenarios.
A model that analyzes the linear relationship between a dependent variable and multiple independent variables.
Solves large-scale offline and online training tasks. SMART is an iterative GBDT algorithm implemented on PS.
A model that analyzes the linear relationship between a dependent variable and multiple independent variables. The PS is dedicated to solving large-scale offline and online training tasks.
Calculates AUC, KS, and F1 score metrics and generates KS curves, PR curves, ROC curves, lift charts, and gain charts (an evaluation sketch follows this group).
Evaluates the quality of regression algorithm models based on prediction results and original results, and outputs evaluation metrics and residual histograms.
Evaluates the quality of clustering models based on the original data and clustering results, and outputs evaluation metrics.
Suitable for supervised learning and corresponds to the matching matrix in unsupervised learning.
Evaluates multiclass classification algorithm models based on the prediction results and original results of classification models, and outputs evaluation metrics (such as Accuracy, Kappa, and F1-Score).
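A minimal sketch of binary-classification evaluation with scikit-learn, covering a few of the metrics listed above (AUC, F1 score, confusion matrix); the labels and scores are toy data, and the PAI components compute these metrics internally.

```python
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]          # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]   # hard labels at a 0.5 threshold

print("AUC:", roc_auc_score(y_true, y_score))
print("F1 :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```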
Deep learning | PAI supports deep learning frameworks. You can use these frameworks and hardware resources to implement deep learning algorithms.
Time series | An ARIMA algorithm for seasonal adjustment based on the open-source X-13ARIMA-SEATS package.
Includes an automatic ARIMA model selection program, mainly based on the program by Gomez and Maravall (1998) implemented in TRAMO (1996) and subsequent revisions.
Performs Prophet time series prediction on each row of MTable data and provides prediction results for the next time period.
Aggregates a table into an MTable based on grouping columns.
Expands an MTable into a table.
Recommendation | The FM (Factorization Machine) algorithm takes into account the interactions between features. It is a nonlinear model suitable for recommendation scenarios in e-commerce, advertising, and live streaming.
The Alternating Least Squares (ALS) algorithm performs model decomposition on sparse matrices, evaluates the values of missing items, and obtains the basic training model.
An item recall algorithm. You can use it to measure item similarity based on the User-Item-User principle.
A batch processing prediction component for Swing. You can use it to perform offline prediction based on the Swing training model and prediction data.
etrec is an item-based collaborative filtering algorithm that takes two columns as input and outputs the top N similarities between items.
Calculates the hitrate of recall results. Hitrate measures result quality: a higher hitrate indicates that the vectors produced by training achieve more accurate recall results (a minimal hitrate sketch follows this group).
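A minimal sketch of a hitrate@K calculation as used to evaluate recall results; the data layout (ground-truth items and recalled items keyed by user) is an assumption for illustration.

```python
def hitrate_at_k(ground_truth, recalled, k=10):
    """Fraction of users for whom at least one true item appears in the top-k recall."""
    hits = 0
    for user, true_items in ground_truth.items():
        topk = recalled.get(user, [])[:k]
        if set(true_items) & set(topk):
            hits += 1
    return hits / len(ground_truth) if ground_truth else 0.0

truth = {"u1": ["i3"], "u2": ["i9"]}
recall = {"u1": ["i1", "i3", "i5"], "u2": ["i2", "i4"]}
print(hitrate_at_k(truth, recall, k=3))  # 0.5
```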
Outlier detection | Determines whether samples are abnormal based on the Local Outlier Factor (LOF) values of data samples.
Uses the sub-sampling algorithm, which reduces computational complexity. It can identify anomalous points in data and is widely applied in anomaly detection.
Different from traditional SVM, it is an unsupervised learning algorithm. You can use it to predict anomalous points by learning the boundary.
Natural Language Processing | Extracts, refines, or summarizes key information from lengthy and repetitive text sequences. News headline summarization is a special case of text summarization. You can use this component to call a specified pre-trained model to predict news text and generate news headlines.
Performs offline prediction with the generated machine reading comprehension training model.
Extracts, refines, or summarizes key information from lengthy and repetitive text sequences. News headline summarization is a special case of text summarization. You can use this component to train models that generate news headlines, which summarize the central ideas and key information of news articles.
Trains a machine reading comprehension model that quickly understands and answers questions based on given documents.
Based on the AliWS (Alibaba Word Segmenter) lexical analysis system. It performs word segmentation on the content of specified columns, with spaces separating the words after segmentation.
Converts a triple table (row, col, value) into a KV table (row, [col_id:value]).
A basic operation in the field of machine learning, mainly used in information retrieval, natural language processing, and bioinformatics.
Calculates string similarity and filters out the top N most similar data.
A pre-processing method in text analysis, used to filter noise in word segmentation results (such as "of", "is", or "ah").
A step in language model training. It generates n-grams based on words and counts the number of corresponding n-grams across the entire corpus.
A summary is a simple, coherent short text that comprehensively and accurately reflects the central idea of a document. Automatic text summarization uses computers to automatically extract summary content from original documents.
An important technology in natural language processing, specifically referring to extracting words that are strongly relevant to the meaning of the article from the text.
Splits text into sentences based on punctuation marks. It is primarily used for pre-processing before text summarization, converting a paragraph of text into a format where each sentence appears on a separate line.
Based on semantic vector results from algorithms (such as word embeddings generated by Word2Vec), it calculates extension words (or extension sentences) for given words (or sentences) by finding the set of vectors closest to a particular vector. One application is to return a list of the most similar words for an input word, based on word embeddings generated by Word2Vec.
Maps articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary.
A probability distribution model of a group of output random variables conditioned on a group of input random variables. Its characteristic is that it assumes the output random variables constitute a Markov random field.
Calculates the similarity between pairs of articles or sentences based on words, building upon string similarity.
Counts the co-occurrence of all words in multiple articles and calculates the PMI (pointwise mutual information) between each pair of words.
An algorithm component based on the linearCRF online prediction model, mainly used for processing sequence labeling problems.
Developed based on AliWS, it generates a word segmentation model based on parameters and custom dictionaries.
Based on input strings (manually entered or read from a specified file), it uses a program to count the total number of words in these strings and how many times each word appears.
A commonly used weighting technique for information retrieval and text mining. It is typically applied in search engines and can be used as a measure or rating of the relevance between documents and user queries (a minimal TF-IDF sketch follows this group).
Set the topic parameter for the PLDA component to abstract different topics from each document.
Uses neural networks to map words to vectors in K-dimensional space through training, and supports operations on the vectors that represent words while corresponding to semantics. The input is a word column or vocabulary, and the output is a word vector table and a vocabulary.
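A sketch of TF-IDF weighting with scikit-learn's TfidfVectorizer; the corpus is toy data, and the component's own tokenization and parameters may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning on the cloud",
    "deep learning for images",
    "cloud storage and cloud compute",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse document-term matrix

# Terms frequent within a document but rare across the corpus score highest.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```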
Network analysis | Outputs the depth and tree ID of each node.
Finds closely associated subgraph structures in a graph that meet the specified core degree. The maximum value of a node's core number is called the core number of the graph.
Uses the Dijkstra algorithm to generate the shortest paths between a given node and all other nodes.
Originated from web search ranking; it uses the link structure of web pages to calculate the ranking of each page (a minimal PageRank sketch follows this group).
A graph-based semi-supervised learning method. Its basic principle is that a node's label (community) depends on the label information of its adjacent nodes, with the degree of influence determined by node similarity, and stability is achieved through propagation and iterative updates.
A semi-supervised classification algorithm that uses the label information of labeled nodes to predict the label information of unlabeled nodes.
A metric used to evaluate community network structures, which assesses the cohesiveness of the communities identified in the network. Values above 0.3 generally indicate a relatively distinct community structure.
In an undirected graph G, if there is a path connecting vertex A to vertex B, A and B are said to be connected. Graph G contains several subgraphs. If all vertices within each subgraph are connected but no vertices between different subgraphs are connected, these subgraphs are called the maximal connected subgraphs of graph G.
Calculates the density around each node in an undirected graph G. The density of a star network is 0, and the density of a fully connected network is 1.
Calculates the density around each edge in an undirected graph G.
Outputs all triangles in an undirected graph G.
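A sketch of PageRank on a small directed graph using networkx; the graph and damping factor are illustrative, not what the component runs internally.

```python
import networkx as nx  # pip install networkx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")])

# alpha is the damping factor; 0.85 is the conventional default.
ranks = nx.pagerank(G, alpha=0.85)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```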
Financials | Performs normalization, discretization, indexation, or WOE conversion on data.
A commonly used modeling tool in the credit risk assessment field. It discretizes the original variables through binning, then uses linear models (such as logistic regression or linear regression) for training. It includes features such as feature selection and score transformation.
Scores the raw data based on the model produced by the scorecard training component.
Performs feature discretization, which segments continuous data into multiple discrete intervals. It supports equal-frequency binning, equal-width binning, and automatic binning.
An important indicator for measuring the shift caused by sample changes, commonly used to measure the stability of samples (a minimal PSI sketch follows this group).
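A minimal sketch of a Population Stability Index (PSI) calculation between an expected (baseline) sample and an actual (current) sample over shared bins; the equal-width binning strategy and the small epsilon that avoids division by zero are assumptions, not the component's exact rules.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI over equal-width bins derived from the expected (baseline) sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
current = rng.normal(0.2, 1.1, 5000)   # slightly shifted sample
print(round(psi(baseline, current), 4))
```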
Visual algorithms | Trains image classification models for inference.
Trains video classification models for inference.
Builds object detection models that identify and frame high-risk entities in images.
Directly trains raw unlabeled images to obtain a model for image feature extraction.
Builds a metric learning model for model inference.
If your business scenario involves human-related keypoint detection, you can use this component to build a keypoint model for inference.
Provides mainstream model quantization algorithms. You can use it to compress and accelerate models, achieving high-performance inference.
Provides the mainstream model pruning algorithm AGP (taylorfo). You can use it to compress and accelerate models, achieving high-performance inference.
Tools | A data structure stored in MaxCompute. Models generated by traditional machine learning algorithms based on the PAICommand framework are stored in offline model format in the corresponding MaxCompute project. You can use offline model-related components to obtain offline models for offline prediction.
Exports models trained in MaxCompute to a specified OSS path.
Custom scripts | Calls Alink's classification algorithms for classification, regression algorithms for regression, recommendation algorithms for recommendation, and more. The PyAlink script component also supports seamless integration with other algorithm components in Designer to build pipelines and verify their effectiveness.
Adds multi-date loop execution functionality on top of the regular SQL script component, used for parallel execution of daily SQL tasks within a specific time period.
Beta components | A compression estimation algorithm.
Supports both sparse and dense data formats. You can use this component to predict numeric variables, such as loan amounts and temperatures.
Predicts numeric variables, such as housing prices, sales volumes, and humidity.
The most commonly used regularization method for regression analysis of ill-posed problems (a regularized regression sketch follows this table).
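The last description above reads like ridge regression (L2-regularized linear regression); under that assumption, here is a sketch using scikit-learn's Ridge on synthetic data. It is an illustration, not the component's implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L2 penalty that stabilizes ill-posed problems.
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_.round(2))
```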