Creating a Task

Last updated: 2026-03-12 10:37:15

Creation Steps

Filling in the Basic Information

1. Log in to the Tencent Cloud TI-ONE Platform (TI-ONE) console, choose Training Workshop > Task-based Modeling, and click New to start creating a training task.
2. On the basic information configuration page, fill in the following information:
Task Name: Only Chinese characters, English letters, digits, underscores (_), and hyphens (-) are supported. It must start with a Chinese character, English letter, or digit.
Region: The region where the training task is located. It defaults to the region of the current list page.
Training image: You can choose from the platform's built-in training images, custom images, or built-in LLMs. For the list of built-in training images, see Built-in Training Image List. Custom images support selecting images from Tencent Container Registry or filling in an external image address; private images require a username and password. For custom image specifications, see Specifications for Custom Training Images. For instructions on using built-in LLMs for training, see Fine-Tuning Built-in Open-Source LLMs.
Training Mode: For the training modes supported by different training frameworks, see Built-in Training Image List.
CVM Instance Source and Resource Request: You can choose to select from CVM instances or purchase from the TI-ONE platform. If you choose to purchase from the TI-ONE platform, you need to select the CVM instance specifications and the number of nodes. If you choose to select from CVM instances, you need to first create a resource group and purchase nodes. After selecting the resource group, select the corresponding computing resources. For the billing specifications supported by the platform, see Billing Overview.
Use RDMA: When configuring full GPU cards and selecting more than 1 node, you can configure whether to enable RDMA. RDMA is enabled by default for multi-node tasks configured with full 8 GPU cards per node. When RDMA is enabled, tasks are preferentially scheduled to nodes that support RDMA. When there are insufficient IP addresses for RDMA NICs on a single node, tasks may be scheduled to multiple nodes.
Note:
1. Once a resource group is selected, an overview of the remaining GPUs in the resource group will be displayed. The displayed information includes the total number of GPU cards of each GPU model, the number of full GPU cards, and the number of non-full GPU cards (fragmented GPU cards). This helps you quickly understand the GPU distribution in the selected resource group. Based on the current task scenario, you can use full resources or non-full (fragmented) resources to effectively reduce overall resource fragmentation and improve the overall utilization of GPUs.
2. Click View Details to display a detailed resource dashboard on the right side of the current page. This dashboard displays the remaining available resources and total resources for each GPU card type. Click the drop-down icon to display all running tasks/services on the current node. This helps you quickly understand resource usage and facilitate resource usage coordination among teams.
Description: You can add a description of up to 500 characters for remarks.
Cloud Log Service (CLS) log shipping: CLS log shipping is disabled by default, and the TI-ONE console retains logs for 15 days. If you need persistent log storage and services such as log search, enable CLS log shipping to ship logs to CLS (ensure that CLS has been activated first).
Automatic restart: You can configure an automatic restart policy for the task. You need to set the maximum number of restart attempts, up to 10. If the number of restarts exceeds this limit, the task will be marked as abnormal. The trigger condition for automatic restart is an abnormal exit during task execution. This feature currently only supports training tasks with the billing mode set to yearly/monthly subscription and in the MPI, DDP, or Horovod training mode. Event information for automatic restarts can be viewed by selecting Events on the Task Details page.
Health check:
You can choose to enable the health check for the task. Once this feature is enabled, it will take some time to perform health checks on the nodes scheduled for the task. The check is performed before task execution, including both manual starts and automatic restarts. If the health check fails or times out, the task will be terminated, and any occupied resources will be released. Currently supported check items include NCCL network check and slow node check. The conditions for the checks are that the requested resources in the task include full GPU cards, and the training mode is DDP, MPI, or Horovod. An additional condition for the slow node check is that the number of nodes configured for the task is at least 2.
You can also configure the maximum check duration for the health check. When the check time exceeds this duration, and the check task has not ended normally, the training task will be terminated, and the occupied resources will be released. During the health check, you can choose Task Details > Logs and set the log type to Platform Logs (Health Check) to view the check logs and data.

Filling in Information for Task Configuration

On the task configuration page, you need to configure the algorithm, data, and input and output information about this training task. The configuration items are described as follows:
1. Storage path settings: Supported storage types include data sets, data sources, COS, CFS (including CFS Turbo), GooseFSx, GooseFS, and EMR (HDFS). GooseFS and GooseFSx are only supported for resource groups selected from CVM instances. For each configured storage path, its purpose can be selected, including self-built models, self-built code, self-built data, training data, and others. When the storage type is set to CFS, the purpose can also be set to the built-in models of the platform, allowing direct mounting of models from TI-ONE's built-in CFS to the training container.
If you select the data set type, you need to first create a data set by choosing Data Center > Data Set Management on the platform.
If you select the data source type, you need to first create a data source by choosing Platform Management > Data Source Management. Note: The mount permissions for data sources are divided into read-only mount and read-write mount. For data sources that are intended to receive training output results, configure read-write mount permissions.
If you select COS, you need to select the COS path where the data is located.
If you select CFS, GooseFSx, GooseFS, or EMR (HDFS), you need to select the CFS instance, GooseFSx instance, GooseFS cluster, or EMR cluster from the drop-down list and fill in the data source directory that you want the platform to mount.
For the above-mentioned data sources, you can define the mount path of the data within the training container during configuration. You need to fill in this path in your code to obtain the data. When creating a task, you can select multiple data sets or data paths, set different container mount paths respectively, and mount them to the container for the training algorithm to read.
Must-knows for using EMR (HDFS): The platform will access and mount HDFS using the Hadoop identity by default. To use another identity, upload the relevant configuration file in accordance with the following specifications for the code package.
The username and keytab file are provided by the user and placed in the code package.
Specifications for the code package: /<emr_id>/username.txt. (The content is the username, such as hadoop/172.0.1.5. When Kerberos authentication is not enabled, the default username hadoop can be used if the file does not exist or is empty. After Kerberos authentication is enabled, the default username becomes unavailable.)
/<emr_id>/emr.keytab (The content is the keytab authentication file.) (Since the platform supports multiple authentications for multiple EMR instances, you need to add <emr_id> to the directory, with a value such as emr-1rnhggsh.)
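The code-package layout described above can be sketched as follows. `prepare_emr_auth` is a hypothetical helper name; the EMR ID and username are the illustrative values from the specification:

```python
import os

def prepare_emr_auth(code_dir, emr_id, username, keytab_bytes):
    """Lay out the HDFS auth files inside the code package:
    <code_dir>/<emr_id>/username.txt  - the username (e.g. hadoop/172.0.1.5)
    <code_dir>/<emr_id>/emr.keytab    - the keytab authentication file
    """
    auth_dir = os.path.join(code_dir, emr_id)
    os.makedirs(auth_dir, exist_ok=True)
    with open(os.path.join(auth_dir, "username.txt"), "w") as f:
        f.write(username)
    with open(os.path.join(auth_dir, "emr.keytab"), "wb") as f:
        f.write(keytab_bytes)
    return auth_dir

# One directory per EMR instance, keyed by its ID.
auth_dir = prepare_emr_auth("./code_pkg", "emr-1rnhggsh",
                            "hadoop/172.0.1.5", b"<keytab bytes>")
```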
2. Git storage: You can select a Git repository and configure a storage path within the container. When the task starts, the files in the repository will be downloaded to the specified storage path. You need to choose Training Workshop > Git Repositories to create a repository first.
3. Startup command: You need to fill in the program entry command, which supports multiple lines. The default working directory is /opt/ml/code.
4. Training output: Select the COS path where you need to save the training output. By default, the platform will regularly upload the data in the /opt/ml/output path to the output COS path. To release the trained model to the model repository with a single click, you need to save the model output to the /opt/ml/model path. The platform will upload the data in this path to the COS path after the training is completed. If you set training storage to the file system, such as CFS, you can also choose not to configure the training output and directly write the training output to the mounted CFS file path.
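As a sketch of the output convention above (the helper name and checkpoint filename are illustrative): files under /opt/ml/output are uploaded periodically, while the final model should be written under /opt/ml/model for one-click release.

```python
import os

OUTPUT_DIR = "/opt/ml/output"  # synced to the output COS path periodically
MODEL_DIR = "/opt/ml/model"    # uploaded to COS after training completes

def model_save_path(filename, model_dir=MODEL_DIR):
    """Return a path under the model directory, creating it if needed."""
    os.makedirs(model_dir, exist_ok=True)
    return os.path.join(model_dir, filename)

# e.g. torch.save(model.state_dict(), model_save_path("model.pt"))
```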
5. Tuning Parameter: The hyperparameter JSON you fill in is saved as the /opt/ml/input/config/hyperparameters.json file. You need to parse this file in your code.
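A minimal sketch of parsing this file in your code (the fallback defaults are assumptions; values may be delivered as strings, so cast defensively):

```python
import json
import os

HP_PATH = "/opt/ml/input/config/hyperparameters.json"

def load_hyperparameters(path=HP_PATH):
    """Read the tuning-parameter JSON; return {} if it is absent."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

params = load_hyperparameters()
# Values may arrive as strings, so cast explicitly before use.
epochs = int(params.get("Epoch", 2))
learning_rate = float(params.get("LearningRate", 1e-5))
```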
6. SSH Connection: You can choose whether to enable SSH connection (currently only supported when the CVM instance source is set to Select from CVM Instances). After enabling SSH connection, you can access this instance from other CVM instances. Fill in the content of the ~/.ssh/id_rsa.pub file on the CVM instance that initiates the SSH login; if the file does not exist, generate it with the ssh-keygen command. When initiating an SSH login, make sure the private key matches the configured public key. To initiate SSH logins from multiple CVM instances, enter multiple public keys, pressing Enter between them.
7. Add Port: If you need to access services started in dev machines externally, you can enable the feature for custom port configuration (currently only supported when the CVM instance source is set to Select from CVM Instances):
Note: Ensure that the resource group and the Cloud Load Balancer (CLB) instance belong to the same VPC network.
Service Name: Enter a name to distinguish between different custom services started in dev machines.
Access Protocol: Configure an access protocol, which can be TCP or UDP.
Listening Port: The container port on which the custom service process in the dev machine listens; it receives network requests sent from outside. Ports 1 to 65535 are supported.
Method for Service Access:
For SSH connections, access via Pod IP addresses within the VPC network is supported by default. The access address can be viewed on the details page after instance creation. You can choose to perform port mapping through the CLB access method.
For other custom services, the CLB access method is selected by default for port mapping.
Access Port: If you select CLB access, you need to add a mapping port. After the port is filled in, a listener will be created under the selected CLB instance, and a port will be allocated. Be careful not to fill in a port that is already occupied by a listener. Ports 1 to 65535 are supported.
Select CLB: After selecting CLB access, you need to select a CLB instance under your account. If no instance is available under the current account, you can create an instance in the CLB console.
After the instance is created, the custom port information and access address can be viewed on the dev machine details page.

Additionally, during task configuration, the price of your current configuration is displayed in real time at the bottom of the page. Once all information is configured, you can create the task.

Description of the Preset Process for Built-in LLMs

Task-based Modeling includes multiple fine-tuning templates for LLMs. You can start tasks for fine-tuning built-in LLMs with a single click. For detailed best practices, see Fine-Tuning Built-in Open-Source LLMs. The following are descriptions of built-in fields:

Preset Storage Path Settings

First line (Platform CFS): By default, the system has configured the supporting training code for fine-tuning the LLM.
Second line (Platform CFS): By default, the system has configured a set of sample data for fine-tuning the LLM. To fine-tune the LLM with your own business data, delete this line and add other storage sources at the bottom.
Third line (Platform CFS): By default, the system has configured a built-in model of the platform.
Fourth line (User CFS): Select your own CFS instance and source path here. The Container Mount Path is automatically populated by the system and requires no modification. To use a different CFS instance for the training output, delete this line and add a new one.
Note:
If you use your own business data for fine-tuning, use the format specified by the platform, or follow the description in llamafactory's dataset_info.json data configuration file. For details, see the llamafactory documentation.
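For illustration, one common llamafactory data layout is the alpaca-style instruction format, registered through dataset_info.json. The dataset name, file name, and sample content below are invented:

```python
import json

# A hypothetical alpaca-style sample (instruction / input / output fields).
samples = [
    {
        "instruction": "Summarize the following text.",
        "input": "TI-ONE supports fine-tuning built-in open-source LLMs.",
        "output": "TI-ONE can fine-tune built-in open-source LLMs.",
    }
]

# dataset_info.json maps a dataset name to its data file.
dataset_info = {"my_business_data": {"file_name": "my_business_data.json"}}

with open("my_business_data.json", "w") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
with open("dataset_info.json", "w") as f:
    json.dump(dataset_info, f, indent=2)
```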

Preset Startup Command

The platform populates the startup command by default. Generally, you do not need to modify the startup command.

Preset Parameters for Tuning

The platform provides multiple preset parameters. You can directly modify the hyperparameter JSON to iterate the model. The hyperparameters are described below (name, default value, description):

Stage (default: sft): The training mode, which can be sft, pt, or dpo. Currently, the Stage parameter is only supported by Qwen3 series models.
Epoch (default: 2): Number of training epochs.
BatchSize (default: 1): Number of samples processed in each training step. A larger BatchSize can speed up training but also increases memory usage.
LearningRate (default: 1e-5): A hyperparameter for updating weights during gradient descent. A value that is too high may prevent the model from converging, while a value that is too low slows down convergence.
Step (default: 750): Interval (in steps) at which a model checkpoint is saved. Saving checkpoints more frequently requires more storage space.
UseTilearn (default: True): Whether to enable Tencent's self-developed acceleration framework. Valid values are true and false. When set to true, Tencent's self-developed acceleration framework is used for training; 3D parallel acceleration requires 8 or more GPU cards and requires configuring the pipeline parallelism (PP) and tensor parallelism (TP) parameters. For details, see the angel-tilearn documentation. When set to false, the open-source acceleration framework is used. This parameter is only available for certain models.
FinetuningType (default: Lora): When Stage is set to sft, you can customize the fine-tuning mode, which can be LoRA or Full. In LoRA mode, the parameters of the pre-trained LLM are kept fixed and low-rank decomposition is applied to the weight matrices, so only the low-rank parameters are updated during training. In Full mode, all model parameters are updated during fine-tuning, which requires more training resources.
MaxSequenceLength (default: 2048): Maximum text sequence length; configure it based on the length of your business data. For example, if most business data is less than 2048 characters long, setting MaxSequenceLength to 2048 truncates any longer data to 2048, reducing GPU memory pressure.
GradientAccumulationSteps (default: 1): A Hugging Face Trainer parameter. Gradients are accumulated over this many steps before each weight update, so the effective batch size becomes BatchSize × GradientAccumulationSteps.
GradientCheckPointing (default: True): A Hugging Face Trainer parameter that trades time for GPU memory. Enabling it reduces GPU memory usage but slows down training.
DeepSpeedZeroStage (default: z3): DeepSpeed ZeRO stage configuration. The optional values are z0, z2, z2_offload, z3, and z3_offload. This parameter is only available for certain models.
ResumeFromCheckpoint (default: True): Whether to automatically resume training from an existing checkpoint. When True, if a checkpoint exists in the output directory, training resumes from the latest checkpoint. When False, training restarts from scratch; in that case, if the output directory is not empty, an error occurs, so an empty output directory is recommended. To force overwriting, manually add the parameter "overwrite_output_dir": true.
TilearnHybridTPSize (default: 1): A tilearn 3D parallelism parameter that specifies the tensor parallelism (TP) dimension. This parameter is only available for certain models.
TilearnHybridPPSize (default: 1): A tilearn 3D parallelism parameter that specifies the pipeline parallelism (PP) dimension. This parameter is only available for certain models.
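For example, a tuning-parameter JSON that overrides a few of the defaults above might look like this (the values are illustrative only):

```python
import json

# Illustrative overrides of the preset defaults documented above.
hyperparameters = {
    "Epoch": 3,
    "BatchSize": 2,
    "LearningRate": 2e-5,
    "FinetuningType": "Lora",
    "MaxSequenceLength": 4096,
}
print(json.dumps(hyperparameters, indent=2))
```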