The dataset list page provides unified management of all datasets under your account. It displays key dataset information, supports operations such as annotation, deletion, and schema configuration, and lets you click a dataset name to go to the dataset details page.
List Fields
The fields displayed on the dataset list page and their meanings are explained as follows:
Name: displays the custom name of the dataset. Click the name to go to the dataset details page.
Annotation Progress/Total Dataset Size:
Datasets for Large Language Modeling: Each complete sample, as determined by the schema parsing rules, is counted as 1. Samples labeled in the labeling workbench are counted as "labeled".
Datasets for Traditional CV Modeling (Images): The total is the number of images under the specified path. Images with labeling results are counted as "labeled".
Datasets for Big Data Modeling (Tables): The total is the number of rows.
Other: The total is the number of files under the specified path.
Modeling Task Type: the modeling task type specified when the dataset was created.
Dataset Tag:
Datasets for Large Language Modeling: the custom tag content defined on the configuration page when the dataset was created.
Datasets for Traditional CV Modeling: By default, a tag is added according to the image scenario to which the dataset belongs. Possible values are image classification, image detection, image segmentation, image object tracking, and OCR.
Tag: displays the Tencent Cloud CAM tag selected during dataset import.
Status:
Datasets for Large Language Modeling
After it is created successfully, the dataset enters the Available status. However, you must complete schema configuration before you can view dataset details or perform labeling operations.
Datasets for Traditional CV Modeling
Import XX%: displays the progress of dataset import in real time in percentage form, from when you click Confirm on the import page until the dataset is successfully imported.
Available: The dataset enters the Available status after it is imported successfully or after the data source is synchronized successfully. It remains in the Available status if an annotation, download, or new version release operation fails. If data source synchronization fails, the dataset enters the Failed status.
Failed: The import or data source synchronization failed. The detailed failure reason is displayed in a floating window.
Unavailable: The dataset is in the Unavailable status when it is being deleted.
Creation Time: records the time when the dataset was created. The list can be sorted by this field in ascending or descending order.
Operations: The available operations are described in the sections that follow.
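The status rules listed under Status for traditional CV datasets can be summarized as a small state machine. The following is a minimal Python sketch of those rules only. The event names (such as `import_ok` and `sync_failed`) and the `next_status` function are hypothetical labels chosen for illustration and are not part of any TI-ONE API. Transitions that the text does not state, such as recovery from Failed by retrying, are left out.

```python
from enum import Enum


class Status(Enum):
    IMPORTING = "Import XX%"
    AVAILABLE = "Available"
    FAILED = "Failed"
    UNAVAILABLE = "Unavailable"


# Hypothetical event names; operations whose failure does not change the status.
NON_FATAL_FAILURES = {"annotation_failed", "download_failed", "release_failed"}


def next_status(current: Status, event: str) -> Status:
    """Return the status after an event, following the rules in the field list."""
    if event == "delete_started":
        return Status.UNAVAILABLE      # a dataset being deleted is Unavailable
    if event in ("import_ok", "sync_ok"):
        return Status.AVAILABLE        # successful import or data source sync
    if event in ("import_failed", "sync_failed"):
        return Status.FAILED           # the reason is shown in a floating window
    if event in NON_FATAL_FAILURES:
        return current                 # the dataset stays Available
    raise ValueError(f"unknown event: {event}")
```

For example, `next_status(Status.AVAILABLE, "download_failed")` leaves the dataset in `Status.AVAILABLE`, whereas `next_status(Status.AVAILABLE, "sync_failed")` moves it to `Status.FAILED`.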
Operations - Annotation
This operation creates a data labeling task from the dataset with a single click.
For a dataset for large language modeling, after you click Confirm, the platform automatically creates the corresponding labeling task and redirects you to the labeling workbench once the task is created.
For a dataset for traditional CV modeling, after you click Confirm, the platform redirects you from the current page to the Data Center > Data Annotation > Create Annotation Task configuration page. The dataset is selected by default and cannot be changed.
Note
Only one labeling task can be created for a dataset at a time.
Only users with write permissions on the corresponding Cloud Object Storage (COS) or Cloud File Storage (CFS) path of this dataset can create data labeling tasks using this dataset.
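The two constraints in the note, together with the per-type behavior after Confirm, can be sketched as a precondition check. This is an illustrative Python model only. The `Dataset` fields, the `start_annotation` function, and the returned strings are hypothetical and do not correspond to a TI-ONE API. The sketch also assumes that "one labeling task at a time" means a dataset cannot have a second task while one is active.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    kind: str                      # "llm" or "cv" (hypothetical labels)
    has_active_task: bool = False  # assumption: at most one active task


def start_annotation(dataset: Dataset, user_can_write: bool) -> str:
    """Return where the user lands after clicking Confirm (illustrative)."""
    if not user_can_write:
        raise PermissionError("write permission on the COS/CFS path is required")
    if dataset.has_active_task:
        raise RuntimeError("only one labeling task per dataset at a time")
    if dataset.kind == "llm":
        # The platform creates the task automatically, then opens the workbench.
        dataset.has_active_task = True
        return "labeling workbench"
    # For CV datasets the task is configured on a separate page first.
    return "Create Annotation Task page"
```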
Operations - Deletion
LLM datasets: Deleting the dataset removes only the dataset record from Tencent Cloud TI-ONE Platform (TI-ONE). The original data files stored on CFS are not affected.
Non-LLM datasets:
The backend unbinds the corresponding COS path from the dataset on TI-ONE and deletes the dataset record.
(Optional) When the dataset is deleted, the backend automatically cleans up the files in the COS bucket under the output path defined by the dataset. Only files under the output path are cleaned up. Files in the original input path are not affected.
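The deletion rules above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the platform's implementation: `DatasetRecord`, `delete_dataset`, and the `clean_output` flag are hypothetical names, and a plain dictionary keyed by object path stands in for the COS bucket.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    kind: str              # "llm" or "non-llm" (hypothetical labels)
    input_path: str
    output_path: str = ""


def delete_dataset(records, storage, name, clean_output=False):
    """Remove the dataset record; optionally clean files under the output path."""
    record = records.pop(name)   # the record is always removed from the list
    if record.kind != "llm" and clean_output and record.output_path:
        prefix = record.output_path.rstrip("/") + "/"
        for key in [k for k in storage if k.startswith(prefix)]:
            del storage[key]     # only output files; input files are untouched
    return record
```

With `clean_output=False`, or for an LLM dataset, the stored files are left exactly as they were and only the record disappears.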
Operations - Schema Configuration
This operation is available only for datasets for large language modeling. It lets you describe complex LLM and MLLM data content by customizing the schema information of the dataset. For schema configuration rules, see Detailed Syntax of Schema Configuration. The left side of the configuration page displays part of your original data for reference while you edit the schema configuration on the right side. You can click Next: Preview Labeling Workbench to check the effect of the schema in real time and confirm that the configuration meets expectations. If the labeling workbench does not look as expected, click Previous: Basic Information to go back and modify the configuration.
After the schema is configured successfully, the page automatically returns to the dataset list, and the platform parses all samples in the data files based on the configured schema. You can click View Progress in the Status column on the list page to view the progress of full parsing in real time.
Dataset Details Page
Datasets for Large Language Modeling
Click the name of a dataset for large language modeling to view the dataset details. The content is displayed according to the configured schema. For multimodal datasets, each image and its corresponding text are displayed in the same row to make data samples easier to read.
Datasets for Traditional CV Modeling
Note
For the Detail Preview and Data Pivot features on the dataset details page, COS provides 10 TB of free detail preview traffic per account each month. Any usage beyond this limit incurs charges. For details, see COS Billing Rules.
Clicking the name of the image dataset allows you to view the dataset details. The details page contains three main sections:
Basic Information: This section displays the key information of the dataset.
Visualization of Annotation Information: This module is displayed only if the dataset is bound with labeling information for image classification, object detection, or image segmentation. In other scenarios, it is automatically hidden. Note: The backend tracks statistics for at most 20 tag values (displayed as the top 20 by proportion). Any categories beyond the top 20 are grouped into the Other category.
Details Display: This section lets you preview a list of the first 2,000 images in the dataset. You can also filter the images by labeling status and by labeling category.
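The top 20 rule for tag statistics can be sketched as follows. This is a minimal Python illustration of the grouping described above, not the platform's code. The `top_categories` function and its `limit` parameter are hypothetical, and the sketch assumes that no real category is itself named "Other".

```python
from collections import Counter


def top_categories(labels, limit=20):
    """Return (category, proportion) pairs, folding the tail into "Other"."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = counts.most_common()              # sorted by count, descending
    result = [(name, n / total) for name, n in ranked[:limit]]
    rest = sum(n for _, n in ranked[limit:])   # everything beyond the top `limit`
    if rest:
        result.append(("Other", rest / total))
    return result
```

With 25 distinct categories and the default limit, the result contains 21 entries: the 20 most frequent categories and one Other entry that sums the remaining five.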
Table Datasets for Big Data
Clicking the name of a table dataset allows you to view the dataset details. The details page contains two main sections:
Basic Information: This section displays the key information of the dataset.
Details Display: This section lets you preview the first 2,000 rows of table content in the dataset. For enumerable columns, you can click the column to view its data distribution. The statistical analysis covers the entire dataset, not just the 2,000 rows previewed on the frontend.
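The difference between the preview scope and the statistics scope can be illustrated with a short Python sketch. It assumes, for illustration only, that the table is available as CSV text with a header row. The `preview_and_distribution` function and its parameters are hypothetical and are not a TI-ONE API.

```python
import csv
import io
from collections import Counter

PREVIEW_ROWS = 2000  # preview limit stated in the text


def preview_and_distribution(csv_text, column, preview_rows=PREVIEW_ROWS):
    """Return (preview rows, value counts over ALL rows) for one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    preview, counts = [], Counter()
    for i, row in enumerate(reader):
        if i < preview_rows:
            preview.append(row)      # only the first rows are shown
        counts[row[column]] += 1     # the distribution covers every row
    return preview, counts
```

The preview list never grows past `preview_rows`, while the counter keeps accumulating until the last row, which mirrors the behavior described for enumerable columns.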
Other Types of Datasets
Clicking the names of other types of datasets takes you to the dataset details page, which only displays the basic information of the dataset. Since there are no restrictions on the import format for this type of dataset, the details page does not support content preview.