
Introduction

This guide is your definitive resource for integrating and utilizing the Syngen Toolkit within ELITEA. It provides a comprehensive, step-by-step walkthrough — from uploading your training data to configuring the toolkit in ELITEA and effectively using it within your Agents, Pipelines, and Chat. By following this guide, you will unlock AI-driven synthetic tabular data generation capabilities directly within the ELITEA platform, enabling you to create realistic, privacy-preserving datasets without exposing real production data, and to make high-quality synthetic data part of your development, testing, and machine-learning workflows.

Brief Overview of the Syngen Toolkit

The Syngen Toolkit brings powerful unsupervised synthetic data generation to ELITEA. It uncovers the patterns, trends, and correlations hidden within any tabular dataset you provide and reproduces them in a fully synthetic form — meaning the generated data is statistically similar to your source data but contains no real records. The result is a referentially intact dataset that can replace production data exports in less protected environments, eliminating the need for manual data classification or obfuscation.

Key capabilities include:
  • Model Training: Learn the statistical structure of any source tabular dataset and produce a reusable trained model, stored as an artifact in your designated ELITEA Artifacts bucket.
  • Synthetic Data Generation: Use a previously trained model to generate any desired number of synthetic rows, with optional reproducibility via a random seed.
  • Model Inspection: List all trained models registered in the bucket, including column metadata, training settings, and creation timestamps.
  • Broad Data Type Support: Floats, integers, datetime, text, categorical, and binary columns are all handled automatically through dedicated per-column-type neural modules.
  • Artifact-Backed Storage: All trained model archives and generated CSV outputs are persisted in your ELITEA Artifacts bucket, making them shareable and reusable across agents and pipeline runs.
Typical Use Cases:
  • Generating safe, privacy-preserving test data for QA and staging environments.
  • Augmenting small datasets for machine learning model development.
  • Producing synthetic datasets for demos, prototyping, and load testing.
  • Replacing production data exports with statistically equivalent synthetic versions.

System Integration with ELITEA

Unlike most ELITEA toolkits, the Syngen Toolkit does not require external API credentials or a separate credential creation step. Authentication is handled entirely within the platform — the toolkit uses your current ELITEA session to connect to the platform’s own Artifacts storage. The integration is a straightforward three-step process: Upload Training Data → Create Toolkit → Use in Agents, Pipelines, or Chat.

Step 1: Upload Your Training Data to an Artifacts Bucket

Before creating the toolkit, your source data file must be available in an ELITEA Artifacts bucket.
  1. Navigate to Artifacts: Open the sidebar and select Artifacts.
  2. Create or select a bucket: Either create a new bucket (for example, syngen-data) or use an existing one. Note the exact bucket name — you will need it when configuring the toolkit.
  3. Upload your training file: Upload your source CSV or Avro file to the bucket. Note the exact file name (for example, customers.csv).
Although the syngen library defines Excel (.xlsx, .xls) as a supported format, Excel files will fail during training due to a known bug in the underlying library. If your source data is in Excel format, please convert it to CSV before uploading to the Artifacts bucket. CSV and Avro are the only fully supported training file formats at this time.

Step 2: Create the Syngen Toolkit

Once your training data is uploaded to the Artifacts bucket, create the Syngen Toolkit:
  1. Navigate to Toolkits Menu: Open the sidebar and select Toolkits.
  2. Create New Toolkit: Click the + Create button.
  3. Select Syngen: Choose SyngenToolkit from the list of available toolkit types.
  4. Configure Toolkit Settings:
    | Field | Description | Example |
    | --- | --- | --- |
    | Toolkit Name | Descriptive name for your toolkit | Syngen - Customer Data Generator |
    | Description | Optional description for the toolkit | Generates synthetic customer records from the customers.csv template |
    | LLM Model | The language model used to process tool requests and format responses | Select from available models |
    | Bucket Name | Artifacts bucket name for storing training data and generated models/outputs | syngen-data |
  5. Enable Desired Tools: In the “Tools” section, select the checkboxes next to the specific Syngen tools you want to enable. Enable only the tools your agents will actually use.
    • Make Tools Available by MCP (optional): Enable this option to make the selected tools accessible through external MCP clients.
  6. Save Toolkit: Click Save to create the toolkit.

Available Tools:

The Syngen Toolkit provides the following three tools:
| Tool Name | Description | Primary Use Case |
| --- | --- | --- |
| Train model | Trains a synthetic data model on source data from the bucket. The trained model is packaged and saved back to the bucket. | Train a new model from a template dataset |
| Generate data | Generates synthetic data using a previously trained model stored in the bucket. Returns a CSV file with the generated rows. | Produce synthetic rows from an existing model |
| List models | Lists all trained models in the bucket’s registry, including their column names, training settings, and timestamps. | Inspect what models are available for generation |
All three tools run as background (asynchronous) operations. They will display a thinking progress indicator in ELITEA Chat and Agents while the model trains or generates data. Training progress updates — such as the current epoch and loss value — are streamed back to the UI in real time so you can follow along as work progresses.

Tool Parameters

Train model
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| Model Name | ✔️ Yes | — | Unique name for the model. Used to identify and retrieve the model later. Example: customer_model |
| Training File Name | ✔️ Yes | — | File name of the training data file in the bucket. Example: customers.csv |
| Batch Size | No | 32 | Training batch size. Reduce this to save memory on large datasets |
| Drop Null | No | False | If enabled, rows containing any missing values are dropped before training |
| Epochs | No | 10 | Number of training epochs. Higher values generally improve quality |
| Row Limit | No | (all rows) | Maximum number of rows to use for training. The toolkit takes the first N rows from the file (top-down truncation). Useful for large datasets or quick tests |
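The top-down truncation performed by Row Limit can be sketched with the standard library. The helper below is illustrative only, not the toolkit's internal code:

```python
import csv
import io
from itertools import islice

def read_training_rows(csv_text, row_limit=None):
    """Read rows from CSV text, keeping only the first `row_limit`
    data rows (top-down truncation), the way the Row Limit
    parameter is described to behave."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # islice(reader, None) iterates everything, so row_limit=None keeps all rows
    rows = list(islice(reader, row_limit))
    return header, rows

sample = "id,name\n1,Ana\n2,Bo\n3,Cy\n"
header, rows = read_training_rows(sample, row_limit=2)
# Only the first 2 data rows are used; the remainder is ignored
```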
Syngen internally slugifies the model name before using it as a directory: underscores are converted to hyphens and the name is lowercased (for example, Customer_Model becomes customer-model). The toolkit handles this automatically when packaging and retrieving models. However, always use the exact name you specified in model_name when calling generate_data — the toolkit resolves the normalized form for you.
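The normalization described above can be sketched as follows. This is a minimal approximation; the real syngen slugifier may handle additional characters:

```python
def slugify_model_name(name: str) -> str:
    """Approximate the model-name normalization described above:
    lowercase the name and convert underscores to hyphens.
    (Sketch only; syngen's actual slugifier may do more.)"""
    return name.lower().replace("_", "-")

slugify_model_name("Customer_Model")  # "customer-model"
```

Two distinct-looking names that slugify to the same string (for example, Customer_Model and customer-model) will collide in storage, which is why distinct model names are recommended.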
Generate data
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| Model Name | ✔️ Yes | — | Name of the trained model to use. Must exactly match the name used during training |
| Batch Size | No | 32 | Generation batch size for memory management |
| Number Of Avro Preview Rows | No | (none) | Number of rows to preview when the output is in Avro format |
| Random Seed | No | (none) | Set a seed for fully reproducible generation results. Must be 0 or greater |
| Size | No | 100 | Number of synthetic rows to generate |
List models

No parameters required. Returns a formatted list of all models registered in the configured bucket.
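For quick reference, the parameter sets above can be written out as plain Python dictionaries. These are illustrative only (names and defaults follow the tables above); actual invocation happens through the ELITEA agent, pipeline, or chat interface:

```python
# Illustrative parameter sets for the three tools, mirroring the tables above.
train_model_params = {
    "model_name": "customer_model",         # required
    "training_file_name": "customers.csv",  # required
    "batch_size": 32,                       # optional, default 32
    "drop_null": False,                     # optional, default False
    "epochs": 10,                           # optional, default 10
    "row_limit": None,                      # optional, default: all rows
}

generate_data_params = {
    "model_name": "customer_model",  # must match the name used during training
    "size": 100,                     # optional, default 100
    "batch_size": 32,                # optional, default 32
    "random_seed": 42,               # optional; must be 0 or greater
}

# list_models takes no parameters.
```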

Testing Toolkit Tools

After configuring your Syngen Toolkit, you can test individual tools directly from the Toolkit detail page using the Test Settings panel.

General Testing Steps:
  1. Select LLM Model: Choose a Large Language Model from the model dropdown in the Test Settings panel.
  2. Configure Model Settings: Adjust model parameters (Creativity, Max Completion Tokens, etc.) as needed.
  3. Select a Tool: Choose train_model, generate_data, or list_models from the available tools list.
  4. Provide Input: Enter the required parameters (for example, model_name and training_file_name).
  5. Run the Test: Execute the tool and wait for the response.
  6. Review the Response: Analyze the output to verify the tool is working correctly.
Recommended checks:
  • Verify that the Artifacts bucket name is correct and accessible before integrating with an agent.
  • Confirm the training file exists in the bucket before running a full agent workflow.
  • Test generation parameters and preview the synthetic output format.
For detailed instructions on how to use the Test Settings panel, see How to Test Toolkit Tools.

Step 3: Add the Syngen Toolkit to Your Workflows

Once the toolkit is saved, you can add it to your agents, pipelines, or chat conversations.

In Agents:

  1. Navigate to Agents: Open the sidebar and select Agents.
  2. Create or Edit Agent: Either create a new agent or select an existing agent to edit.
  3. Add Syngen Toolkit:
    • In the “TOOLKITS” section of the agent configuration, click the “+Toolkit” icon.
    • Select your configured Syngen toolkit from the dropdown list.
    • The toolkit will be added to your agent with the previously enabled tools available.
Your agent can now train models and generate synthetic data as part of automated workflows.

In Pipelines:

  1. Navigate to Pipelines: Open the sidebar and select Pipelines.
  2. Create or Edit Pipeline: Either create a new pipeline or select an existing pipeline to edit.
  3. Add Syngen Toolkit:
    • In the “TOOLKITS” section of the pipeline configuration, click the “+Toolkit” icon.
    • Select your configured Syngen toolkit from the dropdown list.
    • The toolkit will be added to your pipeline with the previously enabled tools available.

In Chat:

  1. Navigate to Chat: Open the sidebar and select Chat.
  2. Start New Conversation: Click +Create or open an existing conversation.
  3. Add Toolkit to Conversation:
    • In the chat Participants section, look for the Toolkits element.
    • Click the “Add Tools” icon to open the tools selection dropdown.
    • Select your configured Syngen toolkit from the dropdown list.
    • The toolkit will be added to your conversation with all previously enabled tools available.
  4. Use Toolkit in Chat: You can now request synthetic data generation directly by prompting the AI. Example prompts:
  • “Train a model called customer_model using customers.csv with 20 epochs.”
  • “Generate 500 rows of synthetic customer data using customer_model.”
  • “List all trained models available in the bucket.”
  • “Generate a reproducible dataset of 200 rows using customer_model with random seed 42.”

Instructions and Prompts for Using the Syngen Toolkit

To effectively instruct your ELITEA Agent to use the Syngen Toolkit, provide clear and precise instructions within the Agent’s “Instructions” field. These guide the Agent on when and how to invoke the available Syngen tools.

Instruction Creation for Agents

When crafting instructions for the Syngen Toolkit, clarity and parameter precision are essential. All three tools run as background operations, so instructions should account for the fact that results may take a moment to appear while the model trains or generates data.
  • Direct and Action-Oriented: Use strong action verbs. For example, “Use the train_model tool…”, “Generate synthetic data using…”, “List all available models…”.
  • Parameter-Centric: Clearly enumerate each required and optional parameter, specifying whether values come from user input, fixed configuration, or a prior step’s output.
  • Step-by-Step Structure: For multi-step workflows (for example, train then generate), number the steps explicitly.
  • Add Conversation Starters: Include example prompts that users can use to trigger each workflow.
When instructing your Agent to use a Syngen tool, follow this structured pattern:
  1. State the Goal: Describe the objective. For example, “Goal: Train a synthetic data model on the uploaded customer dataset.”
  2. Specify the Tool: Identify the exact tool. For example, “Tool: Use the train_model tool.”
  3. Define Parameters: List all parameters with their values or sources.
  4. Describe Expected Outcome: State what a successful result looks like. For example, “Outcome: A trained model artifact will be saved to the bucket and a confirmation message returned.”
  5. Add Conversation Starters: Include example prompts. For example, “Conversation Starters: ‘Train a model on my data file.’, ‘Generate 500 synthetic rows.’”
Agent Instructions for a Train → Generate Workflow:
## Synthetic Data Generation Workflow

### Step 1: Train Model
1. Goal: Train a synthetic data model from the uploaded template file.
2. Tool: Use the train_model tool.
3. Parameters:
   - model_name: Ask the user for a model name, or use "default_model" if not provided.
   - training_file_name: Ask the user for the CSV file name in the bucket.
   - epochs: Use 20 unless the user specifies otherwise.
   - batch_size: Use 32 (default).
   - drop_null: Use False (default).
4. Outcome: Confirm to the user that training is complete and the model has been saved to the bucket.

### Step 2: Generate Data
1. Goal: Generate synthetic rows using the trained model.
2. Tool: Use the generate_data tool.
3. Parameters:
   - model_name: Use the same model_name from Step 1.
   - size: Ask the user how many rows to generate (default: 100).
   - random_seed: Ask the user if they need reproducible results; use the provided seed if so.
4. Outcome: A CSV file with the generated rows will be returned and saved to the bucket.

## Conversation Starters
- "Train a model on my data file."
- "Generate 500 synthetic rows."
- "What models are available?"

Real-World Usage Examples

The following examples demonstrate how to interact with the Syngen Toolkit in ELITEA Chat and Agents. Each example shows a realistic user request and the corresponding agent response.
Chat Example:
User: "Please train a model called 'customer_model' using 'customers.csv'. Use 20 epochs."

Agent Response: Starting the synthetic data model training now.

[Agent uses train_model tool — background processing]

✔️ Training completed for model 'customer_model'.
- Model saved as: customer_model_model.tgz
- Bucket: syngen-data
- Epochs: 20 | Batch size: 32 | Drop null: False
- Columns learned: customer_id, first_name, last_name, e_mail, country, city, age, account_balance

The model is ready for data generation.
Chat Example:
User: "Generate 500 rows of synthetic customer data using 'customer_model'. Make it reproducible with seed 42."

Agent Response: Generating synthetic data now using the trained model.

[Agent uses generate_data tool — background processing]

✔️ Generated 500 rows of synthetic data for model 'customer_model'.
- Output file: customer_model_synthetic.csv
- Saved to bucket: syngen-data
- Random seed: 42 (fully reproducible)

The synthetic CSV file is ready in the Artifacts bucket.
Chat Example:
User: "What trained models are available?"

Agent Response: Checking the model registry.

[Agent uses list_models tool — background processing]

Found 2 trained model(s):

1. Model: customer_model
   - Columns (8): customer_id, first_name, last_name, e_mail, country, city, age, account_balance
   - Training file: customers.csv
   - Model file: customer_model_model.tgz
   - Training settings:
     • Epochs: 20
     • Batch size: 32
     • Drop null: False
   - Created: 2025-06-15T10:23:41

2. Model: orders_model
   - Columns (5): order_id, customer_id, product, quantity, price
   - Training file: orders.csv
   - Model file: orders_model_model.tgz
   - Training settings:
     • Epochs: 10
     • Batch size: 32
     • Drop null: False
   - Created: 2025-06-14T08:55:10


Known Limitations

  • Single table per model: Each training run learns from a single table. Multi-table or relational training is not supported through the toolkit.
  • Large dataset training time: Training on large datasets can be time-consuming. Use the Row limit parameter to cap the number of rows when testing or prototyping.
  • Parallel generation not supported: The generate_data tool always runs in single-thread (non-parallel) mode. A parallel execution option is not available in the current release.

Troubleshooting

Problem: The Syngen Toolkit option does not appear in the toolkit type selection list when clicking + Create.

Troubleshooting Steps:
  1. Verify the feature is enabled: The Syngen Toolkit must be enabled on your ELITEA platform. If the option does not appear, contact your platform administrator to confirm it has been activated.
  2. Refresh the browser: After a new toolkit type becomes available, a browser refresh may be required to load the updated list.
Problem: The train_model tool returns an error indicating the training file could not be found or downloaded.

Troubleshooting Steps:
  1. Verify the file exists in the bucket: Navigate to Artifacts → your bucket and confirm the training file is present with the exact file name you specified.
  2. Check the bucket name: Ensure the Bucket name in the toolkit configuration exactly matches the bucket name in Artifacts.
  3. Check the file name spelling: The file name is case-sensitive. Ensure it exactly matches the file name in the bucket, including the file extension.
  4. Verify bucket scope: Confirm the bucket belongs to the same ELITEA project as the toolkit.
Problem: Training fails with an error message or failure notification.

Troubleshooting Steps:
  1. Reduce dataset size: Use the Row limit parameter to restrict training to the first N rows of your file, or lower the Batch size to decrease memory consumption. Large datasets or large batch sizes can trigger an out_of_memory error.
  2. Check data format validity: Ensure the training file is a valid, UTF-8 encoded file of a supported format (CSV or Avro) and is not corrupted or empty. Invalid or unparseable data produces an invalid_input error.
  3. Review the error message: The error response includes a human-readable description, an error_category field (such as training_failed, out_of_memory, invalid_input, or runtime_error), and a full stack trace to help pinpoint the root cause.
  4. Check for slugified name conflicts: Model names are normalized to lowercase with underscores converted to hyphens (for example, my_model becomes my-model). If two model names slugify to the same string, the second training run will overwrite the first. Use distinct model names to avoid this.
Problem: Training or generation fails with an authorization error, or the error message contains “Not authorized to access artifact”.

Troubleshooting Steps:
  1. Verify the LLM Model setting: The toolkit authenticates against Artifacts storage using the API key associated with your selected LLM Model. Ensure the LLM Model field in the toolkit configuration points to a model that is active and accessible in your project.
  2. Check the project scope: The bucket must belong to the same ELITEA project as the toolkit. If the toolkit was configured in one project and the bucket was created in another, artifact access will be denied.
  3. Re-save the toolkit: If your API key has recently been rotated or refreshed, re-open the toolkit configuration and save it again to pick up the updated credentials.
Problem: The generate_data tool returns an error indicating the model could not be found.

Troubleshooting Steps:
  1. Verify the model was trained successfully: Run list_models to confirm the model appears in the registry. If it does not, re-run train_model.
  2. Check the model name spelling: The model name in generate_data must exactly match the name used during training (case-sensitive, before normalization).
  3. Check the bucket: Confirm the model archive file ({model_name}_model.tgz) exists in the Artifacts bucket.
  4. Check for a partial training run: If a previous training run was cancelled or interrupted before the upload step, the model archive may not have been saved. Re-run train_model to produce a complete model.
Problem: list_models returns a message saying no trained models were found, even though models have been trained.

Troubleshooting Steps:
  1. Check the bucket name: Verify the Bucket name in the toolkit configuration is correct and refers to the same bucket where training was run.
  2. Verify the registry file exists: Navigate to Artifacts → your bucket and look for _syngen_models_registry.json. If it does not exist, no training job has completed successfully in this bucket yet.
  3. Check for a registry update failure: If training completed but the registry was not updated (for example, due to a network interruption during upload), the model archive will exist in the bucket but will not appear in the list. Re-running training will regenerate the registry entry. The toolkit retries the registry update up to three times automatically, but a persistent failure may still leave it incomplete.
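If you need to inspect the registry manually, a small sketch like the following can read it. Note that the registry schema used here is a guess inferred from the list_models output shown earlier in this guide; the real _syngen_models_registry.json layout may differ:

```python
import json

# Hypothetical registry layout, inferred from the list_models output above.
registry_text = json.dumps({
    "customer_model": {
        "columns": ["customer_id", "first_name"],
        "training_file": "customers.csv",
        "created": "2025-06-15T10:23:41",
    }
})

def list_model_names(text):
    """Return the model names recorded in a registry JSON document
    (assuming a top-level name-to-metadata mapping)."""
    return sorted(json.loads(text).keys())

list_model_names(registry_text)  # ["customer_model"]
```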
Problem: Training appears to freeze after starting — no epoch progress lines are visible for several seconds, or numerical progress bars seem to be missing from the output.

Troubleshooting Steps:
  1. This is expected behavior: The toolkit batches all training output at one-second intervals. If an epoch completes in under one second, its output may be grouped with the next batch. Short pauses between updates are normal.
  2. Progress bars are intentionally hidden: Numerical progress bars produced by the training process (tqdm-style it/s counters) are automatically filtered from the output to reduce noise. Only epoch-level loss values are surfaced, in the format Training epoch N, loss: X.XXXX.
  3. Verify the process is still active: If there is no output for more than a few minutes, the training job may have silently failed. Check the error response or contact your platform administrator if the job does not complete.
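The output filtering described above can be approximated as follows. This is a sketch of the idea, not the toolkit's actual implementation:

```python
import re

# Epoch-level loss lines have the documented shape "Training epoch N, loss: X.XXXX"
EPOCH_LINE = re.compile(r"Training epoch \d+, loss: \d+\.\d+")

def filter_training_output(lines):
    """Keep only epoch-level loss lines, dropping tqdm-style it/s
    progress-bar noise (a sketch of the documented behavior)."""
    return [line for line in lines if EPOCH_LINE.search(line)]

raw = [
    " 45%|████      | 45/100 [00:03<00:04, 12.3it/s]",
    "Training epoch 1, loss: 0.4821",
    "100%|██████████| 100/100 [00:07<00:00, 13.1it/s]",
    "Training epoch 2, loss: 0.3517",
]
filter_training_output(raw)  # keeps only the two epoch lines
```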

Support Contact

If you encounter issues not covered in this guide or need additional assistance with Syngen Toolkit integration, please refer to Contact Support for detailed information on how to reach the ELITEA Support Team.

FAQ

Do I need to create credentials for the Syngen Toolkit?

No. Unlike toolkits that connect to external services (such as GitHub, Jira, or Confluence), the Syngen Toolkit has no external service to authenticate with. It connects exclusively to ELITEA’s own internal Artifacts storage using your existing session. There is no Credentials step in the setup flow — only the toolkit itself needs to be configured.
Why do training and generation run as background operations?

Training a model and running data generation can take anywhere from seconds to several minutes, depending on dataset size, the number of training epochs, and available resources. Running these operations in the background prevents ELITEA from becoming unresponsive during processing and allows real-time progress updates — such as training epoch progress — to be streamed back to the UI as they happen. You can also cancel a running job at any time.
Can I cancel a running job?

Yes. The toolkit supports mid-job cancellation. When you stop a running invocation in ELITEA, the toolkit signals the background process to terminate gracefully. If the process does not stop within five seconds, it is force-killed. Note that if training is cancelled before the final upload step completes, the model archive will not be saved to the bucket and the registry will not be updated — re-run train_model to get a complete model.
Which file formats are supported for training data?

The following source file formats are supported:
| Format | Extensions |
| --- | --- |
| CSV and delimited text | .csv, .tsv, .psv, .txt |
| Avro | .avro |
All files must be encoded in UTF-8.
Excel (.xlsx, .xls) files will fail during training due to a known bug in the syngen library. Convert your Excel file to CSV before uploading it to the Artifacts bucket.
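Since all training files must be UTF-8 encoded, a quick sanity check on the raw file bytes before uploading can save a failed training run. A minimal sketch:

```python
def is_utf8(data: bytes) -> bool:
    """Check that raw file bytes decode as UTF-8, a quick pre-upload
    sanity check for training files (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

is_utf8("id,name\n1,Ana\n".encode("utf-8"))  # True
is_utf8("café".encode("latin-1"))            # False: latin-1 bytes are not valid UTF-8
```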
Why don't I see a progress line for every training epoch?

The toolkit batches all training output at one-second intervals to avoid flooding the UI. Epochs that complete faster than one second may have their output grouped with the next batch, so you will not necessarily see every epoch individually. Additionally, numerical tqdm-style progress bars (it/s counters) are silently filtered to keep the output readable. Only epoch-level loss values are shown, in the format Training epoch N, loss: X.XXXX.
Can a trained model be reused by other agents or pipelines?

Yes. Because trained model archives are stored in your Artifacts bucket, any agent or pipeline configured with the same Syngen Toolkit (pointing to the same bucket) can use the same trained model for generation. You only need to train once and can generate data as many times as you need.
How do I make generation results reproducible?

When invoking generate_data, set the random_seed parameter to any whole number (0 or greater). Using the same seed with the same model and the same size setting will always produce the same output, making your results fully reproducible.
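The principle behind random_seed can be illustrated with Python's standard random module. This toy generator is not the toolkit's model; it only demonstrates that a fixed seed plus a fixed size yields identical output:

```python
import random

def toy_generate(seed, size):
    """Toy illustration of seeded reproducibility: the same seed and
    size always yield the same 'synthetic' values."""
    rng = random.Random(seed)
    return [round(rng.uniform(0, 100), 2) for _ in range(size)]

toy_generate(42, 5) == toy_generate(42, 5)  # True: identical output for the same seed
```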
Are temporary files cleaned up after a job runs?

The toolkit automatically creates a private working area for each job while it runs and deletes it entirely once the job finishes — whether it succeeds or fails. Only the final outputs (the model archive and generated CSV) are kept, and these are saved to your Artifacts bucket.
Does the trained model archive contain any real data?

A small statistical sample of the training data is preserved inside the model archive that gets saved to your Artifacts bucket. This sample is used to support the generation process. If your source data is sensitive, treat the model archive with the same level of care as the original data and control who has access to the Artifacts bucket.
Can I train on large datasets?

Yes, but training time increases with dataset size. For initial testing or prototyping, use the Row limit parameter to restrict training to the first N rows of your file. Once you are happy with the setup, you can remove the limit and run a full training job.

To further enhance your understanding of ELITEA workflows and toolkits, here are helpful internal resources: