
Introduction

This guide provides a complete reference for PUT annotation syntax. It covers all annotation formats, multi-language support, multiline annotations, and best practices.

New to putior? Start with the Quick Start guide to create your first diagram in 2 minutes.

The name putior combines PUT + Input + Output + R, reflecting the package’s core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.

Annotation Basics

PUT annotations are special comments that describe workflow nodes. Start simple:

Minimal annotation (just a label):

# put label:"Load Data"

That’s all you need! putior will:

  • Auto-generate a unique ID
  • Default node_type to "process"
  • Default output to the filename

Add more detail as needed:

# put label:"Load Data", node_type:"input", output:"data.csv"

Full R script example:

# data_processing.R
# put label:"Load Customer Data", node_type:"input", output:"raw_data.csv"

# Your actual code
data <- read.csv("customer_data.csv")
write.csv(data, "raw_data.csv")

# put label:"Clean and Validate", input:"raw_data.csv", output:"clean_data.csv"

# Data cleaning code
library(dplyr)

cleaned_data <- data %>%
  filter(!is.na(customer_id)) %>%
  mutate(purchase_date = as.Date(purchase_date))

write.csv(cleaned_data, "clean_data.csv")

Python script example:

# analysis.py
# put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"

import pandas as pd
import json

# Load cleaned data
data = pd.read_csv("clean_data.csv")

# Perform analysis
sales_summary = {
    "total_sales": data["amount"].sum(),
    "avg_order": data["amount"].mean(),
    "customer_count": data["customer_id"].nunique()
}

# Save results
with open("sales_report.json", "w") as f:
    json.dump(sales_summary, f)

Resulting diagram from both files:

flowchart TD
    load_data(["Load Customer Data"])
    clean_data["Clean and Validate"]
    analyze_sales["Sales Analysis"]

    %% Connections
    load_data --> clean_data
    clean_data --> analyze_sales

    %% Styling
    classDef inputStyle fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e40af
    class load_data inputStyle
    classDef processStyle fill:#ede9fe,stroke:#7c3aed,stroke-width:2px,color:#5b21b6
    class clean_data processStyle
    class analyze_sales processStyle

Extracting Annotations

Use the put() function to scan your files and extract workflow information:

# Scan all R and Python files in a directory
workflow <- put("./src/")

# View the extracted workflow
print(workflow)

The output is a data frame where each row represents a workflow node:

Column      Description
---------   ------------------------------------------
file_name   Which script contains this node
file_type   Programming language (r, py, sql, etc.)
id          Unique identifier for the node
label       Human-readable description
node_type   Type of operation (input, process, output)
input       Files consumed by this step
output      Files produced by this step

Custom properties you define are also included as additional columns.

Complete Syntax Reference

Basic Format

The general syntax for PUT annotations is:

# put property1:"value1", property2:"value2", property3:"value3"

Flexible Syntax Options

PUT annotations support several formats to fit different coding styles:

# put id:"my_node", label:"My Process"          # Standard format (matches logo)
#put id:"my_node", label:"My Process"           # Also valid (no space)
# put| id:"my_node", label:"My Process"         # Pipe separator
# put id:'my_node', label:'Single quotes'       # Single quotes
# put id:"my_node", label:'Mixed quotes'        # Mixed quote styles
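All of these variants normalize to the same property:"value" pairs. As a rough illustration (this is a sketch, not putior’s actual parser, which is more robust), a few lines of base R can pull the pairs out of an annotation line:

```r
# Illustrative sketch only; putior's real parser handles more edge cases.
parse_put_line <- function(line) {
  # Drop the comment prefix (#, --, //, %), the "put" keyword, and an optional pipe
  body <- sub("^\\s*(#|--|//|%)\\s*put\\|?\\s*", "", line)
  # Match property:"value" pairs in double or single quotes
  pairs <- regmatches(body, gregexpr("(\\w+):([\"'])(.*?)\\2", body, perl = TRUE))[[1]]
  keys   <- sub(":.*$", "", pairs)
  values <- sub("^\\w+:[\"']", "", sub("[\"']$", "", pairs))
  stats::setNames(as.list(values), keys)
}

parse_put_line('# put id:"my_node", label:"My Process"')
# returns list(id = "my_node", label = "My Process")
```

The same regex accepts the no-space, pipe-separator, and mixed-quote forms shown above.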

Multiline Annotations

For complex annotations with many properties, use backslash (\) continuation:

R/Python style:

# put id:"complex_etl", \
#     label:"Complex ETL Process", \
#     node_type:"process", \
#     input:"raw_data.csv, config.yaml", \
#     output:"processed.parquet", \
#     author:"Data Team", \
#     version:"2.0"

SQL style:

--put id:"load_customers", \
--    label:"Load Customer Data", \
--    node_type:"input", \
--    output:"customers_table"
SELECT * FROM raw_customers;

JavaScript/TypeScript style:

//put id:"api_handler", \
//    label:"Process API Request", \
//    input:"request.json", \
//    output:"response.json"

Rules for multiline annotations:

  1. End each line (except the last) with a backslash \
  2. Start continuation lines with the same comment prefix
  3. Continuation lines can have leading whitespace for readability
  4. Properties can span multiple lines
  5. The backslash must be the last character on the line (no trailing spaces)

Example with many properties:

# put id:"train_model", \
#     label:"Train Random Forest Model", \
#     node_type:"process", \
#     input:"features.csv, labels.csv", \
#     output:"model.rds, metrics.json", \
#     group:"machine_learning", \
#     stage:"3", \
#     estimated_time:"45min", \
#     memory_intensive:"true"

When Multiline Annotations Don’t Work:

  • Trailing spaces: Ensure backslash is the last character (no spaces after)
  • Missing prefix: Each continuation line needs the comment prefix (#, --, //)
  • Fallback: If multiline fails, use a single long line - readability is secondary to functionality
  • Debug: Use set_putior_log_level("DEBUG") to see exactly how lines are being parsed

Multi-Language Support

putior automatically uses the correct comment prefix based on file extension:

Comment Style   Languages                                    Extensions
-------------   ------------------------------------------   -----------------------------
# put           R, Python, Shell, Julia, Ruby, YAML          .R, .py, .sh, .jl, .rb, .yaml
-- put          SQL, Lua, Haskell                            .sql, .lua, .hs
// put          JavaScript, TypeScript, C, Java, Go, Rust    .js, .ts, .c, .java, .go, .rs
% put           MATLAB, LaTeX                                .m, .tex

SQL Example:

-- query.sql
--put id:"load_data", label:"Load Customer Data", output:"customers"
SELECT * FROM customers WHERE active = 1;

JavaScript Example:

// process.js
//put id:"transform", label:"Transform JSON", input:"data.json", output:"output.json"
const transformed = data.map(item => process(item));

MATLAB Example:

% analysis.m
%put id:"compute", label:"Statistical Analysis", input:"data.mat", output:"results.mat"
results = compute_statistics(data);

Block Comments

For languages with block comment support (JavaScript, TypeScript, C, C++, Java, Go, Rust, and other //-prefix languages), PUT annotations can also appear inside /* ... */ and /** ... */ block comments. Use a * line prefix:

JSDoc-style (recommended for JS/TS):

/**
 * put id:"load", label:"Load Data", node_type:"input"
 */
function loadData() { return fetch('/api/data'); }

C-style block comment:

/*
 * put id:"init", label:"Initialize System"
 */
void init() {}

Single-line block comment:

/* put id:"quick", label:"Quick Operation" */
const x = transform(data);

Multiple annotations can appear in one block:

/**
 * put id:"step_a", label:"Step A"
 * put id:"step_b", label:"Step B"
 */

Both single-line (//) and block (/* */) annotations can coexist in the same file. Languages without block comment syntax (R, Python, SQL, etc.) continue to use their single-line prefix only.

Core Properties

While putior accepts any properties you define, these are commonly used:

Property    Purpose             Example Values
---------   -----------------   ------------------------------
id          Unique identifier   "load_data", "process_sales"
label       Human description   "Load Customer Data"
node_type   Operation type      "input", "process", "output"
input       Input files         "raw_data.csv", "data/*.json"
output      Output files        "processed_data.csv"

Standard Node Types

For consistency across projects, use these standard node types:

Type       Mermaid Shape        Use For
--------   ------------------   ----------------------------------------------------
input      Stadium ([...])      Data sources, file loading, API inputs
process    Rectangle [...]      Data transformation, analysis, computation (default)
output     Subroutine [[...]]   Report generation, data export, visualization
decision   Diamond {...}        Conditional logic, branching workflows
start      Stadium ([...])      Workflow entry point (gets boundary styling)
end        Stadium ([...])      Workflow exit point (gets boundary styling)

artifact nodes (cylinder shape) are automatically created by put_diagram(show_artifacts = TRUE) for data files referenced in input/output fields. You don’t set node_type:"artifact" manually.
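To see those artifact nodes, enable show_artifacts when rendering. A minimal sketch (assuming put_diagram() accepts the extracted workflow data frame as its first argument):

```r
library(putior)

# Extract annotations, then render with data-file artifact nodes included
workflow <- put("./src/")
put_diagram(workflow, show_artifacts = TRUE)
```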

Visual representation of node types:

flowchart TD
    load(["Load Data (input)"])
    transform["Transform (process)"]
    export[["Export (output)"]]
    check{"Validate? (decision)"}

    %% Connections
    load --> transform
    transform --> export
    transform --> check

    %% Styling
    classDef inputStyle fill:#dbeafe,stroke:#2563eb,stroke-width:2px,color:#1e40af
    class load inputStyle
    classDef processStyle fill:#ede9fe,stroke:#7c3aed,stroke-width:2px,color:#5b21b6
    class transform processStyle
    classDef outputStyle fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#15803d
    class export outputStyle
    classDef decisionStyle fill:#fef3c7,stroke:#d97706,stroke-width:2px,color:#92400e
    class check decisionStyle

Custom Properties

Add any properties you need for visualization or metadata:

# put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"

These custom properties can be used by visualization tools or workflow management systems.
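Because custom properties surface as ordinary data frame columns (see Extracting Annotations above), you can query them with base R. A sketch, assuming some annotations define the group property shown above:

```r
library(putior)

workflow <- put("./src/")

# Filter nodes by a custom property; the column only exists if at least
# one annotation defines it, so check for it first
if ("group" %in% names(workflow)) {
  ml_steps <- workflow[workflow$group %in% "machine_learning", ]
  print(ml_steps$label)
}
```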

Advanced Usage

Processing Individual Files

You can process single files instead of entire directories:

# Process a single file
workflow <- put("./scripts/analysis.R")

Recursive Directory Scanning

Include subdirectories in your scan:

# Search subdirectories recursively
workflow <- put("./project/", recursive = TRUE)

Custom File Patterns

Control which files are processed:

# Only R files
workflow <- put("./src/", pattern = "\\.R$")

# R and SQL files only
workflow <- put("./src/", pattern = "\\.(R|sql)$")

# All supported file types (default)
workflow <- put("./src/", pattern = "\\.(R|r|py|sql|sh|jl)$")

Including Line Numbers

For debugging annotation issues, include line numbers:

# Include line numbers for debugging
workflow <- put("./src/", include_line_numbers = TRUE)

Validation Control

Control annotation validation:

# Enable validation (default) - provides helpful warnings
workflow <- put("./src/", validate = TRUE)

# Disable validation warnings
workflow <- put("./src/", validate = FALSE)

Automatic ID Generation

If you omit the id field, putior will automatically generate a unique UUID:

# Annotations without explicit IDs get auto-generated UUIDs
# put label:"Load Data", node_type:"input", output:"data.csv"
# put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"

# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id)  # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

Note: If you provide an empty id (e.g., id:""), you’ll get a validation warning.

Automatic Output Defaulting

If you omit the output field, putior automatically uses the file name as the output:

# In process_data.R:
# put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"

# In analyze_data.R:
# put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R

This feature ensures that scripts can be connected in workflows even when explicit output files aren’t specified.

Tracking Source Relationships

When you have scripts that source other scripts, use this annotation pattern:

# In main.R (sources other scripts):
# put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R")    # Reading load_data.R into main.R
source("process_data.R") # Reading process_data.R into main.R

# In load_data.R (sourced by main.R):
# put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"

# In process_data.R (sourced by main.R, depends on load_data.R):
# put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"

This correctly shows the flow: sourced scripts are inputs to the main script.

Variable References with .internal Extension

putior supports tracking in-memory variables and objects using the .internal extension. This is useful for documenting computational steps within scripts while maintaining clear data flow between scripts.

Key Concepts

.internal variables:

  • Represent in-memory objects during script execution
  • Can only be outputs, never inputs between scripts
  • Help document what variables are created within each script
  • Example: my_data.internal represents a variable named my_data

Persistent files:

  • Enable actual data flow between scripts
  • Can be both inputs and outputs
  • Required for connected workflows
  • Example: my_data.RData, results.csv

Correct Usage Pattern

# Script 1: Create variable and save it
# put id:"create_data", output:"dataset.internal, dataset.RData"
dataset <- data.frame(x = 1:100, y = rnorm(100))
save(dataset, file = "dataset.RData")

# Script 2: Load data and create new variables
# put id:"analyze_data", input:"dataset.RData", output:"analysis.internal, summary.txt"
load("dataset.RData")  # Load the persistent file (NOT dataset.internal)
analysis <- summary(dataset)  # Create new in-memory variable
writeLines(capture.output(analysis), "summary.txt")

What NOT to Do

# INCORRECT: Using .internal as input between scripts
# put input:"dataset.internal"  # This is wrong!

# CORRECT: Use persistent files as inputs
# put input:"dataset.RData"     # This is correct!

Complete Example

Try the comprehensive variable reference example:

source(system.file("examples", "variable-reference-example.R", package = "putior"))

This creates a connected 4-script workflow demonstrating proper .internal usage and file-based data flow.

Real-World Example

Let’s walk through a complete data science workflow:

1. Data Collection (Python)

# 01_collect_data.py
# put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"

import requests
import json

response = requests.get("https://api.example.com/sales")
data = response.json()

with open("raw_api_data.json", "w") as f:
    json.dump(data, f)

2. Data Processing (R)

# 02_process_data.R
# put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"

library(jsonlite)
library(dplyr)

# Load raw data
raw_data <- fromJSON("raw_api_data.json")

# Process and clean
processed <- raw_data %>%
  filter(!is.na(sale_amount)) %>%
  mutate(
    sale_date = as.Date(sale_date),
    sale_amount = as.numeric(sale_amount)
  ) %>%
  arrange(sale_date)

# Save processed data
write.csv(processed, "processed_sales.csv", row.names = FALSE)

3. Analysis and Reporting (R)

# 03_analyze_report.R
# put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
# put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"

library(dplyr)

# Load processed data
sales_data <- read.csv("processed_sales.csv")

# Perform analysis
analysis_results <- list(
  total_sales = sum(sales_data$sale_amount),
  monthly_trends = sales_data %>%
    group_by(month = format(sale_date, "%Y-%m")) %>%
    summarise(monthly_total = sum(sale_amount)),
  top_products = sales_data %>%
    group_by(product) %>%
    summarise(product_sales = sum(sale_amount)) %>%
    arrange(desc(product_sales)) %>%
    head(10)
)

# Save analysis
saveRDS(analysis_results, "analysis_results.rds")

# Generate report
rmarkdown::render("report_template.Rmd",
                  output_file = "sales_report.html")

4. Extract the Complete Workflow

# Extract workflow from all files
complete_workflow <- put("./sales_project/", recursive = TRUE)
print(complete_workflow)

This would show the complete data flow: API → JSON → CSV → Analysis → Report
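From here, the extracted data frame can be turned into a Mermaid diagram. A sketch, assuming put_diagram() takes the workflow data frame as its first argument:

```r
# Render the extracted workflow as a Mermaid flowchart
put_diagram(complete_workflow)
```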

Best Practices

1. Use Descriptive Names

Choose clear, descriptive names that explain what each step does:

# Good
# put id:"load_customer_transactions", label:"Load Customer Transaction Data"
# put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

# Less descriptive
# put id:"step1", label:"Load data"
# put id:"process", label:"Do calculations"

2. Document Data Dependencies

Always specify inputs and outputs for data processing steps:

# put id:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"

3. Use Consistent Node Types

Stick to a standard set of node types across your team:

# put id:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
# put id:"clean_data", label:"Clean and Validate", node_type:"process"
# put id:"export_results", label:"Export Final Results", node_type:"output"

4. Add Helpful Metadata

Include metadata that helps with workflow understanding:

# put id:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"

Use grouping properties to organize complex workflows:

# put id:"feature_engineering", label:"Engineer Features", group:"preprocessing", stage:"1"
# put id:"model_training", label:"Train Model", group:"modeling", stage:"2"
# put id:"model_evaluation", label:"Evaluate Model", group:"modeling", stage:"3"

Troubleshooting

Having issues with annotations? See the Troubleshooting Guide for common problems and step-by-step fixes.

Quick diagnostic:

# Test if your annotation is valid
is_valid_put_annotation('# put id:"test", label:"Test Node"')  # Should be TRUE

See Also

Guide             Description
---------------   ---------------------------------------------
Quick Start       Create your first diagram in 2 minutes
Quick Reference   Cheat sheet for daily use
Features Tour     Auto-detection, logging, interactive diagrams
API Reference     Complete function documentation
Showcase          Real-world examples (ETL, ML, bioinformatics)
Troubleshooting   Common issues and solutions

Built-in examples:

# Complete workflow example
source(system.file("examples", "reprex.R", package = "putior"))

# Variable reference example
source(system.file("examples", "variable-reference-example.R", package = "putior"))

# Interactive diagrams example
source(system.file("examples", "interactive-diagrams-example.R", package = "putior"))

Function help: