Introduction
The putior package helps you document and visualize workflows by extracting structured annotations from your R and Python source files. This vignette shows you how to get started with PUT annotations and workflow extraction.
The name putior combines PUT + Input + Output + R, reflecting the package’s core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.
Why Use putior?
- Automatic documentation: Your workflow documentation stays in sync with your code
- Multi-language support: Works with R, Python, SQL, and other file types
- Data lineage tracking: See how data flows through your processing steps
- Team collaboration: Help colleagues understand complex workflows
- Visual workflow creation: Extract structured data ready for flowchart generation
Quick Start
The fastest way to see putior in action is to run the built-in example:
# Run the complete example
source(system.file("examples", "reprex.R", package = "putior"))
This creates a sample multi-language workflow and demonstrates the workflow extraction capabilities of putior.
Basic Workflow
Step 1: Add PUT Annotations to Your Code
PUT annotations are special comments that describe workflow nodes. Here’s how to add them to your source files:
R script example:
# data_processing.R
library(dplyr)  # provides %>%, filter(), and mutate() used below

#put id:"load_data", label:"Load Customer Data", node_type:"input", output:"raw_data.csv"

# Your actual code
data <- read.csv("customer_data.csv")
write.csv(data, "raw_data.csv")

#put id:"clean_data", label:"Clean and Validate", node_type:"process", input:"raw_data.csv", output:"clean_data.csv"

# Data cleaning code
cleaned_data <- data %>%
  filter(!is.na(customer_id)) %>%
  mutate(purchase_date = as.Date(purchase_date))
write.csv(cleaned_data, "clean_data.csv")
Python script example:
# analysis.py
#put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"
import pandas as pd
import json
# Load cleaned data
data = pd.read_csv("clean_data.csv")
# Perform analysis
sales_summary = {
    "total_sales": data["amount"].sum(),
    "avg_order": data["amount"].mean(),
    "customer_count": data["customer_id"].nunique()
}

# Save results
with open("sales_report.json", "w") as f:
    json.dump(sales_summary, f)
Step 2: Extract the Workflow
Use the put() function to scan your files and extract workflow information:
# Scan all R and Python files in a directory
workflow <- put("./src/")
# View the extracted workflow
print(workflow)
Expected output:
#> file_name file_type input label id
#> 1 data_processing.R r <NA> Load Customer Data load_data
#> 2 data_processing.R r raw_data.csv Clean and Validate clean_data
#> 3 analysis.py py clean_data.csv Sales Analysis analyze_sales
#> node_type output
#> 1 input raw_data.csv
#> 2 process clean_data.csv
#> 3 process sales_report.json
Step 3: Understand the Results
The output is a data frame where each row represents a workflow node. The columns include:
- file_name: Which script contains this node
- file_type: Programming language (r, py, sql, etc.)
- id: Unique identifier for the node
- label: Human-readable description
- node_type: Type of operation (input, process, output)
- input: Files consumed by this step
- output: Files produced by this step
- Custom properties: Any additional metadata you defined
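Because the result is a plain data frame, you can explore it with base R. Here is a small sketch (assuming the workflow object from Step 2) that counts node types and matches outputs to inputs to reveal direct file hand-offs:
# Count nodes by type
table(workflow$node_type)

# Match each step's output to other steps' inputs (a sketch; this only
# matches rows whose input field holds a single file name, since multiple
# inputs are stored as one comma-separated string)
merge(
  workflow[, c("id", "output")],
  workflow[, c("id", "input")],
  by.x = "output", by.y = "input"
)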
PUT Annotation Syntax
Basic Format
The general syntax for PUT annotations is:
#put property1:"value1", property2:"value2", property3:"value3"
Flexible Syntax Options
PUT annotations support several formats to fit different coding styles:
#put id:"my_node", label:"My Process" # Standard format
# put id:"my_node", label:"My Process" # Space after #
#put| id:"my_node", label:"My Process" # Pipe separator
#put id:'my_node', label:'Single quotes' # Single quotes
#put id:"my_node", label:'Mixed quotes' # Mixed quote styles
Core Properties
While putior accepts any properties you define, these are commonly used:
| Property | Purpose | Example Values |
|---|---|---|
| id | Unique identifier | "load_data", "process_sales" |
| label | Human-readable description | "Load Customer Data" |
| node_type | Operation type | "input", "process", "output" |
| input | Input files | "raw_data.csv", "data/*.json" |
| output | Output files | "processed_data.csv" |
Standard Node Types
For consistency across projects, consider using these standard node types:
- input: Data collection, file loading, API calls
- process: Data transformation, analysis, computation
- output: Report generation, data export, visualization
- decision: Conditional logic, branching workflows
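The decision type does not appear in the examples elsewhere in this vignette, so here is one possible annotation for it (condition is a hypothetical custom property for your own tooling, not something putior itself interprets):
#put id:"check_quality", label:"Data Quality Gate", node_type:"decision", input:"clean_data.csv", condition:"row_count > 0"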
Custom Properties
Add any properties you need for visualization or metadata:
#put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"
These custom properties can be used by visualization tools or workflow management systems.
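Since custom properties come back as additional columns in the extracted data frame (see Step 3), you can filter on them with ordinary R. A small sketch, assuming the annotation above has been scanned:
# Scan the directory containing the annotated script
workflow <- put("./src/")

# 'group' is a column only because the annotation above defined it
subset(workflow, group == "machine_learning")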
Advanced Usage
Processing Individual Files
You can process single files instead of entire directories:
# Process a single file
workflow <- put("./scripts/analysis.R")
Recursive Directory Scanning
Include subdirectories in your scan:
# Search subdirectories recursively
workflow <- put("./project/", recursive = TRUE)
Including Line Numbers
For debugging annotation issues, include line numbers:
# Include line numbers for debugging
workflow <- put("./src/", include_line_numbers = TRUE)
Automatic ID Generation
If you omit the id field, putior will automatically generate a unique UUID:
# Annotations without explicit IDs get auto-generated UUIDs
#put label:"Load Data", node_type:"input", output:"data.csv"
#put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"
# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id) # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
Note: If you provide an empty id (e.g., id:""), you’ll get a validation warning.
Automatic Output Defaulting
If you omit the output field, putior automatically uses the file name as the output:
# In process_data.R:
#put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"
# In analyze_data.R:
#put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
This feature ensures that scripts can be connected in workflows even when explicit output files aren’t specified.
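Extracting these two scripts shows the defaulted value in the output column; a quick sketch:
# The row for process_data.R should show "process_data.R" in its output
# column even though no output was written, linking it to analyze_data.R
workflow <- put("./")
workflow[, c("file_name", "input", "output")]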
Tracking Source Relationships
When you have scripts that source other scripts, use this annotation pattern:
# In main.R (sources other scripts):
#put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R") # Reading load_data.R into main.R
source("process_data.R") # Reading process_data.R into main.R
# In load_data.R (sourced by main.R):
#put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"
# In process_data.R (sourced by main.R, depends on load_data.R):
#put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
This correctly shows the flow: sourced scripts are inputs to the main script.
Real-World Example
Let’s walk through a complete data science workflow:
1. Data Collection (Python)
# 01_collect_data.py
#put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"
import requests
import json
response = requests.get("https://api.example.com/sales")
data = response.json()
with open("raw_api_data.json", "w") as f:
json.dump(data, f)
2. Data Processing (R)
# 02_process_data.R
#put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"
library(jsonlite)
library(dplyr)
# Load raw data
raw_data <- fromJSON("raw_api_data.json")
# Process and clean
processed <- raw_data %>%
  filter(!is.na(sale_amount)) %>%
  mutate(
    sale_date = as.Date(sale_date),
    sale_amount = as.numeric(sale_amount)
  ) %>%
  arrange(sale_date)
# Save processed data
write.csv(processed, "processed_sales.csv", row.names = FALSE)
3. Analysis and Reporting (R)
# 03_analyze_report.R
#put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
#put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"
library(dplyr)
# Load processed data
sales_data <- read.csv("processed_sales.csv")
# Perform analysis
analysis_results <- list(
  total_sales = sum(sales_data$sale_amount),
  # read.csv() returns sale_date as character, so convert before formatting
  monthly_trends = sales_data %>%
    group_by(month = format(as.Date(sale_date), "%Y-%m")) %>%
    summarise(monthly_total = sum(sale_amount)),
  top_products = sales_data %>%
    group_by(product) %>%
    summarise(product_sales = sum(sale_amount)) %>%
    arrange(desc(product_sales)) %>%
    head(10)
)
# Save analysis
saveRDS(analysis_results, "analysis_results.rds")
# Generate report
rmarkdown::render("report_template.Rmd",
                  output_file = "sales_report.html")
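With the three scripts saved in one project directory, a single put() call recovers the full pipeline. A sketch (the column names follow the output shown in Step 2):
# Extract the complete three-step workflow
workflow <- put("./")
workflow[, c("file_name", "id", "node_type", "input", "output")]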
Best Practices
1. Use Descriptive Names
Choose clear, descriptive names that explain what each step does:
# Good
#put id:"load_customer_transactions", label:"Load Customer Transaction Data"
#put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

# Less descriptive
#put id:"step1", label:"Load data"
#put id:"process", label:"Do calculations"
2. Document Data Dependencies
Always specify inputs and outputs for data processing steps:
#put name:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"
3. Use Consistent Node Types
Stick to a standard set of node types across your team:
#put name:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
#put name:"clean_data", label:"Clean and Validate", node_type:"process"
#put name:"export_results", label:"Export Final Results", node_type:"output"
4. Add Helpful Metadata
Include metadata that helps with workflow understanding:
#put name:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"
5. Group Related Operations
Use grouping properties to organize complex workflows:
#put name:"feature_engineering", label:"Engineer Features", group:"preprocessing", stage:"1"
#put name:"model_training", label:"Train Model", group:"modeling", stage:"2"
#put name:"model_evaluation", label:"Evaluate Model", group:"modeling", stage:"3"
Troubleshooting
No Annotations Found
If put()
returns an empty data frame:
- Check file patterns: Ensure your files match the pattern (default: R, Python, SQL, shell, Julia)
- Verify annotation syntax: Use is_valid_put_annotation() to test individual annotations
- Check file paths: Ensure the directory exists and contains the expected files
# Test annotation syntax
is_valid_put_annotation('#put name:"test", label:"Test Node"') # Should return TRUE
is_valid_put_annotation("#put invalid syntax") # Should return FALSE
# Check what files are found
list.files("./src/", pattern = "\\.(R|py)$")
Validation Warnings
If you see validation warnings:
- Empty id: An id:"" value triggers a warning; supply a unique id or omit the field entirely to get an auto-generated UUID
- Invalid node_type: Use standard types (input, process, output)
- File extensions: Ensure file references include extensions
# Enable detailed validation output
workflow <- put("./src/", validate = TRUE, include_line_numbers = TRUE)
Parsing Issues
If annotations aren’t parsed correctly:
- Check quotes: Ensure all values are properly quoted
- Escape commas: Values with commas should be in quotes
- Avoid nested quotes: Use consistent quote styles
Good example:
#put name:"step1", description:"Process data, clean outliers", type:"process"
Problematic example:
#put name:"step1", description:Process data, clean outliers, type:process
Next Steps
Now that you understand the basics of putior:
- Try the complete example: source(system.file("examples", "reprex.R", package = "putior"))
- Add annotations to your existing projects: Start with key data processing scripts
- Build visualization tools: Use the extracted workflow data to create flowcharts
- Integrate into CI/CD: Automatically update workflow documentation
- Explore advanced features: Check out the advanced usage vignette
For more detailed information, see:

- ?put - Complete function documentation
- Advanced usage vignette - Complex workflows and integration
- Best practices vignette - Team collaboration and style guides