Test Fixtures: Real Data Only Approach

Date: January 2026 Status: ✅ COMPLETE - All fixtures derived from real CEDAR data

Summary

All test fixtures are now derived from real CEDAR data files. No hardcoded test data exists - every fixture is created by sampling from actual production data, ensuring tests validate real data structures.

Philosophy

All test data must come from real data, not hardcoded values.

This ensures:

✅ Tests validate actual data structures
✅ Tests catch real-world edge cases
✅ Tests stay synchronized with production data
✅ No drift between test and production schemas
✅ Reproducible test data generation

Test Fixtures Created

All fixtures stored in: tests/testthat/fixtures/

Fixture File	Source	Rows	Derivation Method
cedar_sections_test.qs	cedar_sections.qs	12	4 sections per term (3 terms) from HIST, MATH, ANTH
cedar_students_test.qs	cedar_students.qs	60	20 students per term enrolled in test sections
cedar_programs_test.qs	cedar_programs.qs	21	7 program enrollments per term from test depts
cedar_degrees_test.qs	cedar_degrees.qs	15	5 degrees per term from test depts
cedar_faculty_test.qs	cedar_faculty.qs	24	3 faculty per term/dept combination

Fixture Generation Process

Script: `tests/testthat/create-test-fixtures.R`

Process:

Load full CEDAR data files

sections <- qread("data/cedar_sections.qs")    # 274,772 rows
students <- qread("data/cedar_students.qs")    # 2,940,164 rows
programs <- qread("data/cedar_programs.qs")    # 466,973 rows
degrees <- qread("data/cedar_degrees.qs")      # 62,616 rows
faculty <- qread("data/cedar_faculty.qs")      # 37,675 rows

Define test parameters

test_terms <- c(202510, 202560, 202580)  # Spring, Summer, Fall 2025
test_depts <- c("HIST", "MATH", "ANTH")   # Three test departments

Sample sections
- Filter by test terms and departments
- Take 4 sections per term (12 total)
- Ensures even distribution across terms
Sample students
- Filter to students enrolled in test sections
- Take 20 students per term (60 total)
- Maintains referential integrity (section_id links)
- Adds realistic grade data (completed vs in-progress terms)
Sample programs
- Filter by test terms and departments
- Take 7 program enrollments per term (21 total)
- Normalizes program names to match mappings
Sample degrees
- Filter by test departments (handles name mapping)
- Take 5 degrees per term (15 total)
- Maps degree_term → term, degree_type → degree
Sample faculty
- Filter by test terms and departments
- Take 3 faculty per term/dept (24 total)
- Real instructor records with job categories

Running Fixture Generation

cd /Users/fwgibbs/Dropbox/projects/cedar
Rscript tests/testthat/create-test-fixtures.R

Expected Output:

Creating test fixtures from CEDAR data files...
All fixtures derived from real data - no hardcoded test data

Original data loaded:
  sections: 274772 rows
  students: 2940164 rows
  programs: 466973 rows
  degrees: 62616 rows
  faculty: 37675 rows

✅ Test fixtures created in tests/testthat/fixtures/
   - cedar_sections_test.qs (12 rows)
   - cedar_students_test.qs (60 rows)
   - cedar_programs_test.qs (21 rows)
   - cedar_degrees_test.qs (15 rows)
   - cedar_faculty_test.qs (24 rows)

✓ ALL test fixtures derived from real CEDAR data - no hardcoded test data

Benefits of Real Data Fixtures

1. Schema Validation

Tests catch breaking changes to data structure:

# If cedar_students.qs loses subject_code column,
# test fixtures will also lose it, and tests will fail
# This is GOOD - it means tests are validating real data!

2. Real-World Edge Cases

Test data includes actual edge cases from production:

Students with multiple majors
Cross-listed courses
Part-term sections
Graduate/professional degrees
Multiple job categories for faculty

3. Automatic Synchronization

When production data changes:

Re-run create-test-fixtures.R
Fixtures automatically update with new structure
Tests validate against current reality

4. No Test Data Maintenance

Before (Hardcoded):

# Had to manually create test data
test_faculty <- data.frame(
  instructor_id = c("inst1", "inst2"),  # Fake IDs
  job_category = c("professor", "lecturer"),  # May not match real categories
  # ... manually maintain 10+ columns
)

After (Derived):

# Just sample from real data
test_faculty <- faculty %>%
  filter(term %in% test_terms, department %in% test_depts) %>%
  group_by(term, department) %>%
  slice_head(n = 3)

5. Realistic Relationships

Fixtures maintain real relationships:

Students → Sections (via section_id)
Sections → Faculty (via instructor_id)
Students → Programs (via student_id)
All foreign keys are real, not fabricated

Transformation Pipeline

The correct workflow for data → fixtures:

Source Data (MyReports)
         ↓
R/data-parsers/transform-to-cedar.R
         ↓
CEDAR Data Files (data/cedar_*.qs)
         ↓
tests/testthat/create-test-fixtures.R
         ↓
Test Fixtures (tests/testthat/fixtures/cedar_*_test.qs)
         ↓
Tests (tests/testthat/test-*.R, tests/test-dept-report-standalone.R)

Script Consolidation

Removed Duplicate Scripts

The following scripts were removed because R/data-parsers/transform-to-cedar.R handles all transformations comprehensively:

❌ R/transform-hr-to-cedar.R (duplicate)
❌ R/enhance-cedar-students.R (handled by transform-to-cedar.R)
❌ R/enhance-cedar-programs.R (handled by transform-to-cedar.R)
❌ R/enhance-cedar-degrees.R (handled by transform-to-cedar.R)

Single Source of Truth

Use: R/data-parsers/transform-to-cedar.R

This script handles:

DESRs → cedar_sections
class_lists → cedar_students
academic_studies → cedar_programs
degrees → cedar_degrees
hr_data → cedar_faculty

All transformations in one place, run as part of daily data pipeline.

Test Usage

Department Report Standalone Test

File: tests/test-dept-report-standalone.R

Before (Mock Data):

# Created fake faculty data
data_objects$cedar_faculty <- unique_instructors %>%
  mutate(
    instructor_name = paste0("Instructor ", row_number()),
    job_category = sample(c("professor", "lecturer"), n(), replace = TRUE)
  )

After (Real Data):

# Loads real faculty fixture
data_objects$cedar_faculty <- qread(file.path(fixtures_dir, "cedar_faculty_test.qs"))

Benefits:

Tests validate real faculty data structure
Tests catch if job_category values change
Tests work with actual instructor IDs
No maintenance of mock data logic

Maintenance

When to Regenerate Fixtures

Regenerate fixtures when:

Production data schema changes

Rscript tests/testthat/create-test-fixtures.R

Need different test coverage Edit create-test-fixtures.R to change:
- test_terms - which terms to include
- test_depts - which departments to sample
- Sample sizes (n=4, n=20, etc.)

After running transform-to-cedar.R

# Daily workflow:
Rscript R/data-parsers/transform-to-cedar.R  # Transform source → CEDAR
Rscript tests/testthat/create-test-fixtures.R  # Create test fixtures

Quality Checks

After regenerating fixtures, verify:

# Check fixture sizes
ls -lh tests/testthat/fixtures/cedar_*_test.qs

# Run tests to ensure they pass
Rscript tests/test-dept-report-standalone.R
R -e "devtools::test()"

Best Practices

DO ✅

Always derive fixtures from real CEDAR data

test_data <- real_data %>% filter(...) %>% sample_n(100)

Maintain referential integrity

# Students should reference actual section_ids from test_sections
test_students <- students %>%
  filter(section_id %in% test_sections$section_id)

Document sampling strategy

# Comment why you chose these parameters
test_terms <- c(202510, 202560, 202580)  # Spring, Summer, Fall for seasonal variation

Keep fixtures small but realistic
- Small: Fast test execution
- Realistic: Multiple terms, depts, edge cases

DON’T ❌

Never hardcode test data

# BAD
test_faculty <- data.frame(
  instructor_id = c("fake1", "fake2"),
  ...
)

Never skip fixture regeneration after schema changes
- If you add/remove columns from CEDAR data
- Always regenerate fixtures to stay in sync
Never commit large fixtures
- Keep fixtures under 1MB each
- Sample appropriately from full data
Never create transformation scripts outside data-parsers/
- Use R/data-parsers/transform-to-cedar.R
- Don’t create duplicate transformation logic

GLOBAL-CEDAR-IMPLEMENTATION-COMPLETE.md - Global.R CEDAR setup
CREDIT-HOURS-MIGRATION-COMPLETE.md - Credit hours CEDAR migration
TEST-FIXTURES-UPDATE.md - Multi-term fixtures documentation

Status: ✅ Complete All Fixtures: Derived from real CEDAR data No Hardcoded Data: Zero mock/fake test data Transformation: Consolidated in transform-to-cedar.R