Session 12: Reports & Documentation

Session 12: Creating Professional Reports

Documentation, Jupyter Notebooks & Reproducibility

Anna Smirnova, December 1, 2025

Today's Goals

Part 1: Why Documentation Matters

Part 2: Jupyter Notebooks as Reports

Part 3: Exporting & Sharing

Part 4: Hyperparameter Tuning


Part 1: Why Documentation Matters


The Documentation Problem

Consider this scenario: You open a Jupyter notebook from 6 months ago...

df2 = df.groupby('x')[['y', 'z']].agg(f).reset_index()
result = df2.merge(df3, on='id', how='left')
final = result[result['val'] > threshold]

Questions you'll ask yourself:

  • What is x, y, z?
  • What does f do?
  • Where did df3 come from?
  • What is threshold and why this value?
  • What business question does this answer?
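For contrast, here is a sketch of the same kind of pipeline with descriptive names, a named threshold, and short comments that carry the "why". Every name and number below is invented for illustration:

```python
import pandas as pd

# Tiny made-up data so the snippet runs end to end
sales = pd.DataFrame({
    'store_id': [1, 1, 2, 2],
    'revenue':  [8000, 5000, 4000, 3000],
    'units':    [80, 50, 40, 30],
})
store_info = pd.DataFrame({'store_id': [1, 2], 'region': ['North', 'South']})

MIN_MONTHLY_REVENUE = 10_000  # stores below this are excluded from the report

# Total revenue and units sold per store
store_totals = sales.groupby('store_id')[['revenue', 'units']].sum().reset_index()

# Attach store metadata so results can be broken down by region
store_report = store_totals.merge(store_info, on='store_id', how='left')

# Keep only stores that meet the revenue threshold
qualifying_stores = store_report[store_report['revenue'] > MIN_MONTHLY_REVENUE]
print(qualifying_stores)
```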

Code is Read More Than Written

Reality:

  • You write code once
  • You (or others) read it dozens of times
  • Your future self is your most important audience

Good documentation:

  • Explains the "why", not just the "what"
  • Makes code maintainable
  • Enables collaboration
  • Saves time in the long run

"Programs must be written for people to read, and only incidentally for machines to execute." — Harold Abelson


Types of Documentation

1. Code Comments

# Calculate year-over-year growth rate
growth = (current - previous) / previous * 100

2. Docstrings

def calculate_roi(investment, return_value):
    """
    Calculate return on investment.

    Args:
        investment (float): Initial investment amount
        return_value (float): Final value

    Returns:
        float: ROI as a percentage
    """
    return (return_value - investment) / investment * 100
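A quick usage check (the numbers are arbitrary); note that the docstring also doubles as built-in help:

```python
roi = calculate_roi(investment=1000.0, return_value=1200.0)
print(f"ROI: {roi:.1f}%")  # ROI: 20.0%

help(calculate_roi)  # prints the docstring above
```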

Types of Documentation (cont.)

3. README files - we already know!

  • Project overview
  • Installation instructions
  • Usage examples
  • Dependencies

4. Analysis narratives (Jupyter notebooks)

  • Problem statement
  • Data description
  • Methodology
  • Results and interpretation

5. Technical documentation

  • API references
  • Architecture diagrams
  • Design decisions

Documentation Best Practices

Do:

  • ✅ Explain why, not just what
  • ✅ Keep it up-to-date
  • ✅ Use clear, simple language
  • ✅ Provide examples
  • ✅ Document assumptions and limitations

Don't:

  • ❌ State the obvious: x = x + 1 # increment x
  • ❌ Write novels (be concise)
  • ❌ Use jargon unnecessarily
  • ❌ Let documentation drift from code

Part 2: Jupyter Notebooks as Reports


Notebooks as Analysis Narratives

A good data analysis notebook tells a story:

  1. Introduction: What question are we answering?
  2. Data: What data are we using? Where from?
  3. Exploration: What do we see in the data?
  4. Analysis: What methods did we apply?
  5. Results: What did we find?
  6. Conclusion: What does it mean? Next steps?

Think of it as: A research paper, not a code dump


Notebook Structure Example

# Sales Analysis Q4 2025

## Executive Summary
This analysis examines Q4 sales performance across regions...

## Data Sources
- Sales data: `data/sales_q4.csv`
- Date range: Oct 1 - Dec 31, 2025
- 15,234 transactions across 4 regions

## Key Findings
1. North region outperformed by 23%
2. Mobile sales increased 45% YoY
3. December showed highest conversion rate (18.5%)

[Analysis continues...]

Markdown in Jupyter Notebooks

Headers:

# H1 - Main Title
## H2 - Section
### H3 - Subsection

Emphasis:

**bold** or __bold__
*italic* or _italic_
***bold and italic***

Lists:

- Unordered list
  - Nested item

1. Ordered list
2. Second item

More Markdown Features

Links:

[Link text](https://example.com)
[Link to section](#section-name)

Images:

![Alt text](path/to/image.png)

Code:

Inline `code`

```python
# Code block
def hello():
    print("Hello")
```

Tables:

```markdown
| Product | Sales |
|---------|-------|
| A       | $100K |
| B       | $150K |
```

Mathematical Notation (LaTeX)

Jupyter supports LaTeX for mathematical notation:

Inline: $E = mc^2$

Block:

$$
\text{ROI} = \frac{\text{Gain} - \text{Cost}}{\text{Cost}} \times 100\%
$$


Organizing Code Cells

Bad: One giant cell with everything

# DON'T DO THIS
import pandas as pd
import numpy as np
# ... 200 lines of code ...

Good: Logical, sequential cells

# Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Cell 2: Load data
df = pd.read_csv('data/sales.csv')
df.head()
# Cell 3: Data cleaning
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])

Code Cell Best Practices

1. One logical step per cell

  • Easier to debug
  • Can re-run parts independently
  • Clear progression

2. Display results

# Show what you did
print(f"Removed {null_count} null values")
print(f"Final dataset: {len(df)} rows")

3. Add markdown between code

## Data Cleaning
We need to handle missing values and convert dates.

4. Use meaningful variable names

# Good
revenue_by_region = df.groupby('region')['revenue'].sum()

# Bad
x = df.groupby('a')['b'].sum()

Visualizations in Notebooks

Always add context:

# Create visualization
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['sales'])
plt.title('Daily Sales - Q4 2025', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.grid(True, alpha=0.3)
plt.show()

**Figure 1**: Daily sales show clear weekly patterns with
peaks on Fridays and dips on Sundays. Note the spike on
Black Friday (Nov 28).

Explain what the reader should notice!


---

# Your Project Report: Required Structure

**8 sections (aligned with professor's requirements):**

1. **Abstract** (~200 words)
2. **Introduction** - research question, motivation
3. **Research Question & Literature** - context, related work
4. **Methodology** - data, algorithms, evaluation approach
5. **Implementation** - key decisions, challenges (can be brief)
6. **Codebase & Reproducibility** - how to run it (2-3 sentences OK)
7. **Results** - tables, figures, interpretation
8. **Conclusion** - summary, limitations, future work

**Appendix**: AI tools used (required if applicable)

---

# Report Tips

**Length**: ~10 pages (min 8, excluding references)

**Where to spend your pages**:
- Methodology + Results = bulk of your report
- Implementation/Codebase can be short if straightforward

**Common mistakes**:
- ❌ Too much code in the report (use appendix)
- ❌ Figures without interpretation
- ❌ Missing research question
- ❌ No discussion of limitations

**Remember**: Explain the "why", not just the "what"

---

# Part 3: Exporting & Sharing

---

# Exporting Notebooks

**1. HTML** (most common):
```bash
jupyter nbconvert --to html notebook.ipynb
```

**2. PDF** (requires LaTeX):
```bash
jupyter nbconvert --to pdf notebook.ipynb
```

**3. Markdown**:
```bash
jupyter nbconvert --to markdown notebook.ipynb
```

**4. Python script**:
```bash
jupyter nbconvert --to python notebook.ipynb
```
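A handy variant is to re-run the notebook top-to-bottom while exporting, so the report always reflects fresh outputs (the `--execute` flag is standard in recent nbconvert versions):

```bash
jupyter nbconvert --to html --execute notebook.ipynb
```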

Customizing Exports

Remove code cells (show only results):

jupyter nbconvert --to html notebook.ipynb \
  --no-input

Remove output (show only code):

jupyter nbconvert --to html notebook.ipynb \
  --no-output

Hide specific cells: Add tags in Jupyter

  • View → Cell Toolbar → Tags
  • Add "remove_cell" tag
  • Use --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}'
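Putting those pieces together, a full command might look like the following sketch (flag syntax can differ slightly between nbconvert versions):

```bash
jupyter nbconvert --to html notebook.ipynb \
  --TagRemovePreprocessor.enabled=True \
  --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}'
```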

Creating Shareable Reports

Best format for different audiences:

| Audience | Format | Why |
|----------|--------|-----|
| Non-technical | HTML (no code) | Easy to view, looks professional |
| Technical | HTML (with code) | Can see methodology |
| Collaborators | .ipynb file | Can run and modify |
| Publication | PDF | Print-ready, formal |
| Web | HTML + GitHub Pages | Publicly accessible |

Reproducibility Reminder

You already know this — just a quick checklist:

✅ Dependencies in requirements.txt, environment.yml, or pyproject.toml
✅ Random seeds set (random_state=42)
✅ Relative file paths (not /Users/yourname/...)
✅ Clear README with setup instructions
✅ Notebook runs top-to-bottom without errors

Any reproducible environment works: conda, venv, uv, poetry...
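As a minimal seeding sketch covering the usual sources of randomness (the seed value itself is arbitrary):

```python
import random
import numpy as np

SEED = 42
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG

# scikit-learn takes the seed explicitly per estimator/splitter, e.g.:
# train_test_split(X, y, random_state=SEED)
# RandomForestClassifier(random_state=SEED)
```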


Git + Jupyter Best Practices

Problem: Jupyter notebooks store cell outputs and execution metadata that change on every run, producing noisy diffs and merge conflicts in git

Solution: Clear outputs before committing

# Clear all outputs
jupyter nbconvert --clear-output --inplace notebook.ipynb

# Or use nbstripout
pip install nbstripout
nbstripout notebook.ipynb

Or: Configure git to auto-strip outputs

nbstripout --install

Hands-On Exercise: Create a Report

You have sales data for the past year. Create a professional analysis notebook:

Data: sales_2025.csv (provided)

Your report should include:

  1. Title and executive summary
  2. Data loading and initial exploration
  3. At least 3 visualizations with explanations
  4. Summary statistics by category
  5. Key findings section
  6. Recommendations

Requirements:

  • Proper markdown formatting
  • Clear section headers
  • Code cells with comments
  • Interpretations of results

Exercise: Sample Structure

# Annual Sales Analysis 2025

## Executive Summary
[Your findings in 2-3 sentences]

## Data Overview
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('data/sales_2025.csv')
df.head()
The dataset contains {len(df)} transactions from...
## Sales by Category

Writing Good Interpretations

Bad interpretation:

"The graph shows sales over time."

Good interpretation:

"Sales increased steadily through Q1-Q3, peaking at $2.1M in
September before declining 15% in Q4. The Q4 decline is
primarily driven by reduced enterprise sales, which dropped
23% compared to Q3."


Common Report Mistakes

1. No context: Jumping straight into code
2. Too much code: Showing every exploratory step
3. No interpretation: Figures without explanation
4. Unclear flow: Random order of analyses
5. No conclusion: Analysis without recommendations
6. Assuming knowledge: Not defining terms/metrics

Remember: Your notebook is a story, not just code!


Creating Templates

Save time by creating reusable templates:

# In your templates/ folder
# data_analysis_template.ipynb

"""
Contains:
- Standard imports
- Data loading section
- Exploratory analysis section
- Visualization section
- Results section
- Conclusion section
"""

Part 4: Hyperparameter Tuning


What Are Hyperparameters?

Parameters — learned from data:

  • Weights in linear regression
  • Split points in decision trees

Hyperparameters — set before training:

  • Number of trees in Random Forest
  • k in KNN
  • Learning rate
  • Max depth

Goal: Find the best hyperparameters for your model
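A small illustration of the difference, using logistic regression (the split X_train, y_train is assumed to exist, as in the examples below):

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters: chosen by you, before training
model = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)

# Parameters: learned from the data during fit
# model.fit(X_train, y_train)
# print(model.coef_, model.intercept_)
```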


GridSearchCV: Exhaustive Search

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

# Create grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,              # 5-fold cross-validation
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

GridSearchCV Results

# Best model is already fitted
best_model = grid_search.best_estimator_

# Evaluate on test set
test_score = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_score:.3f}")

# See all results
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'rank_test_score']].head()

Note: GridSearchCV tries all combinations

  • 3 × 3 × 2 = 18 combinations
  • With 5-fold CV = 90 model fits!

RandomizedSearchCV: Faster Alternative

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define distributions
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5)
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,         # Only try 20 random combinations
    cv=5,
    random_state=42
)
random_search.fit(X_train, y_train)
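As with GridSearchCV, the best combination and its cross-validated score can be read off afterwards:

```python
print(f"Best params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
```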

Hyperparameter Tuning Best Practices

1. Start simple

  • First get a baseline with defaults
  • Then tune the most impactful parameters

2. Use cross-validation

  • Never tune on test set!
  • CV prevents overfitting to validation data
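The snippets above assume an existing train/test split; as a minimal sketch, given features X and labels y (test size and seed are arbitrary):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Tune on X_train only (GridSearchCV handles validation internally via CV);
# touch X_test exactly once, for the final evaluation.
```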

3. Common parameters to tune:

| Model | Key Hyperparameters |
|-------|---------------------|
| Random Forest | n_estimators, max_depth, min_samples_split |
| KNN | n_neighbors, weights |
| Logistic Regression | C, penalty |
| SVM | C, kernel, gamma |

4. Document your search

  • Report which parameters you tried
  • Include in your methodology section

Deep Learning Hyperparameters (Simple)

If you're using a neural network, key hyperparameters:

| Parameter | What it does | Typical values |
|-----------|--------------|----------------|
| Learning rate | Step size for updates | 0.001, 0.01, 0.1 |
| Batch size | Samples per update | 16, 32, 64, 128 |
| Epochs | Training iterations | 10-100+ |
| Hidden layers | Network depth | 1-3 for simple tasks |
| Neurons per layer | Network width | 32, 64, 128 |

Simple approach: Start with defaults, then try 2-3 values for learning rate

# Example with sklearn MLPClassifier
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                    learning_rate_init=0.001,
                    max_iter=200, random_state=42)
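Following that advice, a quick sketch comparing a few learning rates with cross-validation (assuming the same X_train, y_train as in the earlier examples):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

for lr in [0.001, 0.01, 0.1]:
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                        learning_rate_init=lr,
                        max_iter=200, random_state=42)
    score = cross_val_score(mlp, X_train, y_train, cv=3).mean()
    print(f"learning_rate_init={lr}: mean CV accuracy {score:.3f}")
```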

Beyond This Course: Real Hyperparameter Tuning

For info only — what professionals use:

| Tool | What it does |
|------|--------------|
| Optuna | Smart search (Bayesian optimization) |
| Ray Tune | Distributed tuning across machines |
| W&B Sweeps | Track experiments + automatic tuning |
| Keras Tuner | Built-in for TensorFlow/Keras |

Why they exist: GridSearch doesn't scale
  • 5 hyperparameters × 5 values each = 3,125 combinations
  • Bayesian methods find good values faster
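For reference only, a minimal Optuna sketch (not required for this course; it assumes X_train, y_train as in the earlier examples):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna proposes values from these ranges on each trial
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
    }
    model = RandomForestClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction='maximize')  # maximize mean CV accuracy
study.optimize(objective, n_trials=20)
print(study.best_params)
```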

Questions?

Next week: Final project workshop - putting it all together!
