**Bad**: One giant cell with everything

```python
# DON'T DO THIS
import pandas as pd
import numpy as np
# ... 200 lines of code ...
```

**Good**: Logical, sequential cells

```python
# Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```python
# Cell 2: Load data
df = pd.read_csv('data/sales.csv')
df.head()
```

```python
# Cell 3: Data cleaning
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])
```
1. One logical step per cell

2. Display results

```python
# Show what you did (null_count was computed in the cleaning cell)
print(f"Removed {null_count} null values")
print(f"Final dataset: {len(df)} rows")
```

3. Add markdown between code

```markdown
## Data Cleaning
We need to handle missing values and convert dates.
```

4. Use meaningful variable names

```python
# Good
revenue_by_region = df.groupby('region')['revenue'].sum()

# Bad
x = df.groupby('a')['b'].sum()
```
**Always add context:**

```python
# Create visualization
plt.figure(figsize=(12, 6))
plt.plot(df['date'], df['sales'])
plt.title('Daily Sales - Q4 2025', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.grid(True, alpha=0.3)
plt.show()
```

**Figure 1**: Daily sales show clear weekly patterns with peaks on Fridays and dips on Sundays. Note the spike on Black Friday (Nov 24).

Explain what the reader should notice!
---
# Your Project Report: Required Structure
**8 sections (aligned with the professor's requirements):**
1. **Abstract** (~200 words)
2. **Introduction** - research question, motivation
3. **Research Question & Literature** - context, related work
4. **Methodology** - data, algorithms, evaluation approach
5. **Implementation** - key decisions, challenges (can be brief)
6. **Codebase & Reproducibility** - how to run it (2-3 sentences OK)
7. **Results** - tables, figures, interpretation
8. **Conclusion** - summary, limitations, future work
**Appendix**: AI tools used (required if applicable)
---
# Report Tips
**Length**: ~10 pages (min 8, excluding references)
**Where to spend your pages**:
- Methodology + Results = bulk of your report
- Implementation/Codebase can be short if straightforward
**Common mistakes**:

- Too much code in the report (use appendix)
- Figures without interpretation
- Missing research question
- No discussion of limitations
**Remember**: Explain the "why", not just the "what"
---
# Part 3: Exporting & Sharing
---
# Exporting Notebooks
**1. HTML** (most common):

```bash
jupyter nbconvert --to html notebook.ipynb
```

**2. PDF** (requires LaTeX):

```bash
jupyter nbconvert --to pdf notebook.ipynb
```

**3. Markdown**:

```bash
jupyter nbconvert --to markdown notebook.ipynb
```

**4. Python script**:

```bash
jupyter nbconvert --to python notebook.ipynb
```

**Remove code cells** (show only results):

```bash
jupyter nbconvert --to html notebook.ipynb \
  --no-input
```

**Remove output** (show only code):

```bash
jupyter nbconvert --to html notebook.ipynb \
  --no-output
```
**Hide specific cells**: Add tags in Jupyter, then drop tagged cells on export:

```bash
jupyter nbconvert --to html notebook.ipynb \
  --TagRemovePreprocessor.remove_cell_tags='{"remove_cell"}'
```

**Best format for different audiences:**
| Audience | Format | Why |
|---|---|---|
| Non-technical | HTML (no code) | Easy to view, looks professional |
| Technical | HTML (with code) | Can see methodology |
| Collaborators | .ipynb file | Can run and modify |
| Publication | PDF | Print-ready, formal |
| Web | HTML + GitHub Pages | Publicly accessible |
You already know this — just a quick checklist:

- Dependencies in `requirements.txt`, `environment.yml`, or `pyproject.toml`
- Random seeds set (`random_state=42`)
- Relative file paths (not `/Users/yourname/...`)
- Clear README with setup instructions
- Notebook runs top-to-bottom without errors
Any reproducible environment works: conda, venv, uv, poetry...
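For the seed and path items above, a minimal sketch (it reuses the `data/sales.csv` path from the earlier example):

```python
from pathlib import Path

import numpy as np
import pandas as pd

np.random.seed(42)                        # fixed seed: reproducible randomness

DATA_DIR = Path("data")                   # relative path, works on any machine
df = pd.read_csv(DATA_DIR / "sales.csv")  # no absolute /Users/yourname/... paths
```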
**Problem**: Jupyter notebooks have metadata and outputs that change

**Solution**: Clear outputs before committing

```bash
# Clear all outputs
jupyter nbconvert --clear-output --inplace notebook.ipynb

# Or use nbstripout
pip install nbstripout
nbstripout notebook.ipynb
```

**Or**: Configure git to auto-strip outputs

```bash
nbstripout --install
```
You have sales data for the past year. Create a professional analysis notebook.

**Data**: `sales_2025.csv` (provided)

**Requirements**: your report should include the structure below:

```markdown
# Annual Sales Analysis 2025

## Executive Summary
[Your findings in 2-3 sentences]

## Data Overview
```

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data/sales_2025.csv')
df.head()
```

```markdown
The dataset contains {len(df)} transactions from...

## Sales by Category
```
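A code cell for that last section might look like this (a sketch only; the `category` and `revenue` column names are assumptions about the provided file):

```python
# Aggregate revenue by product category and plot it
category_sales = df.groupby('category')['revenue'].sum().sort_values()

category_sales.plot(kind='barh', figsize=(8, 4))
plt.title('Total Sales by Category, 2025')
plt.xlabel('Revenue ($)')
plt.tight_layout()
plt.show()
```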
**Bad interpretation:**

"The graph shows sales over time."

**Good interpretation:**

"Sales increased steadily through Q1-Q3, peaking at $2.1M in September before declining 15% in Q4. The Q4 decline is primarily driven by reduced enterprise sales, which dropped 23% compared to Q3."
1. No context: Jumping straight into code
2. Too much code: Showing every exploratory step
3. No interpretation: Figures without explanation
4. Unclear flow: Random order of analyses
5. No conclusion: Analysis without recommendations
6. Assuming knowledge: Not defining terms/metrics
**Remember**: Your notebook is a story, not just code!
Save time by creating reusable templates:
```python
# In your templates/ folder
# data_analysis_template.ipynb
"""
Contains:
- Standard imports
- Data loading section
- Exploratory analysis section
- Visualization section
- Results section
- Conclusion section
"""
```
**Parameters** — learned from data (e.g., model weights)

**Hyperparameters** — set before training (e.g., `k` in KNN)

**Goal**: Find the best hyperparameters for your model
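A quick illustration of the difference, using logistic regression on a synthetic dataset (a minimal sketch; the dataset and model choice are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=42)

# Hyperparameter: C is chosen by you *before* training
model = LogisticRegression(C=1.0)

# Parameters: coefficients and intercept are learned *from the data*
model.fit(X, y)
print(model.coef_.shape)   # learned weights
print(model.intercept_)    # learned bias
```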
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5]
}

# Create grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy'
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```

```python
# Best model is already fitted
best_model = grid_search.best_estimator_

# Evaluate on test set
test_score = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_score:.3f}")
```

```python
# See all results
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'rank_test_score']].head()
```
**Note**: GridSearchCV tries every combination: here 3 × 3 × 2 = 18 settings, each fit 5 times for cross-validation (90 model fits in total)
```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform

# Define distributions
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,  # Only try 20 random combinations
    cv=5,
    random_state=42
)

random_search.fit(X_train, y_train)
```
1. Start simple
2. Use cross-validation
3. Common parameters to tune:
| Model | Key Hyperparameters |
|---|---|
| Random Forest | n_estimators, max_depth, min_samples_split |
| KNN | n_neighbors, weights |
| Logistic Regression | C, penalty |
| SVM | C, kernel, gamma |
4. Document your search
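For point 4, one lightweight way to do it (a sketch, assuming the `grid_search` object fitted on the earlier slide and a hypothetical `results/` folder):

```python
from pathlib import Path
import pandas as pd

Path("results").mkdir(exist_ok=True)  # hypothetical folder for experiment logs

# Save every combination tried, with its cross-validated score
pd.DataFrame(grid_search.cv_results_).to_csv("results/rf_grid_search.csv", index=False)

# Record the winning configuration so the report can cite it
with open("results/rf_grid_search_best.txt", "w") as f:
    f.write(f"{grid_search.best_params_}  (CV accuracy {grid_search.best_score_:.3f})\n")
```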
If you're using a neural network, key hyperparameters:
| Parameter | What it does | Typical values |
|---|---|---|
| Learning rate | Step size for updates | 0.001, 0.01, 0.1 |
| Batch size | Samples per update | 16, 32, 64, 128 |
| Epochs | Training iterations | 10-100+ |
| Hidden layers | Network depth | 1-3 for simple tasks |
| Neurons per layer | Network width | 32, 64, 128 |
**Simple approach**: Start with defaults, then try 2-3 values for learning rate

```python
# Example with sklearn MLPClassifier
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                    learning_rate_init=0.001,
                    max_iter=200, random_state=42)
```
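A minimal sketch of that "try a few learning rates" step, on a synthetic dataset (illustrative only; swap in your own data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Compare a handful of learning rates with cross-validation
for lr in [0.001, 0.01, 0.1]:
    mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                        learning_rate_init=lr,
                        max_iter=200, random_state=42)
    score = cross_val_score(mlp, X, y, cv=3).mean()
    print(f"learning_rate_init={lr}: mean CV accuracy {score:.3f}")
```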
For info only — what professionals use:
| Tool | What it does |
|---|---|
| Optuna | Smart search (Bayesian optimization) |
| Ray Tune | Distributed tuning across machines |
| W&B Sweeps | Track experiments + automatic tuning |
| Keras Tuner | Built-in for TensorFlow/Keras |
**Why they exist**: GridSearch doesn't scale
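For info only, a minimal Optuna sketch tuning the same random forest (purely illustrative, not something you need for the project):

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

def objective(trial):
    # Optuna proposes hyperparameters instead of trying every combination
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
    }
    model = RandomForestClassifier(random_state=42, **params)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```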
Next week: Final project workshop - putting it all together!