Scientific Computing in Python
This lesson introduces the core scientific computing libraries that form the foundation of data science in Python: NumPy for numerical computations and Pandas for data manipulation.
Learning Objectives
By the end of this lesson, you will:
- Create and manipulate NumPy arrays for efficient numerical computing
- Apply vectorization techniques for performance optimization
- Use Pandas for data cleaning, transformation, and analysis
- Perform grouping and aggregation operations on datasets
- Work effectively with Jupyter notebooks for exploratory data analysis
- Understand the scientific computing workflow for research
NumPy: Numerical Computing Foundation
Why NumPy?
- Performance: Core operations run in compiled C code, often orders of magnitude faster than pure Python loops (see the timing sketch after this list)
- Vectorization: Operate on entire arrays without explicit loops
- Memory Efficient: Homogeneous data types, compact storage
- Foundation: Base for other scientific libraries (Pandas, SciPy, scikit-learn)
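The performance claim is easy to check with a quick timing comparison. This is a minimal sketch; the array size is arbitrary and exact timings will vary by machine:
import time
import numpy as np
data = np.random.random(1_000_000)
# Pure-Python loop: processes one element at a time
start = time.perf_counter()
total = 0.0
for value in data:
    total += value ** 2
loop_time = time.perf_counter() - start
# Vectorized: the loop runs in compiled C inside NumPy
start = time.perf_counter()
total = np.sum(data ** 2)
vec_time = time.perf_counter() - start
print(f"Loop: {loop_time:.3f}s  Vectorized: {vec_time:.4f}s")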
NumPy Arrays Basics
import numpy as np
# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Array properties
print(arr1.shape) # (5,)
print(arr2.shape) # (2, 3)
print(arr1.dtype) # int64
print(arr2.ndim) # 2
# Creating special arrays
zeros = np.zeros((3, 4))
ones = np.ones((2, 5))
identity = np.eye(3)
random_data = np.random.random((100, 5))
Array Operations and Vectorization
# Mathematical operations (vectorized)
prices = np.array([100, 105, 98, 110, 103])
returns = (prices[1:] - prices[:-1]) / prices[:-1]
# Broadcasting
portfolio_weights = np.array([0.3, 0.2, 0.5])
asset_returns = np.array([[0.05, 0.02, 0.08],
                          [0.03, 0.04, 0.06],
                          [-0.01, 0.03, 0.04]])
portfolio_returns = np.sum(asset_returns * portfolio_weights, axis=1)
# Statistical operations
mean_return = np.mean(portfolio_returns)
volatility = np.std(portfolio_returns)
correlation_matrix = np.corrcoef(asset_returns.T)
Advanced NumPy Features
# Boolean indexing
high_returns = portfolio_returns[portfolio_returns > 0.04]
# Fancy indexing
top_performers = asset_returns[:, [0, 2]] # Select columns 0 and 2
# Linear algebra
covariance_matrix = np.cov(asset_returns.T)
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
# Financial calculations
def calculate_sharpe_ratio(returns, risk_free_rate=0.02):
    excess_returns = returns - risk_free_rate
    return np.mean(excess_returns) / np.std(excess_returns)
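For example, applied to the portfolio_returns array computed above (the 0.02 default is an illustrative risk-free rate, not calibrated to the return frequency):
sharpe = calculate_sharpe_ratio(portfolio_returns)
print(f"Sharpe ratio: {sharpe:.2f}")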
Pandas: Data Manipulation and Analysis
Core Data Structures
import pandas as pd
# Series: 1-dimensional labeled array
stock_prices = pd.Series([100, 105, 98, 110, 103],
                         index=['2025-01-01', '2025-01-02', '2025-01-03',
                                '2025-01-04', '2025-01-05'])
# DataFrame: 2-dimensional labeled data structure
portfolio_data = pd.DataFrame({
    'Symbol': ['AAPL', 'GOOGL', 'MSFT', 'AMZN'],
    'Price': [150.25, 2800.50, 310.75, 3400.25],
    'Shares': [100, 10, 50, 5],
    'Sector': ['Technology', 'Technology', 'Technology', 'Consumer']
})
Data Loading and Inspection
# Reading data from various sources
df = pd.read_csv('financial_data.csv')
df = pd.read_excel('portfolio.xlsx')
df = pd.read_json('market_data.json')
# Basic inspection
print(df.head()) # First 5 rows
print(df.tail(3)) # Last 3 rows
print(df.info()) # Data types and memory usage
print(df.describe()) # Statistical summary
print(df.shape) # (rows, columns)
print(df.columns.tolist()) # Column names
Data Cleaning and Transformation
# Handling missing data
df_clean = df.dropna() # Remove rows with any NaN
df_filled = df.ffill() # Forward fill (fillna(method='ffill') is deprecated)
df_interpolated = df.interpolate() # Linear interpolation
# Data type conversion
df['Date'] = pd.to_datetime(df['Date'])
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
# String operations
df['Symbol'] = df['Symbol'].str.upper()
df['Company'] = df['Company'].str.replace('Inc.', 'Inc')
# Creating new columns
df['Market_Value'] = df['Price'] * df['Shares']
df['Weight'] = df['Market_Value'] / df['Market_Value'].sum()
# Date/time operations
df.set_index('Date', inplace=True)
df['Year'] = df.index.year
df['Month'] = df.index.month
df['Quarter'] = df.index.quarter
Data Selection and Filtering
# Column selection
prices = df['Price']
subset = df[['Symbol', 'Price', 'Shares']]
# Row selection
first_10 = df.head(10)
last_month = df.loc['2025-01-01':'2025-01-31']
# Boolean filtering
high_value = df[df['Market_Value'] > 10000]
tech_stocks = df[df['Sector'] == 'Technology']
recent_data = df[df.index > '2025-01-01']
# Complex filtering
large_tech = df[(df['Sector'] == 'Technology') &
                (df['Market_Value'] > 5000)]
Grouping and Aggregation
# Group by single column
sector_analysis = df.groupby('Sector').agg({
    'Market_Value': ['sum', 'mean', 'count'],
    'Price': ['min', 'max'],
    'Shares': 'sum'
})
# Group by multiple columns
monthly_sector = df.groupby([df.index.to_period('M'), 'Sector']).agg({
    'Price': 'mean',
    'Volume': 'sum'
})
# Custom aggregation functions
def portfolio_metrics(group):
    return pd.Series({
        'total_value': group['Market_Value'].sum(),
        'avg_price': group['Price'].mean(),
        'price_volatility': group['Price'].std(),
        'stock_count': len(group)
    })

sector_metrics = df.groupby('Sector').apply(portfolio_metrics)
Time Series Operations
# Resampling time series data
daily_data = df.resample('D').mean(numeric_only=True) # Daily averages of numeric columns
monthly_data = df.resample('ME').last()  # Month-end values ('M' in pandas < 2.2)
quarterly_data = df.resample('QE').agg({ # Quarterly aggregation ('Q' in pandas < 2.2)
    'Price': 'last',
    'Volume': 'sum'
})
# Rolling calculations
df['MA_20'] = df['Price'].rolling(window=20).mean()
df['Volatility_30'] = df['Returns'].rolling(window=30).std()
df['Cumulative_Return'] = (1 + df['Returns']).cumprod()
# Lag and lead operations
df['Price_Lag1'] = df['Price'].shift(1)
df['Price_Lead1'] = df['Price'].shift(-1)
df['Price_Change'] = df['Price'] - df['Price_Lag1']
Jupyter Notebooks: Interactive Computing
Why Jupyter Notebooks?
- Interactive Development: Test code incrementally
- Rich Output: Display plots, tables, and formatted text
- Documentation: Combine code, explanations, and results
- Reproducibility: Share complete analysis workflows
Best Practices for Jupyter
# Cell organization
# 1. Import all libraries at the top
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# 2. Define functions in separate cells
def calculate_portfolio_return(weights, returns):
    return np.sum(weights * returns)
# 3. Load and explore data
df = pd.read_csv('data.csv')
df.head()
# 4. Analysis in logical steps
# Data cleaning
# Exploratory analysis
# Visualization
# Statistical analysis
# Conclusions
Jupyter Magic Commands
# Timing code execution
%time result = expensive_function()
%%time
# Time entire cell
for i in range(1000000):
    pass
# Load external Python files
%load financial_functions.py
# Display matplotlib plots inline
%matplotlib inline
# Get help
?pd.DataFrame
??pd.read_csv # Show source code (works for functions written in Python; compiled functions like np.array only show their docstring)
Scientific Computing Workflow
Typical Research Workflow
- Data Acquisition: Load from files, APIs, databases
- Data Exploration: Understand structure, quality, patterns
- Data Cleaning: Handle missing values, outliers, inconsistencies
- Feature Engineering: Create new variables, transformations
- Analysis: Statistical tests, modeling, machine learning
- Visualization: Create plots and charts for insights
- Reporting: Document findings and methodology (a minimal end-to-end skeleton follows)
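The sketch below strings these seven steps together in one place. It is illustrative only: the file prices.csv and its Date and Price columns are hypothetical placeholders, not a real dataset.
import pandas as pd
import matplotlib.pyplot as plt
# 1. Acquisition (hypothetical CSV with Date and Price columns)
df = pd.read_csv('prices.csv', parse_dates=['Date'], index_col='Date')
# 2-3. Exploration and cleaning
print(df.info())
df = df.dropna()
# 4. Feature engineering
df['Returns'] = df['Price'].pct_change()
# 5. Analysis
summary = df['Returns'].describe()
# 6. Visualization
df['Price'].plot(title='Price over time')
plt.show()
# 7. Reporting: persist results alongside the notebook
summary.to_csv('returns_summary.csv')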
Performance Considerations
# Vectorization vs. loops
# Slow: explicit loop
total = 0
for value in large_array:
    total += value ** 2
# Fast: vectorized operation
total = np.sum(large_array ** 2)
# Memory efficiency with chunks
def process_large_dataset(filename):
    chunk_size = 10000
    results = []
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        processed_chunk = process_chunk(chunk)  # process_chunk: your per-chunk logic
        results.append(processed_chunk)
    return pd.concat(results, ignore_index=True)
Hands-on Activities
Activity 1: Portfolio Analysis with NumPy
- Create a multi-asset portfolio simulation
- Calculate returns, volatility, and correlations
- Implement portfolio optimization functions (a starter sketch follows)
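A possible starting point, using simulated data. The seed, the 252 trading days, and the normal-return parameters are all illustrative assumptions, not market data:
import numpy as np
rng = np.random.default_rng(seed=42)
# One year of simulated daily returns for 4 assets
n_days, n_assets = 252, 4
returns = rng.normal(loc=0.0005, scale=0.01, size=(n_days, n_assets))
weights = np.array([0.4, 0.3, 0.2, 0.1])
portfolio_returns = returns @ weights
annual_return = np.mean(portfolio_returns) * 252
annual_volatility = np.std(portfolio_returns) * np.sqrt(252)
correlations = np.corrcoef(returns.T)
print(f"Annualized return: {annual_return:.2%}")
print(f"Annualized volatility: {annual_volatility:.2%}")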
Activity 2: Financial Data Analysis with Pandas
- Load historical stock price data
- Clean and transform the data
- Calculate technical indicators (see the sketch after this list)
- Perform sector-based analysis
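A short sketch of two common indicators; the DataFrame df and its Price column are hypothetical stand-ins for whatever historical data you load:
# df is assumed to have a DatetimeIndex and a 'Price' column
df['SMA_20'] = df['Price'].rolling(window=20).mean() # 20-day simple moving average
df['EMA_12'] = df['Price'].ewm(span=12).mean()       # 12-day exponential moving average
df['Daily_Return'] = df['Price'].pct_change()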
Activity 3: Jupyter Notebook Report
- Create a complete analysis notebook
- Include data loading, cleaning, analysis, and visualization
- Add markdown explanations and conclusions
- Ensure reproducibility
Integration with Previous Concepts
Object-Oriented Design with NumPy/Pandas
class PortfolioAnalyzer:
    def __init__(self, data):
        self.data = data
        self.returns = None
        self.weights = None

    def calculate_returns(self):
        self.returns = self.data.pct_change().dropna()
        return self.returns

    def set_weights(self, weights):
        self.weights = np.array(weights)

    def portfolio_return(self):
        return np.sum(self.returns.mean() * self.weights)

    def portfolio_volatility(self):
        cov_matrix = self.returns.cov()
        return np.sqrt(np.dot(self.weights.T,
                              np.dot(cov_matrix, self.weights)))
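A brief usage sketch with hypothetical price data (three assets over five days, values invented for illustration):
import pandas as pd
price_df = pd.DataFrame({
    'AAPL': [150.0, 152.0, 151.0, 155.0, 154.0],
    'MSFT': [310.0, 312.0, 309.0, 315.0, 318.0],
    'GOOGL': [2800.0, 2810.0, 2790.0, 2850.0, 2860.0]
})
analyzer = PortfolioAnalyzer(price_df)
analyzer.calculate_returns()
analyzer.set_weights([0.4, 0.3, 0.3])
print(f"Expected daily return: {analyzer.portfolio_return():.4%}")
print(f"Daily volatility: {analyzer.portfolio_volatility():.4%}")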
Using AI for Scientific Computing
- Ask AI to explain complex NumPy operations
- Generate Pandas code for specific data transformations
- Get help with debugging vectorization issues
- Create test datasets for experimentation
Assessment
This lesson contributes to:
- Proficiency in NumPy for numerical computing
- Skills in Pandas for data manipulation
- Effective use of Jupyter notebooks
- Foundation for advanced data science techniques
Next Steps
With Python fundamentals and scientific computing skills established, we’ll move into Statistical Learning concepts, starting with visualization and the statistical learning framework.