Claude vs DeepSeek for Data Analysis
Claude is the stronger choice for data analysis: it accepts file uploads, writes clearer explanations of findings, and performs slightly better on reasoning-heavy tasks. DeepSeek offers compelling value if cost is your primary concern and you're comfortable pasting data into the prompt instead of uploading files. Both models handle math and statistics well, but Claude's ability to process datasets directly gives it a clear edge. For exploratory analysis where iteration matters, Claude's pricing is worth the investment; for budget-constrained projects, DeepSeek is a viable alternative.
Head-to-Head for Data Analysis
| Criteria | Claude | DeepSeek | Winner |
|---|---|---|---|
| Mathematical reasoning | 95.6% on AIME 2025 | 93.1% on AIME 2025 | Claude |
| Dataset import capabilities | Supports file uploads | No file uploads | Claude |
| Statistical code generation | 79.6% on coding benchmark | 73.1% on coding benchmark | Claude |
| Cost efficiency for bulk analysis | $3.00 / $15.00 per 1M tokens (input/output) | $0.56 / $1.68 per 1M tokens (input/output) | DeepSeek |
| Complex analysis reasoning | Extended thinking mode | DeepSeek R1 reasoning mode | Tie |
See Them Answer a Data Analysis Question
We asked both models the same data analysis-focused prompt.
“Analyze this dataset of customer churn records and identify the top 3 predictive features, then write a Python script to visualize the correlation matrix and generate actionable retention recommendations.”
Customer Churn Analysis: Feature Identification, Visualization & Retention Recommendations
Overview
I'll build a complete, well-documented pipeline that handles data loading, feature analysis, visualization, and business recommendations.
Complete Python Script
"""
Customer Churn Analysis Pipeline
================================
Identifies top predictive features, visualizes correlations,
and generates actionable retention recommendations.
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')
# ─────────────────────────────────────────────
# 1. DATA GENERATION (replace with real data)
# ─────────────────────────────────────────────
def generate_churn_dataset(n: int = 5_000, seed: int = 42) -> pd.DataFrame:
"""
Generates a realistic synthetic churn dataset.
Replace this function with your actual data loader, e.g.:
pd.read_csv('churn_data.csv')
"""
rng = np.random.default_rng(seed)
tenure = rng.integers(1, 72, n) # months
monthly_charges = rng.uniform(20, 120, n) # dollars
total_charges = tenure * monthly_charges * rng.uniform(0.9, 1.1, n)
num_products = rng.integers(1, 6, n)
support_calls = rng.integers(0, 15, n)
contract_type = rng.choice(['Month-to-Month', 'One Year', 'Two Year'], n,
p=[0.55, 0.25, 0.20])
payment_method = rng.choice(['Electronic Check', 'Mailed Check',
'Bank Transfer', 'Credit Card'], n)
internet_service= rng.choice(['DSL', 'Fiber Optic', 'None'], n,
p=[0.35, 0.45, 0.20])
age = rng.integers(18, 80, n)
satisfaction = rng.integers(1, 6, n) # 1-5 scale
login_freq = rng.integers(0, 30, n) # logins/month
# Realistic churn probability driven by features
churn_prob = (
0.40 * (tenure < 12).astype(float)
+ 0.25 * (monthly_charges > 80).astype(float)
+ 0.20 * (support_calls > 8).astype(float)
+ 0.30 * (contract_type == 'Month-to-Month').astype(float)
- 0.20 * (num_products > 3).astype(float)
- 0.15 * (satisfaction > 3).astype(float)
+ 0.10 * rng.uniform(0, 1, n) # noise
)
churn_prob = np.clip(churn_prob / churn_prob.max(), 0.05, 0.95)
churn = (rng.uniform(0, 1, n) < churn_prob).astype(int)
df = pd.DataFrame({
'CustomerID' : [f'CUST-{i:05d}' for i in range(n)],
'Tenure' : tenure,
'MonthlyCharges' : monthly_charges.round(2),
'TotalCharges' : total_charges.round(2),
'NumProducts' : num_products,
'SupportCalls' : support_calls,
'ContractType' : contract_type,
'PaymentMethod' : payment_method,
'InternetService' : internet_service,
'Age' : age,
'Satisfaction' : satisfaction,
'LoginFrequency' : login_freq,
'Churn' : churn
})
return df
# ─────────────────────────────────────────────
# 2. PREPROCESSING
# ─────────────────────────────────────────────
def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series,
list[str], dict]:
"""
Encodes categoricals, scales numerics, returns feature matrix.
Returns: X, y, feature_names, encoders
"""
df = df.copy()
# ── Drop ID column ────────────────────────
df.drop(columns=['CustomerID'], errors='ignore', inplace=True)
# ── Separate target ───────────────────────
y = df.pop('Churn')
# ── Identify column types ─────────────────
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
encoders = {}
for col in cat_cols:
le = LabelEncoder()
df[col] = le.fit_transform(df[col].astype(str))
encoders[col] = le
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
encoders['scaler'] = scaler
return df, y, df.columns.tolist(), encoders
# ─────────────────────────────────────────────
# 3. FEATURE IMPORTANCE ANALYSIS
# ─────────────────────────────────────────────
def identify_top_features(X: pd.DataFrame, y: pd.Series,
top_n: int = 3) -> pd.DataFrame:
"""
Combines three importance signals for robust feature ranking:
• Random Forest impurity importance
• Gradient Boosting importance
• Permutation importance (model-agnostic)
"""
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
# ── Model 1: Random Forest ─────────────────
rf = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
rf.fit(X_tr, y_tr)
rf_imp = pd.Series(rf.feature_importances_, index=X.columns)
# ── Model 2: Gradient Boosting ─────────────
gb = GradientBoostingClassifier(n_estimators=200, random_state=42)
gb.fit(X_tr, y_tr)
gb_imp = pd.Series(gb.feature_importances_, index=X.columns)
# ── Model 3: Permutation Importance ────────
perm = permutation_importance(rf, X_te, y_te, n_repeats=15,
random_state=42, n_jobs=-1)
perm_imp = pd.Series(perm.importances_mean, index=X.columns)
# ── Normalise & combine ────────────────────
def norm(s): return (s - s.min()) / (s.max() - s.min() + 1e-9)
importance_df = pd.DataFrame({
'RF_Importance' : norm(rf_imp),
'GB_Importance' : norm(gb_imp),
'Perm_Importance' : norm(perm_imp),
})
importance_df['Composite_Score'] = importance_df.mean(axis=1)
importance_df.sort_values('Composite_Score', ascending=False, inplace=True)
# ── CV AUC scores ──────────────────────────
auc_rf = cross_val_score(rf, X, y, cv=5, scoring='roc_auc').mean()
auc_gb = cross_val_score(gb, X, y, cv=5, scoring='roc_auc').mean()
print(f"\n{'='*55}")
print(f" Model Performance (5-Fold CV AUC)")
print(f" Random Forest : {auc_rf:.4f}")
print(f" Gradient Boosting : {auc_gb:.4f}")
print(f"{'='*55}")
print(f"\n Top {top_n} Predictive Features:")
print(f" {'Feature':<22} {'Score':>8}")
print(f" {'-'*32}")
for feat, row in importance_df.head(top_n).iterrows():
print(f" {feat:<22} {row['Composite_Score']:>8.4f}")
print(f"{'='*55}\n")
return importance_df
# ─────────────────────────────────────────────
# 4. VISUALISATION
# ─────────────────────────────────────────────
PALETTE = {'primary': '#2E86AB', 'danger': '#E84855',
'success': '#3BB273', 'warning': '#F4A261', 'neutral': '#6C757D'}
PLT_STYLE = {'figure.facecolor': '#0F1117', 'axes.facecolor': '#1A1D2E',
'axes.edgecolor': '#3A3D52', 'text.color': '#E8E8F0',
'axes.labelcolor': '#E8E8F0', 'xtick.color': '#A0A0C0',
'ytick.color': '#A0A0C0', 'grid.color': '#2A2D3E',
'grid.alpha': 0.6}
plt.rcParams.update(PLT_STYLE)
def plot_all(df_raw: pd.DataFrame, importance_df: pd.DataFrame) -> None:
"""Master visualisation: 6 coordinated panels."""
# ── Prepare numeric-only frame ─────────────
num_df = df_raw.select_dtypes(include=[np.number])
fig = plt.figure(figsize=(22, 18))
fig.suptitle('Customer Churn Analysis Dashboard',
fontsize=22, fontweight='bold',
color='#E8E8F0', y=0.98)
gs = gridspec.GridSpec(3, 3, figure=fig,
hspace=0.45, wspace=0.40)
# ╔══════════════════════════════════════════╗
# ║ PANEL 1 – Full Correlation Matrix ║
# ╚══════════════════════════════════════════╝
ax1 = fig.add_subplot(gs[0, :2])
corr = num_df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
annot=True, fmt='.2f', linewidths=0.5,
linecolor='#2A2D3E', ax=ax1,
annot_kws={'size': 8, 'color': '#E8E8F0'},
cbar_kws={'shrink': 0.8})
ax1.set_title('Feature Correlation Matrix (Lower Triangle)',
fontsize=13, pad=10, color='#E8E8F0')
ax1.tick_params(axis='x', rotation=45, labelsize=9)
ax1.tick_params(axis='y', rotation=0, labelsize=9)
# ╔══════════════════════════════════════════╗
# ║ PANEL 2 – Churn Rate Donut ║
# ╚══════════════════════════════════════════╝
ax2 = fig.add_subplot(gs[0, 2])
    # Fix the wedge order explicitly: value_counts() sorts by frequency,
    # which would mismatch the labels/colors if churners outnumbered retained.
    churn_counts = df_raw['Churn'].value_counts().reindex([0, 1])
    labels = [f"Retained\n{churn_counts[0]:,}", f"Churned\n{churn_counts[1]:,}"]
colors = [PALETTE['success'], PALETTE['danger']]
wedges, texts, autotexts = ax2.pie(
churn_counts, labels=labels, autopct='%1.1f%%',
colors=colors, startangle=90,
wedgeprops=dict(width=0.55, edgecolor='#1A1D2E', linewidth=2),
textprops={'color': '#E8E8F0', 'fontsize': 10})
for at in autotexts:
at.set_fontsize(11)
at.set_fontweight('bold')
ax2.set_title('Churn Distribution', fontsize=13,
pad=10, color='#E8E8F0')
# ╔══════════════════════════════════════════╗
# ║ PANEL 3 – Composite Feature Importance ║
# ╚══════════════════════════════════════════╝
ax3 = fig.add_subplot(gs[1, :2])
top10 = importance_df.head(10)
colors_bar = [PALETTE['danger'] if i < 3 else PALETTE['primary']
for i in range(len(top10))]
bars = ax3.barh(top10.index[::-1], top10['Composite_Score'][::-1],
color=colors_bar[::-1], edgecolor='none', height=0.65)
for bar, score in zip(bars, top10['Composite_Score'][::-1]):
ax3.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,
f'{score:.3f}', va='center', ha='left',
color='#E8E8F0', fontsize=9)
ax3.set_xlabel('Composite Importance Score', fontsize=10)
ax3.set_title('Feature Importance (RF + GB + Permutation)',
fontsize=13, pad=10, color='#E8E8F0')
ax3.axvline(x=0, color='#3A3D52', linewidth=0.8)
ax3.set_xlim(0, top10['Composite_Score'].max() * 1.18)
# Legend for top-3 highlight
from matplotlib.patches import Patch
legend_elems = [Patch(facecolor=PALETTE['danger'], label='Top 3 Features'),
Patch(facecolor=PALETTE['primary'], label='Other Features')]
ax3.legend(handles=legend_elems, loc='lower right',
facecolor='#1A1D2E', edgecolor='#3A3D52',
labelcolor='#E8E8F0', fontsize=9)
# ╔══════════════════════════════════════════╗
# ║ PANEL 4 – Churn by Contract Type ║
# ╚══════════════════════════════════════════╝
ax4 = fig.add_subplot(gs[1, 2])
contract_churn = (df_raw.groupby('ContractType')['Churn']
.agg(['mean', 'count'])
.rename(columns={'mean': 'ChurnRate', 'count': 'Count'}))
contract_churn['ChurnRate'] *= 100
bar_colors = [PALETTE['danger'] if r > 30 else PALETTE['warning']
if r > 15 else PALETTE['success']
for r in contract_churn['ChurnRate']]
ax4.bar(contract_churn.index, contract_churn['ChurnRate'],
color=bar_colors, edgecolor='none', width=0.55)
ax4.axhline(y=df_raw['Churn'].mean()*100, color='#A0A0C0',
linestyle='--', linewidth=1.2, label='Overall avg')
for i, (idx, row) in enumerate(contract_churn.iterrows()):
ax4.text(i, row['ChurnRate'] + 0.8, f"{row['ChurnRate']:.1f}%",
ha='center', va='bottom', color='#E8E8F0',
fontsize=10, fontweight='bold')
ax4.text(i, -3.5, f"n={row['Count']:,}", ha='center',
color='#A0A0C0', fontsize=8)
ax4.set_title('Churn Rate by Contract Type',
fontsize=13, pad=10, color='#E8E8F0')
ax4.set_ylabel('Churn Rate (%)', fontsize=10)
ax4.set_ylim(-6, contract_churn['ChurnRate'].max() * 1.2)
ax4.tick_params(axis='x', rotation=10, labelsize=9)
ax4.legend(facecolor='#1A1D2E', edgecolor='#3A3D52',
labelcolor='#E8E8F0', fontsize=9)
# ╔══════════════════════════════════════════╗
# ║ PANEL 5 – Tenure vs Monthly Charges ║
# ╚══════════════════════════════════════════╝
ax5 = fig.add_subplot(gs[2, :2])
churned = df_raw[df_raw['Churn'] == 1]
not_churned = df_raw[df_raw['Churn'] == 0]
ax5.scatter(not_churned['Tenure'], not_churned['MonthlyCharges'],
c=PALETTE['success'], alpha=0.25, s=18, label='Retained',
edgecolors='none')
ax5.scatter(churned['Tenure'], churned['MonthlyCharges'],
c=PALETTE['danger'], alpha=0.45, s=22, label='Churned',
edgecolors='none')
# Risk zone annotation
ax5.axvspan(0, 12, alpha=0.07, color=PALETTE['danger'],
label='High-Risk Tenure Zone (<12 mo)')
ax5.axhspan(80, df_raw['MonthlyCharges'].max(),
alpha=0.06, color=PALETTE['warning'],
label='High-Charge Zone (>$80)')
ax5.axvline(x=12, color=PALETTE['danger'], linestyle=':', linewidth=1.2)
ax5.axhline(y=80, color=PALETTE['warning'], linestyle=':', linewidth=1.2)
ax5.set_xlabel('Tenure (months)', fontsize=10)
ax5.set_ylabel('Monthly Charges ($)', fontsize=10)
ax5.set_title('Tenure vs Monthly Charges (Churn Overlay)',
fontsize=13, pad=10, color='#E8E8F0')
ax5.legend(facecolor='#1A1D2E', edgecolor='#3A3D52',
labelcolor='#E8E8F0', fontsize=8, ncol=2)
# ╔══════════════════════════════════════════╗
# ║ PANEL 6 – Support Calls Distribution ║
# ╚══════════════════════════════════════════╝
ax6 = fig.add_subplot(gs[2, 2])
bins = range(0, df_raw['SupportCalls'].max() + 2)
ax6.hist(not_churned['SupportCalls'], bins=bins, alpha=0.65,
color=PALETTE['success'], label='Retained',
edgecolor='#1A1D2E', linewidth=0.4, density=True)
ax6.hist(churned['SupportCalls'], bins=bins, alpha=0.65,
color=PALETTE['danger'], label='Churned',
edgecolor='#1A1D2E', linewidth=0.4, density=True)
ax6.axvline(x=not_churned['SupportCalls'].mean(),
color=PALETTE['success'], linestyle='--', linewidth=1.5,
label=f"Retained avg: {not_churned['SupportCalls'].mean():.1f}")
ax6.axvline(x=churned['SupportCalls'].mean(),
color=PALETTE['danger'], linestyle='--', linewidth=1.5,
label=f"Churned avg: {churned['SupportCalls'].mean():.1f}")
ax6.set_xlabel('Support Calls per Month', fontsize=10)
ax6.set_ylabel('Density', fontsize=10)
ax6.set_title('Support Calls Distribution', fontsize=13,
pad=10, color='#E8E8F0')
ax6.legend(facecolor='#1A1D2E', edgecolor='#3A3D52',
labelcolor='#E8E8F0', fontsize=8)
plt.savefig('churn_analysis_dashboard.png', dpi=150,
bbox_inches='tight', facecolor='#0F1117')
plt.show()
print(" Dashboard saved → churn_analysis_dashboard.png")
# ─────────────────────────────────────────────
# 5. RETENTION RECOMMENDATIONS ENGINE
# ─────────────────────────────────────────────
def generate_recommendations(df: pd.DataFrame,
importance_df: pd.DataFrame) -> None:
"""
Derives data-driven retention recommendations from the top features
and prints a structured action plan with measurable KPIs.
"""
top3 = importance_df.index[:3].tolist()
# ── Statistical summaries ──────────────────
churned = df[df['Churn'] == 1]
not_churned = df[df['Churn'] == 0]
overall_churn_rate = df['Churn'].mean() * 100
# Per-segment churn rates
early_churn = df[df['Tenure'] <= 12]['Churn'].mean() * 100
hi_charge_churn = df[df['MonthlyCharges'] > 80]['Churn'].mean() * 100
hi_calls_churn = df[df['SupportCalls'] > 8]['Churn'].mean() * 100
mtm_churn = df[df['ContractType'] == 'Month-to-Month']['Churn'].mean() * 100
low_sat_churn = df[df['Satisfaction'] <= 2]['Churn'].mean() * 100
banner = "═" * 60
print(f"\n{banner}")
print(f" ACTIONABLE RETENTION RECOMMENDATIONS")
print(f" Based on top features: {', '.join(top3)}")
print(f"{banner}")
print(f" Overall Churn Rate: {overall_churn_rate:.1f}%")
print(f"{banner}\n")
recs = [
# ── REC 1 ────────────────────────────────
{
"title" : "🚨 PRIORITY 1 — Early-Tenure Onboarding Program",
"insight" : (
f"Customers with tenure ≤ 12 months churn at {early_churn:.1f}% "
f"vs {overall_churn_rate:.1f}% overall — "
f"{early_churn/overall_churn_rate:.1f}× the baseline risk."
),
"actions" : [
"Assign a dedicated Success Manager for first 90 days",
"Trigger automated check-in emails at Day 7, 30, 60, 90",
"Offer a 'Loyalty Bonus' discount at the 6-month milestone",
"Build an interactive onboarding wizard reducing time-to-value",
"Create a peer community / user group for new customers",
],
"kpis" : [
"Target: reduce <12-month churn by 25% within 2 quarters",
"Track: 30/60/90-day activation rates",
"Metric: Net Promoter Score at month 3",
],
"effort" : "Medium | Impact: HIGH",
},
# ── REC 2 ────────────────────────────────
{
"title" : "💰 PRIORITY 2 — Value-Realignment for High-Charge Customers",
"insight" : (
f"Customers paying >$80/month churn at {hi_charge_churn:.1f}%. "
"Price sensitivity compounds with low satisfaction — "
"a toxic combination."
),
"actions" : [
"Proactively audit plans — migrate customers to better-fit tiers",
"Introduce mid-tier 'Best Value' bundle to reduce sticker shock",
"Send personalised 'Value Reports' showing ROI vs spend each quarter",
"Offer loyalty discounts (5-15%) at 12-month anniversaries",
"Create a transparent pricing page with competitor comparison",
],
"kpis" : [
"Target: reduce high-charge segment churn by 20%",
"Track: average revenue per churned account (ARPA)",
"Metric: plan upgrade/downgrade ratio monthly",
],
"effort" : "Medium | Impact: HIGH",
},
# ── REC 3 ────────────────────────────────
{
"title" : "📞 PRIORITY 3 — Support Experience Transformation",
"insight" : (
f"High-support customers (>8 calls/month) churn at "
f"{hi_calls_churn:.1f}%. Each extra call signals "
"recurring product friction and rising frustration."
),
"actions" : [
"Flag accounts with >5 calls/month for proactive outreach",
"Implement root-cause tagging on every support ticket",
"Build a self-service knowledge base targeting top 10 issue types",
"Deploy AI chatbot for tier-1 deflection (target 40% deflection)",
"Introduce 'Support Health Score' — alert CSMs when score drops",
"Offer dedicated support queue for high-value at-risk accounts",
],
"kpis" : [
"Target: reduce repeat-contact rate by 30% in 6 months",
"Track: First-Contact Resolution (FCR) rate",
"Metric: average calls per customer per month",
],
"effort" : "Low-Med | Impact: HIGH",
},
# ── REC 4 ────────────────────────────────
{
"title" : "📋 PRIORITY 4 — Contract Conversion Campaign",
"insight" : (
f"Month-to-Month customers churn at {mtm_churn:.1f}%. "
"Long-term contracts are the single strongest churn barrier."
),
"actions" : [
"Offer 10-20% discount for annual contract commitment",
"Bundle premium features exclusively into longer contracts",
"Run a 'Lock-In Loyalty' campaign targeting MTM customers >6 months",
"Use in-app prompts with social proof ('87% of users on annual plan')",
],
"kpis" : [
"Target: convert 15% of MTM base to annual contracts per quarter",
"Track: Monthly Recurring Revenue (MRR) lock-in rate",
"Metric: contract-length distribution shift",
],
"effort" : "Low | Impact: VERY HIGH",
},
# ── REC 5 ────────────────────────────────
{
"title" : "⭐ PRIORITY 5 — Satisfaction-Driven Rescue Program",
"insight" : (
f"Customers with satisfaction score ≤ 2 churn at "
f"{low_sat_churn:.1f}%. A Rescue Program can recover "
"30-40% of at-risk accounts when triggered early."
),
"actions" : [
"Send CSAT / NPS surveys at key lifecycle events",
"Auto-trigger 'Rescue Workflow' when CSAT drops below 3",
"Empower frontline reps with retention offers (service credit, upgrade)",
"Executive 'Save Call' for strategic accounts scoring ≤ 2",
"Post-resolution follow-up at 30 days to confirm issue resolved",
],
"kpis" : [
"Target: rescue 35% of flagged at-risk accounts",
"Track: NPS trend month-over-month",
"Metric: post-rescue 6-month retention rate",
],
"effort" : "Medium | Impact: HIGH",
},
]
# ── Print each recommendation ──────────────
for rec in recs:
print(f" {rec['title']}")
print(f" {'─'*56}")
print(f" 📊 Insight: {rec['insight']}")
print(f"\n ✅ Actions:")
for a in rec['actions']:
print(f" • {a}")
print(f"\n 📈 KPIs:")
for k in rec['kpis']:
print(f" • {k}")
print(f"\n ⚙️ Effort / Impact: {rec['effort']}")
print()
# ── Churn cost model ───────────────────────
avg_revenue = df['MonthlyCharges'].mean()
avg_ltv = avg_revenue * df['Tenure'].mean()
n_churned = int(df['Churn'].sum())
print(f"{banner}")
print(f" BUSINESS IMPACT SNAPSHOT")
print(f"{banner}")
print(f" Customers churned : {n_churned:,}")
print(f" Avg monthly charge : ${avg_revenue:.2f}")
print(f" Avg customer LTV : ${avg_ltv:,.0f}")
print(f" Est. revenue at risk : ${n_churned * avg_revenue:,.0f} / month")
print(f" 5% retention improvement: ${n_churned * avg_revenue * 0.05:,.0f} saved/month")
print(f"{banner}\n")
# ─────────────────────────────────────────────
# 6. MAIN PIPELINE
# ─────────────────────────────────────────────
def main():
print("\n🔄 Loading data...")
df_raw = generate_churn_dataset() # ← swap in your real DataFrame
print(f" Dataset shape : {df_raw.shape}")
print(f" Churn rate : {df_raw['Churn'].mean()*100:.1f}%\n")
print("🔄 Preprocessing...")
X, y, features, encoders = preprocess(df_raw)
print("🔄 Identifying top predictive features...")
importance_df = identify_top_features(X, y, top_n=3)
    print("🔄 Generating visualisations...")
    plot_all(df_raw, importance_df)
    print("🔄 Generating retention recommendations...")
    generate_recommendations(df_raw, importance_df)
    print("✅ Analysis complete.")

if __name__ == '__main__':
    main()
Detailed Breakdown
When it comes to data analysis, Claude and DeepSeek take meaningfully different approaches — and the right choice depends heavily on your workflow and data sensitivity needs.
Claude's biggest advantage for data analysis is its ability to accept file uploads. You can drop in a CSV, spreadsheet, or PDF report and immediately ask Claude to interpret trends, identify anomalies, or summarize key findings. This makes it practical for analysts who need to work directly with raw data without writing a single line of code. Pair that with Claude's precise instruction-following and you get structured, well-reasoned outputs — whether you're asking for a five-point executive summary or a detailed breakdown of variance across quarters. Claude's extended thinking mode is particularly useful for complex analytical tasks, like identifying causation vs. correlation or stress-testing assumptions in a financial model.
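The causation-versus-correlation point is easy to demonstrate with a small simulation: two metrics driven by a shared hidden confounder correlate strongly even though neither causes the other. The sketch below is illustrative (the variable names are hypothetical and not taken from either model's output).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden confounder: e.g., overall account engagement.
engagement = rng.normal(size=n)

# Two observed metrics, both driven by engagement, not by each other.
support_tickets = -0.8 * engagement + rng.normal(scale=0.5, size=n)
feature_usage = 0.8 * engagement + rng.normal(scale=0.5, size=n)

corr = np.corrcoef(support_tickets, feature_usage)[0, 1]
print(f"Correlation: {corr:.2f}")  # strongly negative, yet there is no causal link
```

A naive read of this correlation would suggest cutting support tickets boosts feature usage; the simulation shows why an analyst (or a model in extended thinking mode) needs to ask what drives both.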
DeepSeek, by contrast, cannot accept file uploads, which is a real limitation for everyday data analysis work. You would need to paste data directly into the prompt — workable for small datasets, but impractical at scale. Where DeepSeek does shine is in math-heavy and statistical reasoning. Its benchmark scores on AIME 2025 (93.1%) are competitive with Claude (95.6%), and its open-source R1 variant is purpose-built for deep reasoning tasks. For analysts comfortable working via API who need to run large volumes of analytical queries at minimal cost, DeepSeek's pricing (~$0.56/1M input tokens vs. Claude's ~$3.00) makes it an attractive backend option.
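The copy-paste workflow can be sketched in a few lines: serialize a representative sample of the dataset to CSV text and embed it in the prompt. The column names, sample size, and prompt wording below are illustrative assumptions, not part of either model's API.

```python
import io

import pandas as pd

# Illustrative dataset; in practice, load your own churn records.
df = pd.DataFrame({
    'Tenure': [3, 45, 12, 60],
    'MonthlyCharges': [89.5, 35.0, 72.1, 29.9],
    'Churn': [1, 0, 1, 0],
})

def build_prompt(df: pd.DataFrame, max_rows: int = 200) -> str:
    """Serialize up to max_rows of the DataFrame to CSV text for pasting
    into a chat prompt (models without file upload need inline data)."""
    buf = io.StringIO()
    df.head(max_rows).to_csv(buf, index=False)
    return (
        "Analyze this customer churn dataset and identify the top "
        "predictive features:\n\n" + buf.getvalue()
    )

prompt = build_prompt(df)
print(prompt)
```

The `max_rows` cap is the practical ceiling of this approach: past a few hundred rows you are burning context window on raw data, which is exactly where file upload support starts to matter.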
For real-world use cases: if you're a business analyst uploading monthly sales reports and asking for narrative summaries and trend identification, Claude is the clear choice. If you're a data scientist building a pipeline that needs to interpret structured query outputs or run statistical reasoning at scale via API, DeepSeek's cost efficiency becomes genuinely compelling — as long as your data isn't sensitive (its China-based hosting raises legitimate concerns for enterprise or regulated-industry data).
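To make the cost gap concrete, here is a back-of-the-envelope calculation using the per-1M-token prices quoted in the table above. The token counts are hypothetical (a pasted dataset plus a long analytical answer); at these assumptions DeepSeek comes out roughly 6x cheaper per query.

```python
# Prices in dollars per 1M tokens (input, output), from the comparison table.
PRICES = {
    'Claude': {'input': 3.00, 'output': 15.00},
    'DeepSeek': {'input': 0.56, 'output': 1.68},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at the listed per-1M-token rates."""
    p = PRICES[model]
    return (input_tokens * p['input'] + output_tokens * p['output']) / 1_000_000

# Hypothetical analytical query: 50k tokens of pasted data and prompt,
# 5k tokens of analysis in the response.
for model in PRICES:
    cost = query_cost(model, input_tokens=50_000, output_tokens=5_000)
    print(f"{model:>8}: ${cost:.4f} per query, ${cost * 1_000:.2f} per 1,000 queries")
```

At scale the difference compounds: a batch of 1,000 such queries costs about $225 on Claude versus about $36 on DeepSeek under these assumptions, which is why API-first pipelines weigh this trade-off so heavily.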
Privacy is a non-trivial factor. Claude, operated by Anthropic with US-based infrastructure and strong enterprise data policies, is the safer choice for confidential or proprietary datasets. DeepSeek's hosting raises compliance red flags for healthcare, legal, and financial use cases.
Recommendation: For most data analysts, Claude wins outright — file upload support alone is decisive, and its analytical writing quality is excellent. DeepSeek is worth considering only if you're working API-first, operating with non-sensitive data, and need to optimize aggressively for cost.
Try data analysis tasks with Claude and DeepSeek
Compare in Multichat for free. Join 10,000+ professionals who use Multichat.