Generate Dataframe Summaries With Python
How much time do you spend with making summaries of dataset? Too much and I don’t like doing it so I search to do it with the AI. So this is my sperimentation with some medical data see at PyDataVe 22nd event and Mistral model.
The code for the inizializzation
For start I need to install some dipendency
langchain>=0.3.27
langchain-ollama>=0.3.7
pandas>=2.3.2
import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal
df = pd.read_csv("data/test.csv")
print("-*-" * 20)
print(f"Dataset shape: {df.shape}")
print("-*-" * 20)
print("Missing value stats:")
print(df.isnull().sum())
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Dataset shape: (418, 20)
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
Missing value stats:
ID 0
N_Days 0
Status 0
Drug 106
Age 0
Sex 0
Ascites 106
Hepatomegaly 106
Spiders 106
Edema 0
Bilirubin 0
Cholesterol 134
Albumin 0
Copper 108
Alk_Phos 106
SGOT 106
Tryglicerides 136
Platelets 11
Prothrombin 2
Stage 6
dtype: int64
This is a section of the dataset and what is missing value of the stats.
Now we will start with the AI. In my case I user Ollama with Mistral model. I install the model with
ollama run mistral
And prepare the code for use the model. First you need to make a connection with the local LLM instance. This code use Mistral but you can pass any local LLM instance you have.
def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
"""
Create and configure a ChatOllama instance for local LLM inference.
This function initializes a ChatOllama client configured to connect to a
local Ollama server. The client is set up with deterministic output
(temperature=0) for consistent responses across multiple calls with the
same input.
Parameters
----------
model_name : str, optional
The name of the Ollama model to use for chat completions.
Must be a valid model name that is available on the local Ollama
installation. Default is "mistral:latest".
Returns
-------
ChatOllama
A configured ChatOllama instance ready for chat completions.
"""
return ChatOllama(
model=model_name, base_url="http://localhost:11434", temperature=0
)
If you want to test the connection you can use this command
print(get_llm().invoke("test").content)
Hallo! Wie geht es Ihnen? Ich bin hier, um Ihnen zu helfen. Was möchten Sie heute tun?
Ich kann Ihnen beispielsweise helfen:
* Fragen beantworten
* Informationen suchen
* Aufgaben lösen
* und vieles mehr!
Welche Aufgabe haben wir heute vor uns?
Make a context
Now we need to generate a context for the LLM. If you do this function with all the necessary data you can relaunch this script every time you need a new README/summary of the dataset. This is better to be a dataset with a fixed schema and a date which change every year like medical data (this), monthly sell report, census data…
def get_summary_context_message(df: pd.DataFrame, dataset_name:str) -> str:
# Basic application statistics
total_analisys = len(df)
# Gender distribution
gender_counts = df["Sex"].value_counts()
male_count = gender_counts.get("M", 0)
female_count = gender_counts.get("F", 0)
# Stage Statistics
stage_data = df["Stage"].dropna()
stage_avg = stage_data.mean()
stage_25th = stage_data.quantile(0.25)
stage_50th = stage_data.quantile(0.50)
stage_75th = stage_data.quantile(0.75)
# NDays Statistics
days_data = df["N_Days"].dropna()
days_avg = days_data.mean()
days_25th = days_data.quantile(0.25)
days_50th = days_data.quantile(0.50)
days_75th = days_data.quantile(0.75)
def status_category(exp):
if pd.isna(exp):
return "Unkown"
elif exp == "C":
return "Censored"
elif exp == "CL":
return "Censored due to Lever tx"
elif exp == "D":
return "Death"
else:
return "Unkow"
df['Status Str']= df['Status'].apply(status_category)
status_str_stats = []
for category in ["Censored", "Censored due to Lever tx", "Death",]:
category_data = df[df["Status Str"] == category]
if len(category_data) > 0:
male = len(category_data[category_data["Sex"] == "M"])
female = len(category_data[category_data["Sex"] == "F"])
total = len(category_data)
rate_m = (male / total) * 100
rate_f = (female / total) * 100
status_str_stats.append((category, male, female, total, rate_m, rate_f))
summary =f"""{dataset_name}
Total Analisys: {total_analisys:,}
Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_analisys*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_analisys*100:.1f}%)
Stage Statistics:
- Average Stage: {stage_avg:.2f}
- 25th percentile: {stage_25th:.2f}
- 50th percentile (median): {stage_50th:.2f}
- 75th percentile: {stage_75th:.2f}
N Day Statistics:
- N Days Stage: {days_avg:.2f}
- 25th percentile: {days_25th:.2f}
- 50th percentile (median): {days_50th:.2f}
- 75th percentile: {days_75th:.2f}
"""
summary += "\n\nStatus Rates by Sex:"
for category, male, female, total, rate_m, rate_f in status_str_stats:
summary += (
f"\n- {category}: {male}/{total} Male ({rate_m:.1f}% rate)"+
f"\n- {category}: {female}/{total} Female ({rate_f:.1f}% rate)"
)
return summary
Make a report
After checking all you need to have a template for the repo of the dataset.
SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer.
Your task is to take in complex datasets and return user-friendly descriptions and findings.
You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}
This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:
<context>
{context}
</context>
Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.
Report:
"""
This prompt and a lot of the code of this article are from this post .
After this we need a function that take the dataset df, the prompt SUMMARIZE_DATAFRAME_PROMPT with the needed info and return the content of the report.
def get_report_summary(
dataset: pd.DataFrame,
dataset_name: str,
dataset_source: str,
report_format: Literal["markdown", "html"] = "markdown",
) -> str:
context_message = get_summary_context_message(df=dataset, dataset_name=dataset_name)
prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
dataset_name=dataset_name,
dataset_source=dataset_source,
context=context_message,
report_format=report_format,
)
return get_llm().invoke(input=prompt).content
In our case we launch it as
md_report = get_report_summary(
dataset=df,
dataset_name="Cirrhosis Patient Survival Prediction",
dataset_source="https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction/data"
)
print(md_report)
# Cirrhosis Patient Survival Prediction Analysis Report
## Overview
The dataset analyzed consists of 418 records related to cirrhosis patients, sourced from [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/cirrhosis-patient-survival-prediction/data). The data provides information about the patient's gender, stage of cirrhosis, number of days since diagnosis, and final status (censored or death).
## Demographics
### Gender Distribution
The dataset shows a significant imbalance in gender distribution with 89.5% female applicants (374) and only 10.5% male applicants (44).
## Cirrhosis Stage Statistics
### Average Stage
The average stage of cirrhosis for the analyzed patients is 3.02, indicating a severe level of liver damage.
### Percentiles
- **25th percentile**: The cirrhosis stage is at least 2.00 for 25% of the patients.
- **Median (50th percentile)**: Half of the patients have a cirrhosis stage of 3.00.
- **75th percentile**: For 75% of the patients, the cirrhosis stage is 4.00 or lower.
## N Days Statistics
### N Days Stage
The average number of days since diagnosis for the analyzed patients is 1917.78 days.
### Percentiles
- **25th percentile**: The minimum number of days since diagnosis for 25% of the patients is 1092.75 days.
- **Median (50th percentile)**: Half of the patients have been diagnosed with cirrhosis for at least 1730.00 days.
- **75th percentile**: For 75% of the patients, the number of days since diagnosis is 2613.50 days or less.
## Status Rates by Sex
The following table shows the rates of different statuses (censored due to Lever tx and death) for both male and female applicants:
| | Male Applicants | Female Applicants |
|---------------------|-----------------|-------------------|
| Censored | 17/232 (7.3%) | 215/232 (92.7%) |
| Censored due to Lever tx | 3/25 (12.0%) | 22/25 (88.0%) |
| Death | 24/161 (14.9%) | 137/161 (85.1%) |
The analysis indicates that female applicants are more likely to have their status censored, either due to the lack of information or other factors, while male applicants are more likely to experience death. However, it's important to note that the sample size for male applicants is significantly smaller than that of female applicants.
This post is part of the Data and Data Tools series
- Jupyter Notebook Pelican Combo
- Hugo Blog With Jupyter Notebook
- Generate Dataframe Summaries With Python
Comments
To reply to this post, you can send a Webmention or you can toot me at [email protected]
With an account on the Fediverse or Mastodon, you can respond to this post. Since Mastodon is decentralized, you can use your existing account hosted by another Mastodon server or compatible platform if you don't have an account on this one. Known non-private replies are displayed below.
Learn how this is implemented here.