{
"cells": [
{
"cell_type": "markdown",
"id": "939e1de3-3987-422e-8a5b-c5000905de11",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"vscode": {
"languageId": "plaintext"
}
},
"source": [
"# Datenanalyse Teil I\n",
"\n",
"\n",
"### Natalie Widmann\n",
"\n",
"\n",
"\n",
"\n",
"Wintersemester 2023 / 2024\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "a8d16a0f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Datenanalyse und -verarbeitung in Python\n",
"\n",
"\n",
"### Ziele\n",
"\n",
"- Verständnis der Datenverarbeitung\n",
"- strukturierte Daten bearbeiten und analysieren mit Pandas\n",
"- Visualisierung von Daten\n",
"- Python Packages verwenden\n",
"- unterschiedliche Datenformate (csv, json, excel, txt) einlesen und speichern"
]
},
{
"cell_type": "markdown",
"id": "a072b5a7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "ecac78fb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Was sind Daten?\n",
"\n",
"\n",
"### Strukturierte Daten\n",
"\n",
"Strukturierte Daten sind gut organisiert und so formattiert, dass es einfach ist sie zu durchsuchen, sie maschinell zu lesen oder zu verarbeiten. Das einfachste Beispiel ist eine Tabelle in der jede Spalte eine Kategorie oder einen Wert festlegt. \n",
"\n",
"\n",
"### Unstrukturierte Daten\n",
"\n",
"Im Gegensatz dazu sind unstrukturierte Daten nicht in einem bestimmten Format oder einer festgelegten Struktur verfügbar. Dazu zählen Texte, Bilder, Social Media Feeds, aber auch Audio Files, etc.\n",
"\n",
"\n",
"### Semi-Strukturierte Daten\n",
"\n",
"Semi-strukturierte Daten bilden eine Mischform. Beispielsweise eine Tabelle mit E-Mail Daten, in der Empfänger, Betreff, Datum und Absender strukturierte Informationen enthalten, der eigentliche Text jedoch unstrukturiert ist. "
]
},
{
"cell_type": "markdown",
"id": "218a0599",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Was sind Daten?\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "d32aa6b1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pandas\n",
"\n",
"\n",
"[Pandas](https://pandas.pydata.org/) ist ein Python Package und ist abgeleitet aus \"Python and data analysis\".\n",
"\n",
"Pandas stellt die Grundfunktionalitäten für das Arbeiten mit strukturierten Daten zur Verfügung.\n",
"\n",
"\n",
"\n",
"\n",
"Photo by Stone Wang on Unsplash\n",
" "
]
},
{
"cell_type": "markdown",
"id": "926ab387",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Python Packages\n",
"\n",
"Packages, auch Module genannt, sind vorgefertigte Code-Pakete, deren Funktionen wir wir verwenden können ohne diese selbst programmieren zu müssen.\n",
"\n",
"Manche Packages sind in Python vorinstalliert und müssen nur noch importiert werden, wie beispielsweise\n",
"- [math](https://docs.python.org/3/library/math.html)\n",
"- [random](https://docs.python.org/3/library/random.html)\n",
"- [datetime](https://docs.python.org/3/library/datetime.html)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "376c9a71",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Zufälliger Integer Wert mit random package\n",
"import random\n",
"random.randint(10,20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7574c793",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# das heutige Datum über datetime ausgeben lassen\n",
"import datetime\n",
"datetime.datetime.today()"
]
},
{
"cell_type": "markdown",
"id": "763c00c7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Installation von Python Packages\n",
"\n",
"Packages die von der Python Community zur Verfügung gestellt werden, müssen vor der Verwendung installiert werden. Dafür kann [`pip`](https://pypi.org/project/pip/) als Packagemanager verwendet werden.\n",
"\n",
"Tipps für die Installation von Python Packages in Windows, Linux und Mac gibt es hier: https://packaging.python.org/en/latest/tutorials/installing-packages/\n",
"\n",
"In Jupyter Notebooks können Packages wie folgt installiert werden:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75c9c32d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Install a pip package im Jupyter Notebook\n",
"import sys\n",
"!pip install pandas\n",
"!pip install openpyxl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b097f6e1",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas"
]
},
{
"cell_type": "markdown",
"id": "8cbd6538",
"metadata": {},
"source": [
"## Idee, Daten finden & Verifikation\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "78518cf8",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Aggregated figures for Natural Disasters in EM-DAT\n",
"\n",
"Link: https://data.humdata.org/dataset/emdat-country-profiles\n",
"\n",
"\n",
"In 1988, the **Centre for Research on the Epidemiology of Disasters (CRED)** launched the **Emergency Events Database (EM-DAT)**. EM-DAT was created with the initial support of the **World Health Organisation (WHO) and the Belgian Government**.\n",
"\n",
"The main objective of the database is to **serve the purposes of humanitarian action at national and international levels**. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.\n",
"\n",
"EM-DAT contains essential core data on the **occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day**. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d6d4252",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"pd.set_option('display.float_format', '{:.2f}'.format)\n",
"\n",
"data = pd.read_excel('../../data/emdat.xlsx', engine=\"openpyxl\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37d9cce3",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data"
]
},
{
"cell_type": "markdown",
"id": "7f40d600",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Datenexploration und -bereinigung\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "cc9f683e",
"metadata": {},
"source": [
"### Überblick über die Daten"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b97fdf8c",
"metadata": {},
"outputs": [],
"source": [
"# head() gibt die ersten 5 Zeilen aus\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "7dc49149",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Wie groß ist der Datensatz? Wie viele Zeilen und wie viele Spalten sind vorhanden?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf897d90",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3012922",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"print(f'Anzahl an Zeilen: {data.shape[0]}')\n",
"print(f'Anzahl an Spalten: {data.shape[1]}')"
]
},
{
"cell_type": "markdown",
"id": "5d881c06",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Die Spaltennamen"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25be8ad",
"metadata": {},
"outputs": [],
"source": [
"print(data.columns)"
]
},
{
"cell_type": "markdown",
"id": "1524e2ba",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`info()` für mehr Infos über die Spalten"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcc799e7",
"metadata": {},
"outputs": [],
"source": [
"data.info()"
]
},
{
"cell_type": "markdown",
"id": "247febdd",
"metadata": {},
"source": [
"`describe()` zeigt die grundlegenden statistischen Eigenschaften von Spalten mit numerischem Datentyp, also `int` und `float`. \n",
"\n",
"Die Methode berechnet:\n",
"- die Anzahl an fehlenden Werten\n",
"- Durchschnitt\n",
"- Standardabweichung\n",
"- Zahlenrange\n",
"- Media\n",
"- 0.25 und 0.75 Quartile"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ac15f1c",
"metadata": {},
"outputs": [],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"id": "2097ee73",
"metadata": {},
"source": [
"`.unique()` zeigt die unterschiedlichen Werte einer Spalte an"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef64293a",
"metadata": {},
"outputs": [],
"source": [
"data['Year'].unique()"
]
},
{
"cell_type": "markdown",
"id": "97f571f5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data Cleaning: erste Zeile im DataFrame entfernen"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0d644aa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c52dd67",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.drop(index=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1e953ad",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "849f9bf0-4cb3-4fce-b9ca-8478bc65efb8",
"metadata": {},
"outputs": [],
"source": [
"data = data.drop(index=0)\n",
"# data.drop(index=0, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "3a37d84b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Datentypen abfragen und anpassen"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a818728",
"metadata": {
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e99a323",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Datentyp Abfrage mit dem Attribut\n",
"data['Year'].dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b8fa26b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Umwandlung des Datentyp\n",
"data[\"Year\"] = pd.to_numeric(data[\"Year\"])\n",
"data['Year'].dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eeaf4901",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Auf alle integer und float Spalten anwenden\n",
"cols = ['Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, adjusted)']\n",
"for col in cols:\n",
" data[col] = pd.to_numeric(data[col])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "959fc091",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.info()"
]
},
{
"cell_type": "markdown",
"id": "87c3f21f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Überblick über die numerischen Daten"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08a33f19",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.describe()"
]
},
{
"cell_type": "markdown",
"id": "c7a2d7ed",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Überblick über die Objekt Daten"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d34411d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Unterschiedliche Länder\n",
"countries = data['Country'].unique()\n",
"countries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6142f9d2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"len(countries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71987dce",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Vorkommen von Ländern der Liste\n",
"'Germany' in countries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72add74d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Vorkommen von Deutschland\n",
"for country in countries:\n",
" if 'german' in country.lower():\n",
" print(country)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4c4f414",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
" data['Disaster Group'].unique()"
]
},
{
"cell_type": "markdown",
"id": "4f3a83af",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`.value_counts()` zeigt wie oft eine Spalte die unterschiedlichen Werte annimmt."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "38cc799e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data['Disaster Subroup'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "4fa0ce6c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mit dem Argument `normalize=True` wird das Vorkommen der Werte automatisch ins Verhältnis gesetzt."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d14a43bc",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data['Disaster Subroup'].value_counts(normalize=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a19b4cc",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data['Disaster Type'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b3170f7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data['Disaster Type'].value_counts(normalize=True)"
]
},
{
"cell_type": "markdown",
"id": "20ca1ba6",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"### Dataframes Sortieren\n",
"\n",
"Dataframes können anhand einer oder meherer Spalten sortiert werden.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "becc18ac",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data.sort_values(by=\"Total Affected\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d901776c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# 10 schlimmsten Naturkatastrophen\n",
"data.sort_values(by=\"Total Affected\", ascending=False).head(n=10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00f3550e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Mehrere Argumente zum Sortieren sind möglich\n",
"data.sort_values(by=[\"Disaster Type\", \"Total Affected\"], ascending=[True, False]).head(n=10)"
]
},
{
"cell_type": "markdown",
"id": "ffe6c7d9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Indexing and Retriving Data\n",
"\n",
"Auf die Werte einer Spalte kann `['']` zugegriffen werden."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8faa0b8c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data['Year']"
]
},
{
"cell_type": "markdown",
"id": "ea41eb37",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Darauf können weitere Operationen oder Methoden angewendet werden:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "340c4aba",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data['Year'] + 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68816f29",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data['Year'].mean()"
]
},
{
"cell_type": "markdown",
"id": "0f3ac82c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Mehrere Spalten werden ausgewählt indem eine Liste von Spaltennamen übergeben wird"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b89871ff",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data[['Year', 'Country', 'Disaster Type', 'Total Affected']]"
]
},
{
"cell_type": "markdown",
"id": "82727efa",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Boolean Indexing\n",
"\n",
"Die ausgewählten Daten können auch gefilteret werden, in dem eine Bedingung mitgegeben wird.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab4cdd8f",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"data[data['Country'] == 'Germany']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89a63f73",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data[data['Total Deaths'] >= 1000]"
]
},
{
"cell_type": "markdown",
"id": "c9996b17",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Wie viele Menschen sind im Schnitt pro Erdbeben betroffen?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c08482ee",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data[data['Disaster Type'] == 'Earthquake']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a97b2d1f",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data[data['Disaster Type'] == 'Earthquake']['Total Affected']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eaf4545e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"data[data['Disaster Type'] == 'Earthquake']['Total Affected'].mean()"
]
},
{
"cell_type": "markdown",
"id": "f971c41c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Weitere Recherchefragen\n",
"\n",
"- Wie viele Naturkatastrophen gab es in Deutschland?\n",
"- In welchem Jahr gabe es die meisten Naturkatastrophen?\n",
"- Welche Länder sind am stärksten von Naturkatastrophen betroffen?\n",
"- Welche Länder sind von Naturkatastrophen betroffen haben aber vergleichsweise geringe Todesfälle?\n",
"- Welche Naturkatastrophen sind am tödlichsten?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc62bc04-c8bf-418b-8a55-5fe5a504bb6f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}