Datenanalyse Teil I

Datenanalyse Teil I#

Natalie Widmann#

Wintersemester 2024 / 2025

Datenanalyse und -verarbeitung in Python#

Ziele#

Verständnis der Analyseschritte im Datenjournalismus
Eplorative Analyse von strukturierten Daten
Grundkenntnisse bei der Visualisierung von Daten
Kennenlernen von Python Package Pandas

Datenpipeline

Was sind Daten?#

Strukturierte Daten#

Strukturierte Daten sind gut organisiert und so formattiert, dass es einfach ist sie zu durchsuchen, sie maschinell zu lesen oder zu verarbeiten. Das einfachste Beispiel ist eine Tabelle in der jede Spalte eine Kategorie oder einen Wert festlegt.

Unstrukturierte Daten#

Im Gegensatz dazu sind unstrukturierte Daten nicht in einem bestimmten Format oder einer festgelegten Struktur verfügbar. Dazu zählen Texte, Bilder, Social Media Feeds, aber auch Audio Files, etc.

Semi-Strukturierte Daten#

Semi-strukturierte Daten bilden eine Mischform. Beispielsweise eine Tabelle mit E-Mail Daten, in der Empfänger, Betreff, Datum und Absender strukturierte Informationen enthalten, der eigentliche Text jedoch unstrukturiert ist.

Was sind Daten?#

Daten

Pandas#

Pandas ist ein Python Package und ist abgeleitet aus “Python and data analysis”.

Pandas stellt die Grundfunktionalitäten für das Arbeiten mit strukturierten Daten zur Verfügung.

Installation von Python Packages#

Packages die von der Python Community zur Verfügung gestellt werden, müssen vor der Verwendung installiert werden. Dafür können Packagemanager wie pip verwendet werden.

Weitere Tipps für die Installation von Python Packages in Windows, Linux und Mac gibt es hier.

In Jupyter Notebooks können Packages wie folgt installiert werden:

# Install a pip package im Jupyter Notebook
import sys
!pip install pandas
!pip install openpyxl

Requirement already satisfied: pandas in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (2.2.3)
Requirement already satisfied: numpy>=1.23.2 in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.7 in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: openpyxl in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (3.1.5)
Requirement already satisfied: et-xmlfile in /home/natalie-widmann/anaconda3/lib/python3.11/site-packages (from openpyxl) (1.1.0)

import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)

Idee, Daten finden & Verifikation#

Aggregated figures for Natural Disasters in EM-DAT#

Link: https://data.humdata.org/dataset/emdat-country-profiles

In 1988, the Centre for Research on the Epidemiology of Disasters (CRED) launched the Emergency Events Database (EM-DAT). EM-DAT was created with the initial support of the World Health Organisation (WHO) and the Belgian Government.

The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.

Was ist die Geschichte?#

Mögliche Ansätze / Fragen an die Daten#

Steigt die Anzahl an Naturkatastrophen weltweit?
In welchem Jahr gabe es die meisten Naturkatastrophen?
Welche Länder sind am stärksten von Naturkatastrophen betroffen?
Wie ist die Situation in Deutschland?
Welche Länder sind von Naturkatastrophen betroffen haben aber vergleichsweise geringe Todesfälle?
Welche Naturkatastrophen sind am tödlichsten?

Daten einlesen mit Pandas#

siehe auch:

data_url = "https://data.humdata.org/dataset/74163686-a029-4e27-8fbf-c5bfcd13f953/resource/c5ce40d6-07b1-4f36-955a-d6196436ff6b/download/emdat-country-profiles_2024_12_02.xlsx"
data = pd.read_excel(data_url, engine="openpyxl")
data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
0	#date +occurred	#country +name	#country +code	#cause +group	#cause +subgroup	#cause +type	#cause +subtype	#frequency	#affected +ind	#affected +ind +killed	NaN	#value +usd	NaN
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	88473	56.51
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	1	100	28	NaN	NaN	56.51
3	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	56.51
4	2000	Angola	AGO	Natural	Hydrological	Flood	Flood (General)	3	9011	15	NaN	NaN	56.51
...	...	...	...	...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00	NaN	NaN
6142	2024	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	1075	40	NaN	NaN	NaN
6143	2024	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	2	210439	62	NaN	NaN	NaN
6144	2024	Zambia	ZMB	Natural	Climatological	Drought	Drought	1	6600000	NaN	NaN	NaN	NaN
6145	2024	Zimbabwe	ZWE	Natural	Climatological	Drought	Drought	1	7600000	NaN	NaN	NaN	NaN

6146 rows × 13 columns

Datenexploration und -bereinigung#

Überblick über die Daten#

Wie groß ist der Datensatz?
Wie viele Zeilen und wie viele Spalten sind vorhanden?

siehe auch:#

data.shape

(6146, 13)

Die Spaltennamen

data.columns

Index(['Year', 'Country', 'ISO', 'Disaster Group', 'Disaster Subroup',
       'Disaster Type', 'Disaster Subtype', 'Total Events', 'Total Affected',
       'Total Deaths', 'Total Damage (USD, original)',
       'Total Damage (USD, adjusted)', 'CPI'],
      dtype='object')

info() für mehr Infos über die Spalten

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          6146 non-null   object 
 1   Country                       6146 non-null   object 
 2   ISO                           6146 non-null   object 
 3   Disaster Group                6146 non-null   object 
 4   Disaster Subroup              6146 non-null   object 
 5   Disaster Type                 6146 non-null   object 
 6   Disaster Subtype              6146 non-null   object 
 7   Total Events                  6146 non-null   object 
 8   Total Affected                4963 non-null   object 
 9   Total Deaths                  4359 non-null   object 
 10  Total Damage (USD, original)  2095 non-null   float64
 11  Total Damage (USD, adjusted)  2059 non-null   object 
 12  CPI                           5917 non-null   float64
dtypes: float64(2), object(11)
memory usage: 624.3+ KB

describe() zeigt die grundlegenden statistischen Eigenschaften von Spalten mit numerischem Datentyp, also int und float.

Die Methode berechnet:

die Anzahl an fehlenden Werten
Durchschnitt
Standardabweichung
Zahlenrange
Media
0.25 und 0.75 Quartile

data.describe()

	Total Damage (USD, original)	CPI
count	2095.00	5917.00
mean	1716696139.34	74.30
std	8615286125.55	11.91
min	2000.00	56.51
25%	20000000.00	64.09
50%	130000000.00	73.82
75%	801000000.00	82.41
max	210000000000.00	100.00

Data Cleaning#

Erste Zeile im DataFrame entfernen

Möglichkeit: Slicing

data[1]

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
10	2000	Argentina	ARG	Natural	Meteorological	Extreme temperature	Cold wave	1	300	15	NaN	NaN	56.51
11	2000	Argentina	ARG	Natural	Meteorological	Storm	Blizzard/Winter storm	1	NaN	NaN	NaN	NaN	56.51
12	2000	Argentina	ARG	Natural	Meteorological	Storm	Lightning/Thunderstorms	1	430	1	NaN	NaN	56.51
13	2000	Armenia	ARM	Natural	Climatological	Drought	Drought	1	297000	NaN	100000000.00	176946395	56.51
14	2000	Australia	AUS	Natural	Biological	Infestation	Locust infestation	1	NaN	NaN	120000000.00	212335674	56.51

data.index

RangeIndex(start=0, stop=6146, step=1)

data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
0	#date +occurred	#country +name	#country +code	#cause +group	#cause +subgroup	#cause +type	#cause +subtype	#frequency	#affected +ind	#affected +ind +killed	NaN	#value +usd	NaN
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	88473	56.51
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	1	100	28	NaN	NaN	56.51
3	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	56.51
4	2000	Angola	AGO	Natural	Hydrological	Flood	Flood (General)	3	9011	15	NaN	NaN	56.51
...	...	...	...	...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00	NaN	NaN
6142	2024	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	1075	40	NaN	NaN	NaN
6143	2024	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	2	210439	62	NaN	NaN	NaN
6144	2024	Zambia	ZMB	Natural	Climatological	Drought	Drought	1	6600000	NaN	NaN	NaN	NaN
6145	2024	Zimbabwe	ZWE	Natural	Climatological	Drought	Drought	1	7600000	NaN	NaN	NaN	NaN

6146 rows × 13 columns

Möglichkeit: Drop

siehe auch: Drop

data.drop(index=0)

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	88473	56.51
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	1	100	28	NaN	NaN	56.51
3	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	56.51
4	2000	Angola	AGO	Natural	Hydrological	Flood	Flood (General)	3	9011	15	NaN	NaN	56.51
5	2000	Angola	AGO	Natural	Hydrological	Flood	Riverine flood	1	70000	31	10000000.00	17694640	56.51
...	...	...	...	...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00	NaN	NaN
6142	2024	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	1075	40	NaN	NaN	NaN
6143	2024	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	2	210439	62	NaN	NaN	NaN
6144	2024	Zambia	ZMB	Natural	Climatological	Drought	Drought	1	6600000	NaN	NaN	NaN	NaN
6145	2024	Zimbabwe	ZWE	Natural	Climatological	Drought	Drought	1	7600000	NaN	NaN	NaN	NaN

6145 rows × 13 columns

data = data.drop(index=0)
data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	88473	56.51
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	1	100	28	NaN	NaN	56.51
3	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	56.51
4	2000	Angola	AGO	Natural	Hydrological	Flood	Flood (General)	3	9011	15	NaN	NaN	56.51
5	2000	Angola	AGO	Natural	Hydrological	Flood	Riverine flood	1	70000	31	10000000.00	17694640	56.51
...	...	...	...	...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00	NaN	NaN
6142	2024	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	1075	40	NaN	NaN	NaN
6143	2024	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	2	210439	62	NaN	NaN	NaN
6144	2024	Zambia	ZMB	Natural	Climatological	Drought	Drought	1	6600000	NaN	NaN	NaN	NaN
6145	2024	Zimbabwe	ZWE	Natural	Climatological	Drought	Drought	1	7600000	NaN	NaN	NaN	NaN

6145 rows × 13 columns

data.drop(['Year'], axis=1)
#data

	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
1	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	88473	56.51
2	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	1	100	28	NaN	NaN	56.51
3	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	56.51
4	Angola	AGO	Natural	Hydrological	Flood	Flood (General)	3	9011	15	NaN	NaN	56.51
5	Angola	AGO	Natural	Hydrological	Flood	Riverine flood	1	70000	31	10000000.00	17694640	56.51
...	...	...	...	...	...	...	...	...	...	...	...	...
6141	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00	NaN	NaN
6142	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	1075	40	NaN	NaN	NaN
6143	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	2	210439	62	NaN	NaN	NaN
6144	Zambia	ZMB	Natural	Climatological	Drought	Drought	1	6600000	NaN	NaN	NaN	NaN
6145	Zimbabwe	ZWE	Natural	Climatological	Drought	Drought	1	7600000	NaN	NaN	NaN	NaN

6145 rows × 12 columns

Entferne irrelevante Spalten

cols = ['ISO', 'Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
data = data.drop(cols, axis=0, inplace=True)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[27], line 2
      1 cols = ['Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
----> 2 data = data.drop(cols, axis=0, inplace=True)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/frame.py:5581, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5433 def drop(
   5434     self,
   5435     labels: IndexLabel | None = None,
   (...)
   5442     errors: IgnoreRaise = "raise",
   5443 ) -> DataFrame | None:
   5444     """
   5445     Drop specified labels from rows or columns.
   5446 
   (...)
   5579             weight  1.0     0.8
   5580     """
-> 5581     return super().drop(
   5582         labels=labels,
   5583         axis=axis,
   5584         index=index,
   5585         columns=columns,
   5586         level=level,
   5587         inplace=inplace,
   5588         errors=errors,
   5589     )

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/generic.py:4788, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4786 for axis, labels in axes.items():
   4787     if labels is not None:
-> 4788         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4790 if inplace:
   4791     self._update_inplace(obj)

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/generic.py:4830, in NDFrame._drop_axis(self, labels, axis, level, errors, only_slice)
   4828         new_axis = axis.drop(labels, level=level, errors=errors)
   4829     else:
-> 4830         new_axis = axis.drop(labels, errors=errors)
   4831     indexer = axis.get_indexer(new_axis)
   4833 # Case for non-unique axis
   4834 else:

File ~/anaconda3/lib/python3.11/site-packages/pandas/core/indexes/base.py:7070, in Index.drop(self, labels, errors)
   7068 if mask.any():
   7069     if errors != "ignore":
-> 7070         raise KeyError(f"{labels[mask].tolist()} not found in axis")
   7071     indexer = indexer[~mask]
   7072 return self.delete(indexer)

KeyError: "['Disaster Group', 'Total Damage (USD, adjusted)', 'CPI'] not found in axis"

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000	37	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	1	100	28	NaN
3	2000	Algeria	Meteorological	Storm	Storm (General)	1	10	4	NaN
4	2000	Angola	Hydrological	Flood	Flood (General)	3	9011	15	NaN
5	2000	Angola	Hydrological	Flood	Riverine flood	1	70000	31	10000000.00
...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	Meteorological	Storm	Tropical cyclone	3	3747257	354	2000000000.00
6142	2024	Yemen	Hydrological	Flood	Flash flood	1	1075	40	NaN
6143	2024	Yemen	Hydrological	Flood	Flood (General)	2	210439	62	NaN
6144	2024	Zambia	Climatological	Drought	Drought	1	6600000	NaN	NaN
6145	2024	Zimbabwe	Climatological	Drought	Drought	1	7600000	NaN	NaN

6145 rows × 9 columns

Einzelne Spalten auswählen und besser verstehen#

siehe auch:

Pandas Indexing

Auf die Werte einer Spalte kann <dataframe>['<spaltenname>'] zugegriffen werden.

data['Year']

     2000
     2000
     2000
     2000
     2000
        ... 
  2024
  2024
  2024
  2024
  2024
Name: Year, Length: 6145, dtype: object

Mehrere Spalten können über eine Liste ausgewählt werden

data[['Year', 'Country']]

	Year	Country
1	2000	Afghanistan
2	2000	Algeria
3	2000	Algeria
4	2000	Angola
5	2000	Angola
...	...	...
6141	2024	Viet Nam
6142	2024	Yemen
6143	2024	Yemen
6144	2024	Zambia
6145	2024	Zimbabwe

6145 rows × 2 columns

columns = ['Year', 'Country', 'Total Events', 'Total Affected']
data[columns]

	Year	Country	Total Events	Total Affected
1	2000	Afghanistan	1	2580000
2	2000	Algeria	1	100
3	2000	Algeria	1	10
4	2000	Angola	3	9011
5	2000	Angola	1	70000
...	...	...	...	...
6141	2024	Viet Nam	3	3747257
6142	2024	Yemen	1	1075
6143	2024	Yemen	2	210439
6144	2024	Zambia	1	6600000
6145	2024	Zimbabwe	1	7600000

6145 rows × 4 columns

Datentypen abfragen und anpassen#

siehe auch:

data.describe()

	Year	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
count	6145.00	6145.00	4962.00	4358.00	2095.00
mean	2011.90	1.52	950993.11	340.69	1716696139.34
std	7.42	1.28	8392369.40	5265.72	8615286125.55
min	2000.00	1.00	1.00	1.00	2000.00
25%	2005.00	1.00	1003.00	4.00	20000000.00
50%	2012.00	1.00	10000.00	14.00	130000000.00
75%	2019.00	2.00	100552.75	49.00	801000000.00
max	2024.00	17.00	330000000.00	222570.00	210000000000.00

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6145 entries, 1 to 6145
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          6145 non-null   int64  
 1   Country                       6145 non-null   object 
 2   Disaster Subroup              6145 non-null   object 
 3   Disaster Type                 6145 non-null   object 
 4   Disaster Subtype              6145 non-null   object 
 5   Total Events                  6145 non-null   int64  
 6   Total Affected                4962 non-null   float64
 7   Total Deaths                  4358 non-null   float64
 8   Total Damage (USD, original)  2095 non-null   float64
dtypes: float64(3), int64(2), object(4)
memory usage: 432.2+ KB

for x in ['Total Events', 'Total Affected', 'Total Deaths']:
    data[x] = pd.to_numeric(data[x])
    #data["Year"] = pd.to_numeric(data['Year'])

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	1	100.00	28.00	NaN
3	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	NaN
4	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	NaN
5	2000	Angola	Hydrological	Flood	Riverine flood	1	70000.00	31.00	10000000.00
...	...	...	...	...	...	...	...	...	...
6141	2024	Viet Nam	Meteorological	Storm	Tropical cyclone	3	3747257.00	354.00	2000000000.00
6142	2024	Yemen	Hydrological	Flood	Flash flood	1	1075.00	40.00	NaN
6143	2024	Yemen	Hydrological	Flood	Flood (General)	2	210439.00	62.00	NaN
6144	2024	Zambia	Climatological	Drought	Drought	1	6600000.00	NaN	NaN
6145	2024	Zimbabwe	Climatological	Drought	Drought	1	7600000.00	NaN	NaN

6145 rows × 9 columns

data["Year"] = pd.to_numeric(data['Year'])

dtype('O')

data['Year']

     2000
     2000
     2000
     2000
     2000
        ... 
  2024
  2024
  2024
  2024
  2024
Name: Year, Length: 6145, dtype: int64

# Datentyp Abfrage mit dem Attribut
data['Year'].dtype

dtype('O')

# Umwandlung des Datentyp
data["Year"] = pd.to_numeric(data["Year"])
data['Year'].dtype

dtype('int64')

data.columns

# Auf alle integer und float Spalten anwenden
cols = ['Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, original)']
for col in cols:
    data[col] = pd.to_numeric(data[col])

data.info()

data.describe()

Überblick über die numerischen Daten#

data.describe()

	Year	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
count	6145.00	6145.00	4962.00	4358.00	2095.00
mean	2011.90	1.52	950993.11	340.69	1716696139.34
std	7.42	1.28	8392369.40	5265.72	8615286125.55
min	2000.00	1.00	1.00	1.00	2000.00
25%	2005.00	1.00	1003.00	4.00	20000000.00
50%	2012.00	1.00	10000.00	14.00	130000000.00
75%	2019.00	2.00	100552.75	49.00	801000000.00
max	2024.00	17.00	330000000.00	222570.00	210000000000.00

Überblick über die Objekt Daten#

Welche Länder kommen im Datensatz vor?

data['Country'].unique()

array(['Afghanistan', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bangladesh', 'Belarus',
       'Belize', 'Bhutan', 'Bolivia (Plurinational State of)',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Chile', 'China',
       'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia',
       "Democratic People's Republic of Korea", 'Ecuador', 'Egypt',
       'El Salvador', 'Eswatini', 'Ethiopia', 'France', 'French Guiana',
       'Georgia', 'Greece', 'Guatemala', 'Guinea', 'Haiti', 'Honduras',
       'Hungary', 'Iceland', 'India', 'Indonesia',
       'Iran (Islamic Republic of)', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kyrgyzstan',
       "Lao People's Democratic Republic", 'Madagascar', 'Malawi',
       'Malaysia', 'Mali', 'Mexico', 'Mongolia', 'Morocco', 'Mozambique',
       'Namibia', 'Nepal', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
       'North Macedonia', 'Norway', 'Pakistan', 'Panama',
       'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland',
       'Portugal', 'Republic of Korea', 'Republic of Moldova', 'Romania',
       'Russian Federation', 'Réunion', 'Serbia Montenegro', 'Slovakia',
       'Somalia', 'South Africa', 'Spain', 'Sri Lanka', 'Sudan',
       'Switzerland', 'Taiwan (Province of China)', 'Tajikistan',
       'Thailand', 'Turkmenistan', 'Türkiye', 'Uganda', 'Ukraine',
       'United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'Uzbekistan', 'Venezuela (Bolivarian Republic of)',
       'Viet Nam', 'Zambia', 'Zimbabwe', 'Bahamas', 'Burkina Faso',
       'Canary Islands', 'Cayman Islands', 'Central African Republic',
       'Chad', 'Cook Islands', 'Democratic Republic of the Congo',
       'Djibouti', 'Dominican Republic', 'Fiji', 'Gambia', 'Germany',
       'Ghana', 'Latvia', 'Lesotho', 'Lithuania', 'Mauritania', 'Myanmar',
       'Puerto Rico', 'Rwanda', 'Saint Helena', 'Samoa',
       'Syrian Arab Republic', 'Timor-Leste', 'Tonga', 'Vanuatu', 'Yemen',
       'Albania', 'Barbados', 'Belgium', 'Cabo Verde', 'Congo', 'Denmark',
       'Grenada', 'Guam', 'Guinea-Bissau', 'Kenya', 'Lebanon',
       'Mauritius', 'Micronesia (Federated States of)',
       'Netherlands (Kingdom of the)', 'Northern Mariana Islands', 'Oman',
       'Saint Vincent and the Grenadines', 'Saudi Arabia', 'Senegal',
       'Seychelles', 'Solomon Islands', 'Sweden', 'American Samoa',
       'Bermuda', 'China, Hong Kong Special Administrative Region',
       'Comoros', 'Eritrea', 'Luxembourg', 'New Caledonia', 'Slovenia',
       'Tunisia', 'Dominica', 'Guadeloupe', 'Maldives', 'Niue',
       'Saint Lucia', 'Sierra Leone', 'Trinidad and Tobago',
       'Turks and Caicos Islands', 'United States Virgin Islands',
       'Estonia', 'Finland', 'Guyana', 'Tokelau', 'Iraq', 'Montserrat',
       'Suriname', 'Togo', 'Benin', 'Côte d’Ivoire', 'Liberia',
       'Martinique', 'Montenegro', 'Serbia', 'Antigua and Barbuda',
       'Kiribati', 'Marshall Islands', 'Saint Kitts and Nevis',
       'South Sudan', 'Gabon', 'French Polynesia', 'State of Palestine',
       'Tuvalu', 'Palau', 'Wallis and Futuna Islands', 'Libya',
       'Anguilla', 'British Virgin Islands',
       'China, Macao Special Administrative Region', 'Saint Barthélemy',
       'Saint Martin (French Part)', 'Sint Maarten (Dutch part)',
       'United Arab Emirates', 'Kuwait', 'Qatar', 'Isle of Man',
       'Sao Tome and Principe', 'Malta', 'Liechtenstein'], dtype=object)

dir(data['Country'])

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_accum_func',
 '_agg_examples_doc',
 '_agg_see_also_doc',
 '_align_for_op',
 '_align_frame',
 '_align_series',
 '_append',
 '_arith_method',
 '_as_manager',
 '_attrs',
 '_binop',
 '_cacher',
 '_can_hold_na',
 '_check_inplace_and_allows_duplicate_labels',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_cmp_method',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_result',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_expanddim_from_mgr',
 '_constructor_from_mgr',
 '_data',
 '_deprecate_downcast',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_duplicated',
 '_find_valid_index',
 '_flags',
 '_flex_method',
 '_from_mgr',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_cleaned_column_resolvers',
 '_get_index_resolvers',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_rows_with_mask',
 '_get_value',
 '_get_values_tuple',
 '_get_with',
 '_getitem_slice',
 '_gotitem',
 '_hidden_attrs',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_init_dict',
 '_init_mgr',
 '_inplace_method',
 '_internal_names',
 '_internal_names_set',
 '_is_cached',
 '_is_copy',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_view',
 '_is_view_after_cow_rules',
 '_item_cache',
 '_ixs',
 '_logical_func',
 '_logical_method',
 '_map_values',
 '_maybe_update_cacher',
 '_memory_usage',
 '_metadata',
 '_mgr',
 '_min_count_stat_function',
 '_name',
 '_needs_reindex_multi',
 '_pad_or_backfill',
 '_protect_consolidate',
 '_reduce',
 '_references',
 '_reindex_axes',
 '_reindex_indexer',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_rename',
 '_replace_single',
 '_repr_data_resource_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_axis_nocheck',
 '_set_is_copy',
 '_set_labels',
 '_set_name',
 '_set_value',
 '_set_values',
 '_set_with',
 '_set_with_engine',
 '_shift_with_freq',
 '_slice',
 '_stat_function',
 '_stat_function_ddof',
 '_take_with_is_copy',
 '_to_latex_via_styler',
 '_typ',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'array',
 'asfreq',
 'asof',
 'astype',
 'at',
 'at_time',
 'attrs',
 'autocorr',
 'axes',
 'backfill',
 'between',
 'between_time',
 'bfill',
 'bool',
 'case_when',
 'clip',
 'combine',
 'combine_first',
 'compare',
 'convert_dtypes',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'explode',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'flags',
 'floordiv',
 'ge',
 'get',
 'groupby',
 'gt',
 'hasnans',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iloc',
 'index',
 'infer_objects',
 'info',
 'interpolate',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'list',
 'loc',
 'lt',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'name',
 'nbytes',
 'ndim',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pad',
 'pct_change',
 'pipe',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'reindex',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'sem',
 'set_axis',
 'set_flags',
 'shape',
 'shift',
 'size',
 'skew',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'str',
 'struct',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'to_clipboard',
 'to_csv',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_markdown',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']

# Unterschiedliche Länder
countries = data['Country'].unique()
countries

len(countries)

sorted(countries)

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Canary Islands',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'China, Hong Kong Special Administrative Region',
 'China, Macao Special Administrative Region',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Côte d’Ivoire',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Eritrea',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guadeloupe',
 'Guam',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran (Islamic Republic of)',
 'Iraq',
 'Ireland',
 'Isle of Man',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kiribati',
 'Kuwait',
 'Kyrgyzstan',
 "Lao People's Democratic Republic",
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Liechtenstein',
 'Lithuania',
 'Luxembourg',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Marshall Islands',
 'Martinique',
 'Mauritania',
 'Mauritius',
 'Mexico',
 'Micronesia (Federated States of)',
 'Mongolia',
 'Montenegro',
 'Montserrat',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nepal',
 'Netherlands (Kingdom of the)',
 'New Caledonia',
 'New Zealand',
 'Nicaragua',
 'Niger',
 'Nigeria',
 'Niue',
 'North Macedonia',
 'Northern Mariana Islands',
 'Norway',
 'Oman',
 'Pakistan',
 'Palau',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Puerto Rico',
 'Qatar',
 'Republic of Korea',
 'Republic of Moldova',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Réunion',
 'Saint Barthélemy',
 'Saint Helena',
 'Saint Kitts and Nevis',
 'Saint Lucia',
 'Saint Martin (French Part)',
 'Saint Vincent and the Grenadines',
 'Samoa',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Senegal',
 'Serbia',
 'Serbia Montenegro',
 'Seychelles',
 'Sierra Leone',
 'Sint Maarten (Dutch part)',
 'Slovakia',
 'Slovenia',
 'Solomon Islands',
 'Somalia',
 'South Africa',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'State of Palestine',
 'Sudan',
 'Suriname',
 'Sweden',
 'Switzerland',
 'Syrian Arab Republic',
 'Taiwan (Province of China)',
 'Tajikistan',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Tokelau',
 'Tonga',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkmenistan',
 'Turks and Caicos Islands',
 'Tuvalu',
 'Türkiye',
 'Uganda',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland',
 'United Republic of Tanzania',
 'United States Virgin Islands',
 'United States of America',
 'Uruguay',
 'Uzbekistan',
 'Vanuatu',
 'Venezuela (Bolivarian Republic of)',
 'Viet Nam',
 'Wallis and Futuna Islands',
 'Yemen',
 'Zambia',
 'Zimbabwe']

# Vorkommen von Ländern der Liste
'Vietnam' in countries

False

# Vorkommen von Deutschland
for country in countries:
    if 'german' in country.lower():
        print(country)

Was sind die unterschiedlichen Katastrophentypen?#

data.columns

Index(['Year', 'Country', 'Disaster Subroup', 'Disaster Type',
       'Disaster Subtype', 'Total Events', 'Total Affected', 'Total Deaths',
       'Total Damage (USD, original)'],
      dtype='object')

data['Disaster Type'].unique()

['Animal incident',
 'Drought',
 'Earthquake',
 'Extreme temperature',
 'Flood',
 'Glacial lake outburst flood',
 'Impact',
 'Infestation',
 'Mass movement (dry)',
 'Mass movement (wet)',
 'Storm',
 'Volcanic activity',
 'Wildfire']

.value_counts() zeigt wie oft eine Spalte die unterschiedlichen Werte annimmt.

data['Country'].value_counts()

Country
United States of America    247
China                       216
India                       174
Indonesia                   146
Philippines                 121
                           ... 
Tokelau                       1
Niue                          1
Bermuda                       1
Saint Helena                  1
Liechtenstein                 1
Name: count, Length: 217, dtype: int64

data['Disaster Type'].value_counts()

Disaster Type
Flood                          2482
Storm                          1620
Extreme temperature             489
Drought                         403
Earthquake                      400
Mass movement (wet)             346
Wildfire                        246
Volcanic activity               110
Infestation                      29
Mass movement (dry)              14
Glacial lake outburst flood       4
Impact                            1
Animal incident                   1
Name: count, dtype: int64

Mit dem Argument normalize=True wird das Vorkommen der Werte automatisch ins Verhältnis gesetzt.

data['Disaster Type'].value_counts(normalize=True)

Disaster Type
Flood                         0.40
Storm                         0.26
Extreme temperature           0.08
Drought                       0.07
Earthquake                    0.07
Mass movement (wet)           0.06
Wildfire                      0.04
Volcanic activity             0.02
Infestation                   0.00
Mass movement (dry)           0.00
Glacial lake outburst flood   0.00
Impact                        0.00
Animal incident               0.00
Name: proportion, dtype: float64

Dataframes Sortieren#

Dataframes können anhand einer oder meherer Spalten sortiert werden.

Bei welchen Katastrophe waren am meisten Menschen betroffen?
Welche Naturkatastrophen waren am tödlichsten?

data.sort_values(by="Total Affected", ascending=False)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
3752	2015	India	Climatological	Drought	Drought	1	330000000.00	NaN	3000000000.00
647	2002	India	Climatological	Drought	Drought	1	300000000.00	NaN	910722000.00
878	2003	China	Hydrological	Flood	Riverine flood	6	155924986.00	662.00	15329640000.00
2605	2010	China	Hydrological	Flood	Riverine flood	5	140194136.00	1911.00	18171000000.00
1894	2007	China	Hydrological	Flood	Riverine flood	9	108793242.00	967.00	4919155000.00
...	...	...	...	...	...	...	...	...	...
6105	2024	Switzerland	Meteorological	Storm	Extra-tropical storm	1	NaN	6.00	NaN
6114	2024	Türkiye	Climatological	Wildfire	Wildfire (General)	1	NaN	NaN	NaN
6120	2024	United Arab Emirates	Hydrological	Flood	Flash flood	1	NaN	4.00	NaN
6125	2024	United Republic of Tanzania	Hydrological	Mass movement (wet)	Landslide (wet)	1	NaN	22.00	NaN
6130	2024	United States of America	Meteorological	Storm	Blizzard/Winter storm	1	NaN	61.00	3800000000.00

6145 rows × 9 columns

data.sort_values(by="Total Affected", ascending=False).head(10)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
3752	2015	India	Climatological	Drought	Drought	1	330000000.00	NaN	3000000000.00
647	2002	India	Climatological	Drought	Drought	1	300000000.00	NaN	910722000.00
878	2003	China	Hydrological	Flood	Riverine flood	6	155924986.00	662.00	15329640000.00
2605	2010	China	Hydrological	Flood	Riverine flood	5	140194136.00	1911.00	18171000000.00
1894	2007	China	Hydrological	Flood	Riverine flood	9	108793242.00	967.00	4919155000.00
589	2002	China	Meteorological	Storm	Sand/Dust storm	1	100000000.00	NaN	NaN
2863	2011	China	Hydrological	Flood	Riverine flood	5	93360000.00	628.00	10704130000.00
4099	2016	United States of America	Meteorological	Storm	Blizzard/Winter storm	4	85000057.00	90.00	2125000000.00
583	2002	China	Hydrological	Flood	Flash flood	1	80035257.00	793.00	3100000000.00
2152	2008	China	Meteorological	Extreme temperature	Severe winter conditions	2	77000000.00	145.00	21100000000.00

data.sort_values(by="Total Deaths", ascending=False).head(10)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
2661	2010	Haiti	Geophysical	Earthquake	Ground movement	1	3700000.00	222570.00	8000000000.00
1164	2004	Indonesia	Geophysical	Earthquake	Tsunami	1	532898.00	165708.00	4451600000.00
2250	2008	Myanmar	Meteorological	Storm	Tropical cyclone	1	2420000.00	138366.00	4000000000.00
2148	2008	China	Geophysical	Earthquake	Ground movement	7	47369797.00	87564.00	85492000000.00
1497	2005	Pakistan	Geophysical	Earthquake	Ground movement	1	5128309.00	73338.00	5200000000.00
2761	2010	Russian Federation	Meteorological	Extreme temperature	Heat wave	1	NaN	55736.00	400000000.00
5882	2023	Türkiye	Geophysical	Earthquake	Ground movement	3	16107494.00	53007.00	34000000000.00
1270	2004	Sri Lanka	Geophysical	Earthquake	Tsunami	1	1019306.00	35399.00	1316500000.00
945	2003	Iran (Islamic Republic of)	Geophysical	Earthquake	Ground movement	5	297049.00	26797.00	521666000.00
950	2003	Italy	Meteorological	Extreme temperature	Heat wave	1	NaN	20089.00	4400000000.00

# Mehrere Argumente zum Sortieren sind möglich
data.sort_values(by=["Disaster Subroup", "Total Affected"], ascending=[False, True]).head(n=10)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1488	2005	Netherlands (Kingdom of the)	Meteorological	Storm	Extra-tropical storm	1	1.00	NaN	NaN
3238	2012	United States of America	Meteorological	Storm	Blizzard/Winter storm	3	1.00	27.00	202000000.00
3502	2014	Germany	Meteorological	Storm	Lightning/Thunderstorms	2	1.00	8.00	400000000.00
5029	2020	Taiwan (Province of China)	Meteorological	Storm	Tropical cyclone	1	1.00	1.00	NaN
5988	2024	France	Meteorological	Storm	Extra-tropical storm	1	1.00	8.00	NaN
225	2000	Spain	Meteorological	Storm	Storm (General)	2	2.00	14.00	NaN
784	2002	Switzerland	Meteorological	Storm	Storm (General)	1	2.00	1.00	NaN
1419	2005	Germany	Meteorological	Storm	Extra-tropical storm	1	2.00	2.00	270000000.00
1530	2005	Russian Federation	Meteorological	Storm	Extra-tropical storm	1	2.00	NaN	NaN
1861	2007	Belgium	Meteorological	Storm	Extra-tropical storm	1	2.00	2.00	450000000.00

Analyse einzelner Aspekte durch Filtern von Daten#

Wie viele Naturkatastrophen gab es in Deutschland seit 2000?
Welchen Anteil haben Stürme?
Wie viele Menschen sind jedes Jahr betroffen?

Um diese Fragen zu beantworten filtern wir die Daten auf Deutschland und berechnen Statistiken.

Siehe auch:

data['Country']

     Afghanistan
         Algeria
         Algeria
          Angola
          Angola
           ...     
     Viet Nam
        Yemen
        Yemen
       Zambia
     Zimbabwe
Name: Country, Length: 6145, dtype: object

data[data['Country'] == 'Germany']

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
365	2001	Germany	Meteorological	Storm	Lightning/Thunderstorms	1	NaN	6.00	300000000.00
626	2002	Germany	Hydrological	Flood	Flood (General)	1	330108.00	27.00	11600000000.00
627	2002	Germany	Meteorological	Storm	Extra-tropical storm	1	NaN	11.00	1800000000.00
628	2002	Germany	Meteorological	Storm	Storm (General)	2	19.00	11.00	250000000.00
915	2003	Germany	Meteorological	Extreme temperature	Heat wave	1	NaN	9355.00	1650000000.00
916	2003	Germany	Meteorological	Storm	Extra-tropical storm	1	NaN	5.00	300000000.00
917	2003	Germany	Meteorological	Storm	Lightning/Thunderstorms	1	NaN	10.00	NaN
1148	2004	Germany	Geophysical	Earthquake	Ground movement	1	150.00	NaN	12000000.00
1149	2004	Germany	Meteorological	Storm	Storm (General)	1	NaN	2.00	130000000.00
1417	2005	Germany	Hydrological	Flood	Riverine flood	2	450.00	1.00	220000000.00
1418	2005	Germany	Meteorological	Extreme temperature	Cold wave	1	165.00	1.00	300000000.00
1419	2005	Germany	Meteorological	Storm	Extra-tropical storm	1	2.00	2.00	270000000.00
1677	2006	Germany	Hydrological	Flood	Riverine flood	1	1000.00	NaN	NaN
1678	2006	Germany	Meteorological	Extreme temperature	Heat wave	1	NaN	2.00	NaN
1679	2006	Germany	Meteorological	Extreme temperature	Severe winter conditions	1	NaN	10.00	NaN
1680	2006	Germany	Meteorological	Storm	Hail	1	100.00	1.00	NaN
1681	2006	Germany	Meteorological	Storm	Storm (General)	2	200.00	10.00	NaN
1931	2007	Germany	Hydrological	Flood	Riverine flood	1	NaN	1.00	NaN
1932	2007	Germany	Meteorological	Storm	Blizzard/Winter storm	1	NaN	7.00	NaN
1933	2007	Germany	Meteorological	Storm	Extra-tropical storm	1	130.00	11.00	5500000000.00
2187	2008	Germany	Meteorological	Storm	Extra-tropical storm	1	NaN	5.00	1200000000.00
2188	2008	Germany	Meteorological	Storm	Severe weather	1	NaN	3.00	1500000000.00
2415	2009	Germany	Hydrological	Flood	Riverine flood	1	NaN	NaN	20000000.00
2416	2009	Germany	Meteorological	Extreme temperature	Cold wave	2	NaN	15.00	NaN
2417	2009	Germany	Meteorological	Storm	Lightning/Thunderstorms	1	NaN	1.00	50000000.00
2647	2010	Germany	Hydrological	Flood	Flash flood	1	NaN	3.00	NaN
2648	2010	Germany	Meteorological	Extreme temperature	Cold wave	1	NaN	1.00	NaN
2649	2010	Germany	Meteorological	Storm	Blizzard/Winter storm	1	NaN	NaN	NaN
2650	2010	Germany	Meteorological	Storm	Extra-tropical storm	1	NaN	4.00	1000000000.00
3103	2012	Germany	Meteorological	Extreme temperature	Cold wave	2	NaN	6.00	NaN
3309	2013	Germany	Hydrological	Flood	Riverine flood	1	6350.00	4.00	12900000000.00
3310	2013	Germany	Meteorological	Storm	Extra-tropical storm	2	2.00	7.00	NaN
3311	2013	Germany	Meteorological	Storm	Hail	1	NaN	NaN	4800000000.00
3502	2014	Germany	Meteorological	Storm	Lightning/Thunderstorms	2	1.00	8.00	400000000.00
3960	2016	Germany	Hydrological	Flood	Flood (General)	1	NaN	7.00	2000000000.00
4192	2017	Germany	Hydrological	Flood	Riverine flood	1	600.00	NaN	NaN
4193	2017	Germany	Meteorological	Storm	Hail	1	NaN	2.00	740000000.00
4194	2017	Germany	Meteorological	Storm	Severe weather	1	24.00	3.00	159000000.00
4421	2018	Germany	Meteorological	Extreme temperature	Heat wave	1	NaN	NaN	NaN
4422	2018	Germany	Meteorological	Storm	Extra-tropical storm	1	12.00	5.00	588475000.00
4667	2019	Germany	Meteorological	Extreme temperature	Heat wave	2	NaN	4.00	NaN
4668	2019	Germany	Meteorological	Storm	Blizzard/Winter storm	1	NaN	1.00	NaN
4907	2020	Germany	Meteorological	Storm	Extra-tropical storm	1	33.00	NaN	NaN
5167	2021	Germany	Hydrological	Flood	Flood (General)	1	1000.00	226.00	40000000000.00
5168	2021	Germany	Meteorological	Storm	Lightning/Thunderstorms	1	600.00	NaN	NaN
5169	2021	Germany	Meteorological	Storm	Storm (General)	1	4.00	1.00	NaN
5434	2022	Germany	Meteorological	Extreme temperature	Heat wave	1	NaN	8173.00	NaN
5435	2022	Germany	Meteorological	Storm	Extra-tropical storm	3	2.00	7.00	1023156000.00
5716	2023	Germany	Meteorological	Extreme temperature	Heat wave	1	NaN	6376.00	NaN
5717	2023	Germany	Meteorological	Storm	Blizzard/Winter storm	1	NaN	2.00	NaN
5718	2023	Germany	Meteorological	Storm	Extra-tropical storm	1	NaN	1.00	NaN
5719	2023	Germany	Meteorological	Storm	Severe weather	1	NaN	1.00	NaN
5990	2024	Germany	Hydrological	Flood	Flood (General)	1	3407.00	12.00	5400000000.00

data[data['Country'] == 'Germany']

german_data = data[data['Country'] == 'Germany']
german_data.head(5)

Aufgaben#

Wie viele Naturkatastrophen gab es in Deutschland seit 2000?

Wann und was waren die schlimmsten Naturkatastrophen in Deutschland?

Wie viele Menschen waren insgesamt in Deutschland von Naturkatastrophen betroffen?

Wie viele Menschen sind 2024 in Deutschland bei Naturkatastrophen ums Leben gekommen?

Bei welchen Katastrophen in Deutschland starben mehr als 15 Personen?

Wie oft kommen die einzelnen Naturkatastrophentypen in Deutschland vor?

Groupby#

Die groupby-Funktion in Pandas wird verwendet, um Daten in Gruppen basierend auf einem oder mehreren Spaltenwerten zu organisieren. Auf diesen Gruppen können dann weitere Funktionen wie Berechnungen oder Transformationen angewendet werden.

Der Ablauf#

Gruppieren: Daten werden nach bestimmten Spaltenwerten gruppiert.
Anwenden: Auf jede Gruppe wird eine Operation (z. B. sum, mean, count) angewendet.
Kombinieren: Die Ergebnisse werden in einem neuen DataFrame oder Series zusammengefasst.

Wie viele Menschen waren je Naturkatastrophentyp in Deutschland betroffen?

german_data.groupby(['Disaster Type'])['Total Affected'].sum()

Wie viele Menschen sind pro Jahr von Naturkatastrophen betroffen?

german_data.groupby(['Year'])['Total Affected'].sum()

Wie viele Menschen sind in Deutschland pro Jahr und Katastrophentyp betroffen?

german_data.groupby(['Year', 'Disaster Type'])['Total Deaths'].sum()

Datenanalyse Teil I

Contents

Datenanalyse Teil I#

Natalie Widmann#

Datenanalyse und -verarbeitung in Python#

Ziele#

Was sind Daten?#

Strukturierte Daten#

Unstrukturierte Daten#

Semi-Strukturierte Daten#

Was sind Daten?#

Pandas#

Installation von Python Packages#

Idee, Daten finden & Verifikation#

Aggregated figures for Natural Disasters in EM-DAT#

Was ist die Geschichte?#

Mögliche Ansätze / Fragen an die Daten#

Daten einlesen mit Pandas#

Datenexploration und -bereinigung#

Überblick über die Daten#

siehe auch:#

Data Cleaning#

Einzelne Spalten auswählen und besser verstehen#

Datentypen abfragen und anpassen#

Überblick über die numerischen Daten#

Überblick über die Objekt Daten#

Was sind die unterschiedlichen Katastrophentypen?#

Dataframes Sortieren#

Analyse einzelner Aspekte durch Filtern von Daten#

Aufgaben#

Groupby#

Der Ablauf#