# Data Analysis Part I


### Natalie Widmann




Winter semester 2024 / 2025




## Data Analysis and Processing in Python


### Goals

- Understand the analysis steps in data journalism
- Exploratory analysis of structured data
- Basic knowledge of data visualization
- Get to know the Python package Pandas



![Data pipeline](../../imgs/datapipeline.png)


## What Is Data?


### Structured Data

Structured data is well organized and formatted so that it is easy to search, to read by machine, and to process. The simplest example is a table in which each column holds one category or value.


### Unstructured Data

In contrast, unstructured data does not come in a particular format or fixed structure. Examples include text, images, social media feeds, and audio files.


### Semi-Structured Data

Semi-structured data is a mixed form. One example is a table of e-mail data in which recipient, subject, date, and sender are structured fields, while the message text itself is unstructured.

## What Is Data?

![Data](../../imgs/data.png)


## Pandas


[Pandas](https://pandas.pydata.org/) is a Python package. Its name is derived from "panel data" and is also a play on "Python data analysis".

Pandas provides the basic functionality for working with structured data.
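
As a small illustration of what "structured" means in pandas, the sketch below builds a `DataFrame` from invented e-mail records. The columns `sender`, `subject`, and `date` are structured fields, while `body` holds free text. All names and values are made up for this example.

```python
import pandas as pd

# Invented e-mail records, for illustration only.
emails = pd.DataFrame(
    {
        "sender": ["anna@example.org", "ben@example.org"],
        "subject": ["Flood data", "Interview request"],
        "date": pd.to_datetime(["2024-11-04", "2024-11-05"]),
        "body": [
            "Attached are the latest figures.",
            "Are you free for a short call?",
        ],
    }
)

# Structured columns can be filtered directly.
from_anna = emails[emails["sender"] == "anna@example.org"]

# Free text needs extra processing, for example a keyword search.
mentions_figures = emails["body"].str.contains("figures")

print(emails.shape)
print(from_anna["subject"].tolist())
print(mentions_figures.tolist())
```

The same table therefore mixes both kinds of data: the first three columns behave like a classic spreadsheet, the last one only yields information after text processing.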

 

#### Installing Python Packages

Packages provided by the Python community must be installed before they can be used.
Package managers such as [`pip`](https://pypi.org/project/pip/) can be used for this.

Further tips on installing Python packages on Windows, Linux, and macOS are available [here](https://packaging.python.org/en/latest/tutorials/installing-packages/).

In Jupyter notebooks, packages can be installed as follows:

In [1]:
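# Install pip packages from within the Jupyter notebook,
# using the Python interpreter that runs this kernel
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install openpyxl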



In [2]:
import pandas as pd

pd.set_option('display.float_format', '{:.2f}'.format)

## Idea, Finding Data & Verification


### Aggregated figures for Natural Disasters in EM-DAT

Link: https://data.humdata.org/dataset/emdat-country-profiles


In 1988, the **Centre for Research on the Epidemiology of Disasters (CRED)** launched the **Emergency Events Database (EM-DAT)**. EM-DAT was created with the initial support of the **World Health Organisation (WHO) and the Belgian Government**.

The main objective of the database is to **serve the purposes of humanitarian action at national and international levels**. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the **occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day**. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.



### What Is the Story?

#### Possible Angles / Questions to Ask of the Data

- Is the number of natural disasters increasing worldwide?
- In which year were there the most natural disasters?
- Which countries are most affected by natural disasters?
- What is the situation in Germany?
- Which countries are affected by natural disasters but have comparatively few deaths?
- Which natural disasters are the deadliest?

### Reading Data with Pandas

See also:
- [Doku - 02 Read & Write](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html)
- [IO tools](https://pandas.pydata.org/docs/user_guide/io.html)


In [3]:
data_url = "https://data.humdata.org/dataset/74163686-a029-4e27-8fbf-c5bfcd13f953/resource/c5ce40d6-07b1-4f36-955a-d6196436ff6b/download/emdat-country-profiles_2024_12_02.xlsx"
data = pd.read_excel(data_url, engine="openpyxl")
data

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
0,#date +occurred,#country +name,#country +code,#cause +group,#cause +subgroup,#cause +type,#cause +subtype,#frequency,#affected +ind,#affected +ind +killed,,#value +usd,
1,2000,Afghanistan,AFG,Natural,Climatological,Drought,Drought,1,2580000,37,50000.00,88473,56.51
2,2000,Algeria,DZA,Natural,Hydrological,Flood,Flash flood,1,100,28,,,56.51
3,2000,Algeria,DZA,Natural,Meteorological,Storm,Storm (General),1,10,4,,,56.51
4,2000,Angola,AGO,Natural,Hydrological,Flood,Flood (General),3,9011,15,,,56.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,VNM,Natural,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00,,
6142,2024,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,1075,40,,,
6143,2024,Yemen,YEM,Natural,Hydrological,Flood,Flood (General),2,210439,62,,,
6144,2024,Zambia,ZMB,Natural,Climatological,Drought,Drought,1,6600000,,,,
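
The first data row in the output above holds hashtag annotations such as `#date +occurred` (they look like HXL-style tags) rather than values. One way to avoid loading such a row is the `skiprows` parameter, which both `pd.read_csv` and `pd.read_excel` accept. The following is a minimal sketch on a made-up CSV snippet, not on the original file; the sample values and the names `csv_text` and `sample` are assumptions for this example.

```python
import io

import pandas as pd

# Made-up miniature of the file layout: header, tag row, then data rows.
csv_text = """Year,Country,Total Events
#date +occurred,#country +name,#frequency
2000,Afghanistan,1
2000,Algeria,1
2000,Angola,3
"""

# skiprows=[1] skips the second line of the file, i.e. the tag row.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

print(sample)
print(sample.dtypes)
```

Because the tag row never reaches the DataFrame, the numeric columns are parsed as numbers right away instead of as `object`.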


## Data Exploration and Cleaning

### Overview of the Data

- How large is the dataset?
- How many rows and how many columns are there?



In [4]:
data.shape

(6146, 13)

- The column names

In [5]:
data.columns

Index(['Year', 'Country', 'ISO', 'Disaster Group', 'Disaster Subroup',
       'Disaster Type', 'Disaster Subtype', 'Total Events', 'Total Affected',
       'Total Deaths', 'Total Damage (USD, original)',
       'Total Damage (USD, adjusted)', 'CPI'],
      dtype='object')

`info()` gives more information about the columns

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6146 entries, 0 to 6145
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          6146 non-null   object 
 1   Country                       6146 non-null   object 
 2   ISO                           6146 non-null   object 
 3   Disaster Group                6146 non-null   object 
 4   Disaster Subroup              6146 non-null   object 
 5   Disaster Type                 6146 non-null   object 
 6   Disaster Subtype              6146 non-null   object 
 7   Total Events                  6146 non-null   object 
 8   Total Affected                4963 non-null   object 
 9   Total Deaths                  4359 non-null   object 
 10  Total Damage (USD, original)  2095 non-null   float64
 11  Total Damage (USD, adjusted)  2059 non-null   object 
 12  CPI                           5917 non-null   float64
dtypes: 

`describe()` shows the basic statistical properties of columns with a numeric data type, i.e. `int` and `float`.

The method computes:
- the number of non-missing values (`count`)
- mean
- standard deviation
- minimum and maximum
- median (the 50% quantile)
- the 25% and 75% quantiles (first and third quartile)

In [7]:
data.describe()

Unnamed: 0,"Total Damage (USD, original)",CPI
count,2095.0,5917.0
mean,1716696139.34,74.3
std,8615286125.55,11.91
min,2000.0,56.51
25%,20000000.0,64.09
50%,130000000.0,73.82
75%,801000000.0,82.41
max,210000000000.0,100.0
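
The `count` row reports how many values are present, not how many are missing. The short sketch below shows the difference on made-up numbers; `deaths` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
deaths = pd.Series([10.0, 20.0, np.nan, 40.0], name="deaths")

summary = deaths.describe()
n_missing = deaths.isna().sum()

print(summary["count"])  # number of non-missing values
print(summary["50%"])    # median of the non-missing values
print(n_missing)         # number of missing values
```

Missing values are ignored by all statistics in `describe()`, so the mean and the quantiles refer only to the rows that actually contain a number.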


### Data Cleaning


- Remove the first row of the DataFrame

Option 1: Slicing

In [11]:
# Slicing selects rows by position; data[1:] would skip the first row
data[10:15]

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
10,2000,Argentina,ARG,Natural,Meteorological,Extreme temperature,Cold wave,1,300.0,15.0,,,56.51
11,2000,Argentina,ARG,Natural,Meteorological,Storm,Blizzard/Winter storm,1,,,,,56.51
12,2000,Argentina,ARG,Natural,Meteorological,Storm,Lightning/Thunderstorms,1,430.0,1.0,,,56.51
13,2000,Armenia,ARM,Natural,Climatological,Drought,Drought,1,297000.0,,100000000.0,176946395.0,56.51
14,2000,Australia,AUS,Natural,Biological,Infestation,Locust infestation,1,,,120000000.0,212335674.0,56.51


In [12]:
data.index

RangeIndex(start=0, stop=6146, step=1)

In [15]:
data

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
0,#date +occurred,#country +name,#country +code,#cause +group,#cause +subgroup,#cause +type,#cause +subtype,#frequency,#affected +ind,#affected +ind +killed,,#value +usd,
1,2000,Afghanistan,AFG,Natural,Climatological,Drought,Drought,1,2580000,37,50000.00,88473,56.51
2,2000,Algeria,DZA,Natural,Hydrological,Flood,Flash flood,1,100,28,,,56.51
3,2000,Algeria,DZA,Natural,Meteorological,Storm,Storm (General),1,10,4,,,56.51
4,2000,Angola,AGO,Natural,Hydrological,Flood,Flood (General),3,9011,15,,,56.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,VNM,Natural,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00,,
6142,2024,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,1075,40,,,
6143,2024,Yemen,YEM,Natural,Hydrological,Flood,Flood (General),2,210439,62,,,
6144,2024,Zambia,ZMB,Natural,Climatological,Drought,Drought,1,6600000,,,,


Option 2: Drop

See also: [Drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop)

In [19]:
data.drop(index=0)

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
1,2000,Afghanistan,AFG,Natural,Climatological,Drought,Drought,1,2580000,37,50000.00,88473,56.51
2,2000,Algeria,DZA,Natural,Hydrological,Flood,Flash flood,1,100,28,,,56.51
3,2000,Algeria,DZA,Natural,Meteorological,Storm,Storm (General),1,10,4,,,56.51
4,2000,Angola,AGO,Natural,Hydrological,Flood,Flood (General),3,9011,15,,,56.51
5,2000,Angola,AGO,Natural,Hydrological,Flood,Riverine flood,1,70000,31,10000000.00,17694640,56.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,VNM,Natural,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00,,
6142,2024,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,1075,40,,,
6143,2024,Yemen,YEM,Natural,Hydrological,Flood,Flood (General),2,210439,62,,,
6144,2024,Zambia,ZMB,Natural,Climatological,Drought,Drought,1,6600000,,,,


In [20]:
data = data.drop(index=0)
data

Unnamed: 0,Year,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
1,2000,Afghanistan,AFG,Natural,Climatological,Drought,Drought,1,2580000,37,50000.00,88473,56.51
2,2000,Algeria,DZA,Natural,Hydrological,Flood,Flash flood,1,100,28,,,56.51
3,2000,Algeria,DZA,Natural,Meteorological,Storm,Storm (General),1,10,4,,,56.51
4,2000,Angola,AGO,Natural,Hydrological,Flood,Flood (General),3,9011,15,,,56.51
5,2000,Angola,AGO,Natural,Hydrological,Flood,Riverine flood,1,70000,31,10000000.00,17694640,56.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,VNM,Natural,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00,,
6142,2024,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,1075,40,,,
6143,2024,Yemen,YEM,Natural,Hydrological,Flood,Flood (General),2,210439,62,,,
6144,2024,Zambia,ZMB,Natural,Climatological,Drought,Drought,1,6600000,,,,


In [22]:
data.drop(['Year'], axis=1)
#data

Unnamed: 0,Country,ISO,Disaster Group,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)","Total Damage (USD, adjusted)",CPI
1,Afghanistan,AFG,Natural,Climatological,Drought,Drought,1,2580000,37,50000.00,88473,56.51
2,Algeria,DZA,Natural,Hydrological,Flood,Flash flood,1,100,28,,,56.51
3,Algeria,DZA,Natural,Meteorological,Storm,Storm (General),1,10,4,,,56.51
4,Angola,AGO,Natural,Hydrological,Flood,Flood (General),3,9011,15,,,56.51
5,Angola,AGO,Natural,Hydrological,Flood,Riverine flood,1,70000,31,10000000.00,17694640,56.51
...,...,...,...,...,...,...,...,...,...,...,...,...
6141,Viet Nam,VNM,Natural,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00,,
6142,Yemen,YEM,Natural,Hydrological,Flood,Flash flood,1,1075,40,,,
6143,Yemen,YEM,Natural,Hydrological,Flood,Flood (General),2,210439,62,,,
6144,Zambia,ZMB,Natural,Climatological,Drought,Drought,1,6600000,,,,


- Remove irrelevant columns

In [27]:
cols = ['ISO', 'Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
# axis=1 drops columns; drop() returns a new DataFrame, so reassign the result
data = data.drop(cols, axis=1)

In [28]:
data

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
1,2000,Afghanistan,Climatological,Drought,Drought,1,2580000,37,50000.00
2,2000,Algeria,Hydrological,Flood,Flash flood,1,100,28,
3,2000,Algeria,Meteorological,Storm,Storm (General),1,10,4,
4,2000,Angola,Hydrological,Flood,Flood (General),3,9011,15,
5,2000,Angola,Hydrological,Flood,Riverine flood,1,70000,31,10000000.00
...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,Meteorological,Storm,Tropical cyclone,3,3747257,354,2000000000.00
6142,2024,Yemen,Hydrological,Flood,Flash flood,1,1075,40,
6143,2024,Yemen,Hydrological,Flood,Flood (General),2,210439,62,
6144,2024,Zambia,Climatological,Drought,Drought,1,6600000,,
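
Two details of `drop()` are easy to get wrong: the axis (rows versus columns) and the return value. The sketch below illustrates both on a made-up frame; `df`, `scratch`, and the column names are invented for this example.

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_first_row = df.drop(index=0)   # drop a row by its index label
without_b = df.drop(columns=["b"])     # same as df.drop(["b"], axis=1)

# drop() returns a new DataFrame; the original is unchanged.
print(list(df.columns))

# With inplace=True the frame itself is modified and the call returns None,
# so assigning the result back would overwrite the variable with None.
scratch = df.copy()
result = scratch.drop(columns=["c"], inplace=True)
print(result)
print(list(scratch.columns))
```

In practice it is usually enough to pick one style: either reassign the returned DataFrame or use `inplace=True`, but not both in the same statement.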


### Selecting and Understanding Individual Columns

See also:
- [Pandas Indexing](https://pandas.pydata.org/docs/user_guide/indexing.html)

The values of a column can be accessed with `<dataframe>['<column name>']`.

In [29]:
data['Year']

1       2000
2       2000
3       2000
4       2000
5       2000
        ... 
6141    2024
6142    2024
6143    2024
6144    2024
6145    2024
Name: Year, Length: 6145, dtype: object

Multiple columns can be selected by passing a list of column names.

In [30]:
data[['Year', 'Country']]

Unnamed: 0,Year,Country
1,2000,Afghanistan
2,2000,Algeria
3,2000,Algeria
4,2000,Angola
5,2000,Angola
...,...,...
6141,2024,Viet Nam
6142,2024,Yemen
6143,2024,Yemen
6144,2024,Zambia


In [31]:
columns = ['Year', 'Country', 'Total Events', 'Total Affected']
data[columns]

Unnamed: 0,Year,Country,Total Events,Total Affected
1,2000,Afghanistan,1,2580000
2,2000,Algeria,1,100
3,2000,Algeria,1,10
4,2000,Angola,3,9011
5,2000,Angola,1,70000
...,...,...,...,...
6141,2024,Viet Nam,3,3747257
6142,2024,Yemen,1,1075
6143,2024,Yemen,2,210439
6144,2024,Zambia,1,6600000
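
The kind of brackets determines the type of the result: a single column name returns a `Series`, a list of names returns a `DataFrame`, even if the list has only one entry. A minimal sketch on a made-up frame (`df` and its values are invented for this example):

```python
import pandas as pd

# Made-up frame for illustration.
df = pd.DataFrame({"Year": [2000, 2001], "Country": ["Algeria", "Angola"]})

one_column = df["Year"]     # single name -> Series
as_frame = df[["Year"]]     # list with one name -> DataFrame

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

This matters because some methods exist only on one of the two types, and because a `DataFrame` keeps its column header when displayed.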


### Querying and Converting Data Types

See also:
- [basic dtypes](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes)
- [pandas arrays, scalars and datatypes](https://pandas.pydata.org/docs/reference/arrays.html)

In [46]:
data.describe()

Unnamed: 0,Year,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
count,6145.0,6145.0,4962.0,4358.0,2095.0
mean,2011.9,1.52,950993.11,340.69,1716696139.34
std,7.42,1.28,8392369.4,5265.72,8615286125.55
min,2000.0,1.0,1.0,1.0,2000.0
25%,2005.0,1.0,1003.0,4.0,20000000.0
50%,2012.0,1.0,10000.0,14.0,130000000.0
75%,2019.0,2.0,100552.75,49.0,801000000.0
max,2024.0,17.0,330000000.0,222570.0,210000000000.0


In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6145 entries, 1 to 6145
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          6145 non-null   int64  
 1   Country                       6145 non-null   object 
 2   Disaster Subroup              6145 non-null   object 
 3   Disaster Type                 6145 non-null   object 
 4   Disaster Subtype              6145 non-null   object 
 5   Total Events                  6145 non-null   int64  
 6   Total Affected                4962 non-null   float64
 7   Total Deaths                  4358 non-null   float64
 8   Total Damage (USD, original)  2095 non-null   float64
dtypes: float64(3), int64(2), object(4)
memory usage: 432.2+ KB


In [43]:
for x in ['Total Events', 'Total Affected', 'Total Deaths']:
    data[x] = pd.to_numeric(data[x])
    #data["Year"] = pd.to_numeric(data['Year'])

In [44]:
data

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
1,2000,Afghanistan,Climatological,Drought,Drought,1,2580000.00,37.00,50000.00
2,2000,Algeria,Hydrological,Flood,Flash flood,1,100.00,28.00,
3,2000,Algeria,Meteorological,Storm,Storm (General),1,10.00,4.00,
4,2000,Angola,Hydrological,Flood,Flood (General),3,9011.00,15.00,
5,2000,Angola,Hydrological,Flood,Riverine flood,1,70000.00,31.00,10000000.00
...,...,...,...,...,...,...,...,...,...
6141,2024,Viet Nam,Meteorological,Storm,Tropical cyclone,3,3747257.00,354.00,2000000000.00
6142,2024,Yemen,Hydrological,Flood,Flash flood,1,1075.00,40.00,
6143,2024,Yemen,Hydrological,Flood,Flood (General),2,210439.00,62.00,
6144,2024,Zambia,Climatological,Drought,Drought,1,6600000.00,,


In [None]:
data["Year"] = pd.to_numeric(data['Year'])

In [39]:
data['Year']

1       2000
2       2000
3       2000
4       2000
5       2000
        ... 
6141    2024
6142    2024
6143    2024
6144    2024
6145    2024
Name: Year, Length: 6145, dtype: int64

In [34]:
# Query the data type via the dtype attribute
data['Year'].dtype

dtype('O')

In [35]:
# Convert the data type
data["Year"] = pd.to_numeric(data["Year"])
data['Year'].dtype

dtype('int64')

In [None]:
data.columns

In [None]:
# Apply to all integer and float columns
cols = ['Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, original)']
for col in cols:
    data[col] = pd.to_numeric(data[col])

In [None]:
data.info()

In [None]:
data.describe()
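
By default, `pd.to_numeric` raises an error as soon as a value cannot be parsed, for example a leftover tag row. The optional argument `errors="coerce"` converts such entries to `NaN` instead. The sketch below uses a made-up column; `raw` and its values are assumptions for this example.

```python
import pandas as pd

# Made-up column: two numbers stored as text, one tag string, one missing value.
raw = pd.Series(["2000", "2001", "#date +occurred", None], dtype="object")

# errors="coerce" turns unparseable entries into NaN instead of raising.
years = pd.to_numeric(raw, errors="coerce")

print(years)
print(years.isna().sum())
```

Coercing is convenient but silent, so it is worth comparing the number of missing values before and after the conversion to see how many entries were lost.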

### Overview of the Numeric Data

In [47]:
data.describe()

Unnamed: 0,Year,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
count,6145.0,6145.0,4962.0,4358.0,2095.0
mean,2011.9,1.52,950993.11,340.69,1716696139.34
std,7.42,1.28,8392369.4,5265.72,8615286125.55
min,2000.0,1.0,1.0,1.0,2000.0
25%,2005.0,1.0,1003.0,4.0,20000000.0
50%,2012.0,1.0,10000.0,14.0,130000000.0
75%,2019.0,2.0,100552.75,49.0,801000000.0
max,2024.0,17.0,330000000.0,222570.0,210000000000.0


### Overview of the Object Data

- Which countries occur in the dataset?

In [48]:
data['Country'].unique()

array(['Afghanistan', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bangladesh', 'Belarus',
       'Belize', 'Bhutan', 'Bolivia (Plurinational State of)',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria',
       'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Chile', 'China',
       'Colombia', 'Costa Rica', 'Croatia', 'Cuba', 'Cyprus', 'Czechia',
       "Democratic People's Republic of Korea", 'Ecuador', 'Egypt',
       'El Salvador', 'Eswatini', 'Ethiopia', 'France', 'French Guiana',
       'Georgia', 'Greece', 'Guatemala', 'Guinea', 'Haiti', 'Honduras',
       'Hungary', 'Iceland', 'India', 'Indonesia',
       'Iran (Islamic Republic of)', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kyrgyzstan',
       "Lao People's Democratic Republic", 'Madagascar', 'Malawi',
       'Malaysia', 'Mali', 'Mexico', 'Mongolia', 'Morocco', 'Mozambique',
       'Namibia', 'Nepal', 'New Zealand', 'Nicaragua'

In [51]:
dir(data['Country'])

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__

In [None]:
# Distinct countries
countries = data['Country'].unique()
countries

In [None]:
len(countries)

In [55]:
sorted(countries)

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Canary Islands',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'China, Hong Kong Special Administrative Region',
 'China, Macao Special Administrative Region',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Côte d’Ivoire',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Eritre

In [58]:
# Check whether a country occurs in the list (exact spelling required)
'Vietnam' in countries

False

In [None]:
# Look for Germany with a case-insensitive substring search
for country in countries:
    if 'german' in country.lower():
        print(country)
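
An exact membership test only matches the exact spelling, which is why `'Vietnam'` is not found while the dataset lists `'Viet Nam'`. As an alternative to the loop, pandas offers vectorized string methods. The following is a minimal sketch on a made-up `Series`; `names` and its values are assumptions for this example.

```python
import pandas as pd

# Made-up country column for illustration.
names = pd.Series(["Germany", "Viet Nam", "Yemen", "Germany"])

# Case-insensitive substring match; na=False treats missing names as no match.
mask = names.str.contains("viet", case=False, na=False)
matches = names[mask].unique()

print(matches)
```

Searching for a short fragment such as `"viet"` is more robust against spelling variants than testing for one exact name.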

### What Are the Different Disaster Types?

In [59]:
data.columns

Index(['Year', 'Country', 'Disaster Subroup', 'Disaster Type',
       'Disaster Subtype', 'Total Events', 'Total Affected', 'Total Deaths',
       'Total Damage (USD, original)'],
      dtype='object')

In [61]:
sorted(data['Disaster Type'].unique())

['Animal incident',
 'Drought',
 'Earthquake',
 'Extreme temperature',
 'Flood',
 'Glacial lake outburst flood',
 'Impact',
 'Infestation',
 'Mass movement (dry)',
 'Mass movement (wet)',
 'Storm',
 'Volcanic activity',
 'Wildfire']

`.value_counts()` shows how often each distinct value occurs in a column.

In [63]:
data['Country'].value_counts()

Country
United States of America    247
China                       216
India                       174
Indonesia                   146
Philippines                 121
                           ... 
Tokelau                       1
Niue                          1
Bermuda                       1
Saint Helena                  1
Liechtenstein                 1
Name: count, Length: 217, dtype: int64

In [62]:
data['Disaster Type'].value_counts()

Disaster Type
Flood                          2482
Storm                          1620
Extreme temperature             489
Drought                         403
Earthquake                      400
Mass movement (wet)             346
Wildfire                        246
Volcanic activity               110
Infestation                      29
Mass movement (dry)              14
Glacial lake outburst flood       4
Impact                            1
Animal incident                   1
Name: count, dtype: int64

With the argument `normalize=True`, the counts are returned as proportions of the total.

In [64]:
data['Disaster Type'].value_counts(normalize=True)

Disaster Type
Flood                         0.40
Storm                         0.26
Extreme temperature           0.08
Drought                       0.07
Earthquake                    0.07
Mass movement (wet)           0.06
Wildfire                      0.04
Volcanic activity             0.02
Infestation                   0.00
Mass movement (dry)           0.00
Glacial lake outburst flood   0.00
Impact                        0.00
Animal incident               0.00
Name: proportion, dtype: float64
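
The proportions add up to 1 and can be turned into percentages with a simple multiplication. A minimal sketch on made-up values (`types` is invented for this example):

```python
import pandas as pd

# Made-up disaster types for illustration.
types = pd.Series(["Flood", "Flood", "Storm", "Drought"])

counts = types.value_counts()                 # absolute frequencies
shares = types.value_counts(normalize=True)   # relative frequencies
percent = (shares * 100).round(1)             # shares as percentages

print(counts["Flood"])
print(shares["Flood"])
print(percent["Storm"])
```

Rounding only for display keeps the underlying shares exact, which avoids small values such as 0.002 being shown as 0.00.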


### Sorting DataFrames

DataFrames can be sorted by one or more columns.

- Which disasters affected the most people?
- Which natural disasters were the deadliest?


In [66]:
data.sort_values(by="Total Affected", ascending=False)

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
3752,2015,India,Climatological,Drought,Drought,1,330000000.00,,3000000000.00
647,2002,India,Climatological,Drought,Drought,1,300000000.00,,910722000.00
878,2003,China,Hydrological,Flood,Riverine flood,6,155924986.00,662.00,15329640000.00
2605,2010,China,Hydrological,Flood,Riverine flood,5,140194136.00,1911.00,18171000000.00
1894,2007,China,Hydrological,Flood,Riverine flood,9,108793242.00,967.00,4919155000.00
...,...,...,...,...,...,...,...,...,...
6105,2024,Switzerland,Meteorological,Storm,Extra-tropical storm,1,,6.00,
6114,2024,Türkiye,Climatological,Wildfire,Wildfire (General),1,,,
6120,2024,United Arab Emirates,Hydrological,Flood,Flash flood,1,,4.00,
6125,2024,United Republic of Tanzania,Hydrological,Mass movement (wet),Landslide (wet),1,,22.00,


In [70]:
data.sort_values(by="Total Affected", ascending=False).head(10)

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
3752,2015,India,Climatological,Drought,Drought,1,330000000.0,,3000000000.0
647,2002,India,Climatological,Drought,Drought,1,300000000.0,,910722000.0
878,2003,China,Hydrological,Flood,Riverine flood,6,155924986.0,662.0,15329640000.0
2605,2010,China,Hydrological,Flood,Riverine flood,5,140194136.0,1911.0,18171000000.0
1894,2007,China,Hydrological,Flood,Riverine flood,9,108793242.0,967.0,4919155000.0
589,2002,China,Meteorological,Storm,Sand/Dust storm,1,100000000.0,,
2863,2011,China,Hydrological,Flood,Riverine flood,5,93360000.0,628.0,10704130000.0
4099,2016,United States of America,Meteorological,Storm,Blizzard/Winter storm,4,85000057.0,90.0,2125000000.0
583,2002,China,Hydrological,Flood,Flash flood,1,80035257.0,793.0,3100000000.0
2152,2008,China,Meteorological,Extreme temperature,Severe winter conditions,2,77000000.0,145.0,21100000000.0


In [71]:
data.sort_values(by="Total Deaths", ascending=False).head(10)

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
2661,2010,Haiti,Geophysical,Earthquake,Ground movement,1,3700000.0,222570.0,8000000000.0
1164,2004,Indonesia,Geophysical,Earthquake,Tsunami,1,532898.0,165708.0,4451600000.0
2250,2008,Myanmar,Meteorological,Storm,Tropical cyclone,1,2420000.0,138366.0,4000000000.0
2148,2008,China,Geophysical,Earthquake,Ground movement,7,47369797.0,87564.0,85492000000.0
1497,2005,Pakistan,Geophysical,Earthquake,Ground movement,1,5128309.0,73338.0,5200000000.0
2761,2010,Russian Federation,Meteorological,Extreme temperature,Heat wave,1,,55736.0,400000000.0
5882,2023,Türkiye,Geophysical,Earthquake,Ground movement,3,16107494.0,53007.0,34000000000.0
1270,2004,Sri Lanka,Geophysical,Earthquake,Tsunami,1,1019306.0,35399.0,1316500000.0
945,2003,Iran (Islamic Republic of),Geophysical,Earthquake,Ground movement,5,297049.0,26797.0,521666000.0
950,2003,Italy,Meteorological,Extreme temperature,Heat wave,1,,20089.0,4400000000.0


In [73]:
# Sorting by several columns is possible, each with its own direction
data.sort_values(by=["Disaster Subroup", "Total Affected"], ascending=[False, True]).head(n=10)

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
1488,2005,Netherlands (Kingdom of the),Meteorological,Storm,Extra-tropical storm,1,1.0,,
3238,2012,United States of America,Meteorological,Storm,Blizzard/Winter storm,3,1.0,27.0,202000000.0
3502,2014,Germany,Meteorological,Storm,Lightning/Thunderstorms,2,1.0,8.0,400000000.0
5029,2020,Taiwan (Province of China),Meteorological,Storm,Tropical cyclone,1,1.0,1.0,
5988,2024,France,Meteorological,Storm,Extra-tropical storm,1,1.0,8.0,
225,2000,Spain,Meteorological,Storm,Storm (General),2,2.0,14.0,
784,2002,Switzerland,Meteorological,Storm,Storm (General),1,2.0,1.0,
1419,2005,Germany,Meteorological,Storm,Extra-tropical storm,1,2.0,2.0,270000000.0
1530,2005,Russian Federation,Meteorological,Storm,Extra-tropical storm,1,2.0,,
1861,2007,Belgium,Meteorological,Storm,Extra-tropical storm,1,2.0,2.0,450000000.0


### Analyzing Individual Aspects by Filtering Data

- How many natural disasters have there been in Germany since 2000?
- What share of them are storms?
- How many people are affected each year?


To answer these questions, we filter the data for Germany and compute statistics.

See also:

- [Selecting Subsets in Pandas](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html)
- [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html)
- [How to Calculate Summary Statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)

In [77]:
data['Country']

1       Afghanistan
2           Algeria
3           Algeria
4            Angola
5            Angola
           ...     
6141       Viet Nam
6142          Yemen
6143          Yemen
6144         Zambia
6145       Zimbabwe
Name: Country, Length: 6145, dtype: object

In [78]:
data[data['Country'] == 'Germany']

Unnamed: 0,Year,Country,Disaster Subroup,Disaster Type,Disaster Subtype,Total Events,Total Affected,Total Deaths,"Total Damage (USD, original)"
365,2001,Germany,Meteorological,Storm,Lightning/Thunderstorms,1,,6.0,300000000.0
626,2002,Germany,Hydrological,Flood,Flood (General),1,330108.0,27.0,11600000000.0
627,2002,Germany,Meteorological,Storm,Extra-tropical storm,1,,11.0,1800000000.0
628,2002,Germany,Meteorological,Storm,Storm (General),2,19.0,11.0,250000000.0
915,2003,Germany,Meteorological,Extreme temperature,Heat wave,1,,9355.0,1650000000.0
916,2003,Germany,Meteorological,Storm,Extra-tropical storm,1,,5.0,300000000.0
917,2003,Germany,Meteorological,Storm,Lightning/Thunderstorms,1,,10.0,
1148,2004,Germany,Geophysical,Earthquake,Ground movement,1,150.0,,12000000.0
1149,2004,Germany,Meteorological,Storm,Storm (General),1,,2.0,130000000.0
1417,2005,Germany,Hydrological,Flood,Riverine flood,2,450.0,1.0,220000000.0


In [None]:
german_data = data[data['Country'] == 'Germany']
german_data.head(5)
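
The expression inside the brackets is a boolean `Series` (a mask) with one `True` or `False` per row. Several masks can be combined with `&` (and) and `|` (or). The following is a minimal sketch on a small hand-typed frame; `df`, `is_germany`, and `many_deaths` are names chosen for this example.

```python
import pandas as pd

# Small hand-typed frame for illustration.
df = pd.DataFrame(
    {
        "Year": [2002, 2003, 2003, 2004],
        "Country": ["Germany", "Germany", "Italy", "Germany"],
        "Total Deaths": [27.0, 9355.0, 20089.0, 2.0],
    }
)

is_germany = df["Country"] == "Germany"
many_deaths = df["Total Deaths"] > 15

# Both conditions must hold. When written inline, each condition needs its
# own parentheses: df[(df["Country"] == "Germany") & (df["Total Deaths"] > 15)]
subset = df[is_germany & many_deaths]

print(subset)
```

Giving each mask its own name keeps longer filters readable and makes it easy to reuse a condition.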

## Exercises

- How many natural disasters have there been in Germany since 2000?

- When and what were the worst natural disasters in Germany?

- How many people in total were affected by natural disasters in Germany?

- How many people died in natural disasters in Germany in 2024?

- In which disasters in Germany did more than 15 people die?

- How often do the individual natural disaster types occur in Germany?

### Groupby

The `groupby` method in pandas organizes data into groups based on the values of one or more columns.
Further operations such as calculations or transformations can then be applied to these groups.

##### The Process

- **Split**: The data is grouped by the values of certain columns.
- **Apply**: An operation (e.g. sum, mean, count) is applied to each group.
- **Combine**: The results are combined into a new DataFrame or Series.


- How many people were affected per natural disaster type in Germany?

In [None]:
german_data.groupby(['Disaster Type'])['Total Affected'].sum()

- How many people in Germany are affected by natural disasters per year?

In [None]:
german_data.groupby(['Year'])['Total Affected'].sum()

- How many people in Germany are affected per year and disaster type?

In [None]:
german_data.groupby(['Year', 'Disaster Type'])['Total Affected'].sum()
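
The split, apply, and combine steps can be followed on a frame that is small enough to check by hand. The sketch below uses made-up numbers and additionally shows `agg`, which applies several operations at once; `df` and its values are assumptions for this example.

```python
import pandas as pd

# Made-up frame; None becomes a missing value (NaN).
df = pd.DataFrame(
    {
        "Year": [2002, 2002, 2003, 2003],
        "Disaster Type": ["Flood", "Storm", "Storm", "Storm"],
        "Total Affected": [300.0, 20.0, None, 50.0],
    }
)

# Split by type, apply sum to each group, combine into a Series.
per_type = df.groupby("Disaster Type")["Total Affected"].sum()

# Several aggregations at once; the result is a DataFrame.
summary = df.groupby("Disaster Type")["Total Affected"].agg(["sum", "mean", "count"])

print(per_type)
print(summary)
```

Missing values are skipped by `sum`, `mean`, and `count`, so a group total can hide the fact that some rows had no reported figure. Comparing `count` with the group size makes this visible.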