Datenanalyse Teil II#
Natalie Widmann#
Wintersemester 2023 / 2024
Aggregated figures for Natural Disasters in EM-DAT#
Link: https://data.humdata.org/dataset/emdat-country-profiles
In 1988, the Centre for Research on the Epidemiology of Disasters (CRED) launched the Emergency Events Database (EM-DAT). EM-DAT was created with the initial support of the World Health Organisation (WHO) and the Belgian Government.
The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.
EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.
# Install a pip package im Jupyter Notebook
!pip3 install pandas
!pip3 install openpyxl
Requirement already satisfied: pandas in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (2.1.3)
Requirement already satisfied: numpy<2,>=1.23.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (1.26.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: openpyxl in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (3.1.2)
Requirement already satisfied: et-xmlfile in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from openpyxl) (1.1.0)
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)
data = pd.read_excel('../../data/emdat.xlsx', engine="openpyxl")
data
Year | Country | ISO | Disaster Group | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | Total Damage (USD, adjusted) | CPI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | #date +occurred | #country +name | #country +code | #cause +group | #cause +subgroup | #cause +type | #cause +subtype | #frequency | #affected +ind | #affected +ind +killed | NaN | #value +usd | NaN |
1 | 2000 | Afghanistan | AFG | Natural | Climatological | Drought | Drought | 1 | 2580000 | 37 | 50000.00 | 84975 | 58.84 |
2 | 2000 | Algeria | DZA | Natural | Hydrological | Flood | Flash flood | 2 | 105 | 37 | NaN | NaN | 58.84 |
3 | 2000 | Algeria | DZA | Natural | Hydrological | Flood | Flood (General) | 2 | 100 | 7 | NaN | NaN | 58.84 |
4 | 2000 | Algeria | DZA | Natural | Meteorological | Storm | Storm (General) | 1 | 10 | 4 | NaN | NaN | 58.84 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | 2023 | Viet Nam | VNM | Natural | Meteorological | Storm | Tropical cyclone | 1 | 3 | 1 | NaN | NaN | NaN |
5835 | 2023 | Yemen | YEM | Natural | Hydrological | Flood | Flood (General) | 1 | 169035 | 39 | NaN | NaN | NaN |
5836 | 2023 | Zambia | ZMB | Natural | Hydrological | Flood | Flash flood | 1 | 154608 | NaN | NaN | NaN | NaN |
5837 | 2023 | Zambia | ZMB | Natural | Hydrological | Flood | Flood (General) | 1 | 22000 | NaN | NaN | NaN | NaN |
5838 | 2023 | Zimbabwe | ZWE | Natural | Meteorological | Storm | Tropical cyclone | 1 | NaN | 2 | NaN | NaN | NaN |
5839 rows × 13 columns
Data Cleaning#
Entferne erste Zeile
data = data.drop(index=0)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 5838 non-null object
1 Country 5838 non-null object
2 ISO 5838 non-null object
3 Disaster Group 5838 non-null object
4 Disaster Subroup 5838 non-null object
5 Disaster Type 5838 non-null object
6 Disaster Subtype 5838 non-null object
7 Total Events 5838 non-null object
8 Total Affected 4692 non-null object
9 Total Deaths 4118 non-null object
10 Total Damage (USD, original) 2032 non-null float64
11 Total Damage (USD, adjusted) 1999 non-null object
12 CPI 5648 non-null float64
dtypes: float64(2), object(11)
memory usage: 593.1+ KB
Änderung des Datentyps für integeger und float Spalten
cols = ['Year', 'Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, adjusted)']
for col in cols:
data[col] = pd.to_numeric(data[col])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 5838 non-null int64
1 Country 5838 non-null object
2 ISO 5838 non-null object
3 Disaster Group 5838 non-null object
4 Disaster Subroup 5838 non-null object
5 Disaster Type 5838 non-null object
6 Disaster Subtype 5838 non-null object
7 Total Events 5838 non-null int64
8 Total Affected 4692 non-null float64
9 Total Deaths 4118 non-null float64
10 Total Damage (USD, original) 2032 non-null float64
11 Total Damage (USD, adjusted) 1999 non-null float64
12 CPI 5648 non-null float64
dtypes: float64(5), int64(2), object(6)
memory usage: 593.1+ KB
Entferne irrelevante Spalten, hier: ISO, Disaster Group, Total Damage (USD, adjusted)
cols = ['ISO', 'Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
data.drop(cols, axis=1, inplace=True)
data
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
2 | 2000 | Algeria | Hydrological | Flood | Flash flood | 2 | 105.00 | 37.00 | NaN |
3 | 2000 | Algeria | Hydrological | Flood | Flood (General) | 2 | 100.00 | 7.00 | NaN |
4 | 2000 | Algeria | Meteorological | Storm | Storm (General) | 1 | 10.00 | 4.00 | NaN |
5 | 2000 | Angola | Hydrological | Flood | Flood (General) | 3 | 9011.00 | 15.00 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | 2023 | Viet Nam | Meteorological | Storm | Tropical cyclone | 1 | 3.00 | 1.00 | NaN |
5835 | 2023 | Yemen | Hydrological | Flood | Flood (General) | 1 | 169035.00 | 39.00 | NaN |
5836 | 2023 | Zambia | Hydrological | Flood | Flash flood | 1 | 154608.00 | NaN | NaN |
5837 | 2023 | Zambia | Hydrological | Flood | Flood (General) | 1 | 22000.00 | NaN | NaN |
5838 | 2023 | Zimbabwe | Meteorological | Storm | Tropical cyclone | 1 | NaN | 2.00 | NaN |
5838 rows × 9 columns
Fehlende Werte - NaNs Values#
Wo fehlen Werte im Datensatz? Wie viele fehlen?
len(data)
5838
Wie viele Werte pro Spalte sind NaN?#
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 5838 non-null int64
1 Country 5838 non-null object
2 Disaster Subroup 5838 non-null object
3 Disaster Type 5838 non-null object
4 Disaster Subtype 5838 non-null object
5 Total Events 5838 non-null int64
6 Total Affected 4692 non-null float64
7 Total Deaths 4118 non-null float64
8 Total Damage (USD, original) 2032 non-null float64
dtypes: float64(3), int64(2), object(4)
memory usage: 410.6+ KB
data['Total Affected'].isnull()
1 False
2 False
3 False
4 False
5 False
...
5834 False
5835 False
5836 False
5837 False
5838 True
Name: Total Affected, Length: 5838, dtype: bool
data['Total Affected'].isnull().sum()
1146
data.isnull()
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | True |
3 | False | False | False | False | False | False | False | False | True |
4 | False | False | False | False | False | False | False | False | True |
5 | False | False | False | False | False | False | False | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | False | False | False | False | False | False | False | False | True |
5835 | False | False | False | False | False | False | False | False | True |
5836 | False | False | False | False | False | False | False | True | True |
5837 | False | False | False | False | False | False | False | True | True |
5838 | False | False | False | False | False | False | True | False | True |
5838 rows × 9 columns
data.isnull().sum()
Year 0
Country 0
Disaster Subroup 0
Disaster Type 0
Disaster Subtype 0
Total Events 0
Total Affected 1146
Total Deaths 1720
Total Damage (USD, original) 3806
dtype: int64
Möglichkeiten zum Umgang mit fehlenden Werte#
Zeilen mit fehlenden Werten entfernen
fehlende Werte ersetzen
Alle Zeilen mit NaN Werten entfernen#
dropna()
clean_data = data.dropna()
clean_data
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
6 | 2000 | Angola | Hydrological | Flood | Riverine flood | 1 | 70000.00 | 31.00 | 10000000.00 |
9 | 2000 | Argentina | Hydrological | Flood | Flood (General) | 3 | 31840.00 | 6.00 | 300000000.00 |
18 | 2000 | Australia | Meteorological | Storm | Tropical cyclone | 3 | 1350.00 | 1.00 | 150000000.00 |
22 | 2000 | Azerbaijan | Geophysical | Earthquake | Ground movement | 1 | 3294.00 | 31.00 | 10000000.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5824 | 2023 | United States of America | Climatological | Wildfire | Wildfire (General) | 2 | 8497.00 | 130.00 | 3000000000.00 |
5827 | 2023 | United States of America | Meteorological | Extreme temperature | Severe winter conditions | 1 | 25.00 | 10.00 | 380000000.00 |
5828 | 2023 | United States of America | Meteorological | Storm | Severe weather | 1 | 551.00 | 3.00 | 250000000.00 |
5829 | 2023 | United States of America | Meteorological | Storm | Storm (General) | 3 | 912.00 | 37.00 | 7000000000.00 |
5830 | 2023 | United States of America | Meteorological | Storm | Tornado | 5 | 14217.00 | 75.00 | 19560000000.00 |
1500 rows × 9 columns
Entferne nur Zeilen die in einer bestimmten Spalte NaN Werte haben
clean_data = data.dropna(subset=['Disaster Subtype', 'Total Affected'])
clean_data
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
2 | 2000 | Algeria | Hydrological | Flood | Flash flood | 2 | 105.00 | 37.00 | NaN |
3 | 2000 | Algeria | Hydrological | Flood | Flood (General) | 2 | 100.00 | 7.00 | NaN |
4 | 2000 | Algeria | Meteorological | Storm | Storm (General) | 1 | 10.00 | 4.00 | NaN |
5 | 2000 | Angola | Hydrological | Flood | Flood (General) | 3 | 9011.00 | 15.00 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5833 | 2023 | Viet Nam | Hydrological | Flood | Flood (General) | 1 | 463.00 | 4.00 | NaN |
5834 | 2023 | Viet Nam | Meteorological | Storm | Tropical cyclone | 1 | 3.00 | 1.00 | NaN |
5835 | 2023 | Yemen | Hydrological | Flood | Flood (General) | 1 | 169035.00 | 39.00 | NaN |
5836 | 2023 | Zambia | Hydrological | Flood | Flash flood | 1 | 154608.00 | NaN | NaN |
5837 | 2023 | Zambia | Hydrological | Flood | Flood (General) | 1 | 22000.00 | NaN | NaN |
4692 rows × 9 columns
Fehlende Werte entfernen#
Vorteile
einfache Handhabung
verständliche Methodik
Nachteile
ein großer Teil der Daten wird nicht ausgewertet
fehlende Werte können eine ganz andere Geschichte erzählen
die Konsistenz über unterschiedliche Fragestellungen hinweg muss sichergestellt sein
Fehlende Werte ersetzen#
Mit .fillna()
können alle NaN Werte in einem DataFrame oder einer Series automatisch ersetzt werden.
Das Argument inplace=True
wendet die Methode direkt auf den DataFrame an.
data.fillna(0)
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
2 | 2000 | Algeria | Hydrological | Flood | Flash flood | 2 | 105.00 | 37.00 | 0.00 |
3 | 2000 | Algeria | Hydrological | Flood | Flood (General) | 2 | 100.00 | 7.00 | 0.00 |
4 | 2000 | Algeria | Meteorological | Storm | Storm (General) | 1 | 10.00 | 4.00 | 0.00 |
5 | 2000 | Angola | Hydrological | Flood | Flood (General) | 3 | 9011.00 | 15.00 | 0.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | 2023 | Viet Nam | Meteorological | Storm | Tropical cyclone | 1 | 3.00 | 1.00 | 0.00 |
5835 | 2023 | Yemen | Hydrological | Flood | Flood (General) | 1 | 169035.00 | 39.00 | 0.00 |
5836 | 2023 | Zambia | Hydrological | Flood | Flash flood | 1 | 154608.00 | 0.00 | 0.00 |
5837 | 2023 | Zambia | Hydrological | Flood | Flood (General) | 1 | 22000.00 | 0.00 | 0.00 |
5838 | 2023 | Zimbabwe | Meteorological | Storm | Tropical cyclone | 1 | 0.00 | 2.00 | 0.00 |
5838 rows × 9 columns
Welche Werte für Total Affected, Total Deaths oder Total Damage?
0
arithmetisches Mittel
Median
Berechnung aus anderen Spalten
data["Total Deaths"].fillna(0, inplace=True)
data["Total Affected"].fillna(data['Total Deaths'], inplace=True)
data['Total Damage (USD, original)'].mean()
1639896721.4173229
data['Total Damage (USD, original)'] = data['Total Damage (USD, original)'].fillna(data['Total Damage (USD, original)'].mean())
data
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
2 | 2000 | Algeria | Hydrological | Flood | Flash flood | 2 | 105.00 | 37.00 | 1639896721.42 |
3 | 2000 | Algeria | Hydrological | Flood | Flood (General) | 2 | 100.00 | 7.00 | 1639896721.42 |
4 | 2000 | Algeria | Meteorological | Storm | Storm (General) | 1 | 10.00 | 4.00 | 1639896721.42 |
5 | 2000 | Angola | Hydrological | Flood | Flood (General) | 3 | 9011.00 | 15.00 | 1639896721.42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | 2023 | Viet Nam | Meteorological | Storm | Tropical cyclone | 1 | 3.00 | 1.00 | 1639896721.42 |
5835 | 2023 | Yemen | Hydrological | Flood | Flood (General) | 1 | 169035.00 | 39.00 | 1639896721.42 |
5836 | 2023 | Zambia | Hydrological | Flood | Flash flood | 1 | 154608.00 | 0.00 | 1639896721.42 |
5837 | 2023 | Zambia | Hydrological | Flood | Flood (General) | 1 | 22000.00 | 0.00 | 1639896721.42 |
5838 | 2023 | Zimbabwe | Meteorological | Storm | Tropical cyclone | 1 | 2.00 | 2.00 | 1639896721.42 |
5838 rows × 9 columns
Fehlende Werte ersetzen#
Vorteile
Verwendung möglichst aller Daten
genaue Analyse der fehlenden Werte kann zu neuen Erkenntnissen führen
Nachteile
keine Standard Methodik
eventuelle Verfälschung der Daten
Duplikate#
.duplicated()
findet Zeilen mit genau den gleichen Werte.
Mit drop_duplicates()
können diese entfernt werden.
data.duplicated().sum()
0
data.drop_duplicates()
Year | Country | Disaster Subroup | Disaster Type | Disaster Subtype | Total Events | Total Affected | Total Deaths | Total Damage (USD, original) | |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | Afghanistan | Climatological | Drought | Drought | 1 | 2580000.00 | 37.00 | 50000.00 |
2 | 2000 | Algeria | Hydrological | Flood | Flash flood | 2 | 105.00 | 37.00 | 1639896721.42 |
3 | 2000 | Algeria | Hydrological | Flood | Flood (General) | 2 | 100.00 | 7.00 | 1639896721.42 |
4 | 2000 | Algeria | Meteorological | Storm | Storm (General) | 1 | 10.00 | 4.00 | 1639896721.42 |
5 | 2000 | Angola | Hydrological | Flood | Flood (General) | 3 | 9011.00 | 15.00 | 1639896721.42 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5834 | 2023 | Viet Nam | Meteorological | Storm | Tropical cyclone | 1 | 3.00 | 1.00 | 1639896721.42 |
5835 | 2023 | Yemen | Hydrological | Flood | Flood (General) | 1 | 169035.00 | 39.00 | 1639896721.42 |
5836 | 2023 | Zambia | Hydrological | Flood | Flash flood | 1 | 154608.00 | 0.00 | 1639896721.42 |
5837 | 2023 | Zambia | Hydrological | Flood | Flood (General) | 1 | 22000.00 | 0.00 | 1639896721.42 |
5838 | 2023 | Zimbabwe | Meteorological | Storm | Tropical cyclone | 1 | 2.00 | 2.00 | 1639896721.42 |
5838 rows × 9 columns
Daten speichern & Neu laden#
data.to_csv('../../data/clean_emdat.csv')