Datenanalyse Teil II#

Natalie Widmann#

Wintersemester 2023 / 2024

Aggregated figures for Natural Disasters in EM-DAT#

Link: https://data.humdata.org/dataset/emdat-country-profiles

In 1988, the Centre for Research on the Epidemiology of Disasters (CRED) launched the Emergency Events Database (EM-DAT). EM-DAT was created with the initial support of the World Health Organisation (WHO) and the Belgian Government.

The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.

# Install a pip package im Jupyter Notebook
!pip3 install pandas
!pip3 install openpyxl
Requirement already satisfied: pandas in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (2.1.3)
Requirement already satisfied: numpy<2,>=1.23.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (1.26.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: openpyxl in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (3.1.2)
Requirement already satisfied: et-xmlfile in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from openpyxl) (1.1.0)
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)

data = pd.read_excel('../../data/emdat.xlsx', engine="openpyxl")
data
Year Country ISO Disaster Group Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original) Total Damage (USD, adjusted) CPI
0 #date +occurred #country +name #country +code #cause +group #cause +subgroup #cause +type #cause +subtype #frequency #affected +ind #affected +ind +killed NaN #value +usd NaN
1 2000 Afghanistan AFG Natural Climatological Drought Drought 1 2580000 37 50000.00 84975 58.84
2 2000 Algeria DZA Natural Hydrological Flood Flash flood 2 105 37 NaN NaN 58.84
3 2000 Algeria DZA Natural Hydrological Flood Flood (General) 2 100 7 NaN NaN 58.84
4 2000 Algeria DZA Natural Meteorological Storm Storm (General) 1 10 4 NaN NaN 58.84
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5834 2023 Viet Nam VNM Natural Meteorological Storm Tropical cyclone 1 3 1 NaN NaN NaN
5835 2023 Yemen YEM Natural Hydrological Flood Flood (General) 1 169035 39 NaN NaN NaN
5836 2023 Zambia ZMB Natural Hydrological Flood Flash flood 1 154608 NaN NaN NaN NaN
5837 2023 Zambia ZMB Natural Hydrological Flood Flood (General) 1 22000 NaN NaN NaN NaN
5838 2023 Zimbabwe ZWE Natural Meteorological Storm Tropical cyclone 1 NaN 2 NaN NaN NaN

5839 rows × 13 columns

Data Cleaning#

Entferne erste Zeile

data = data.drop(index=0)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   object 
 1   Country                       5838 non-null   object 
 2   ISO                           5838 non-null   object 
 3   Disaster Group                5838 non-null   object 
 4   Disaster Subroup              5838 non-null   object 
 5   Disaster Type                 5838 non-null   object 
 6   Disaster Subtype              5838 non-null   object 
 7   Total Events                  5838 non-null   object 
 8   Total Affected                4692 non-null   object 
 9   Total Deaths                  4118 non-null   object 
 10  Total Damage (USD, original)  2032 non-null   float64
 11  Total Damage (USD, adjusted)  1999 non-null   object 
 12  CPI                           5648 non-null   float64
dtypes: float64(2), object(11)
memory usage: 593.1+ KB

Änderung des Datentyps für integeger und float Spalten

cols = ['Year', 'Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, adjusted)']
for col in cols:
    data[col] = pd.to_numeric(data[col])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   int64  
 1   Country                       5838 non-null   object 
 2   ISO                           5838 non-null   object 
 3   Disaster Group                5838 non-null   object 
 4   Disaster Subroup              5838 non-null   object 
 5   Disaster Type                 5838 non-null   object 
 6   Disaster Subtype              5838 non-null   object 
 7   Total Events                  5838 non-null   int64  
 8   Total Affected                4692 non-null   float64
 9   Total Deaths                  4118 non-null   float64
 10  Total Damage (USD, original)  2032 non-null   float64
 11  Total Damage (USD, adjusted)  1999 non-null   float64
 12  CPI                           5648 non-null   float64
dtypes: float64(5), int64(2), object(6)
memory usage: 593.1+ KB

Entferne irrelevante Spalten, hier: ISO, Disaster Group, Total Damage (USD, adjusted)

cols = ['ISO', 'Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
data.drop(cols, axis=1, inplace=True)
data
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
2 2000 Algeria Hydrological Flood Flash flood 2 105.00 37.00 NaN
3 2000 Algeria Hydrological Flood Flood (General) 2 100.00 7.00 NaN
4 2000 Algeria Meteorological Storm Storm (General) 1 10.00 4.00 NaN
5 2000 Angola Hydrological Flood Flood (General) 3 9011.00 15.00 NaN
... ... ... ... ... ... ... ... ... ...
5834 2023 Viet Nam Meteorological Storm Tropical cyclone 1 3.00 1.00 NaN
5835 2023 Yemen Hydrological Flood Flood (General) 1 169035.00 39.00 NaN
5836 2023 Zambia Hydrological Flood Flash flood 1 154608.00 NaN NaN
5837 2023 Zambia Hydrological Flood Flood (General) 1 22000.00 NaN NaN
5838 2023 Zimbabwe Meteorological Storm Tropical cyclone 1 NaN 2.00 NaN

5838 rows × 9 columns

Fehlende Werte - NaNs Values#

Wo fehlen Werte im Datensatz? Wie viele fehlen?

len(data)
5838

Wie viele Werte pro Spalte sind NaN?#

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   int64  
 1   Country                       5838 non-null   object 
 2   Disaster Subroup              5838 non-null   object 
 3   Disaster Type                 5838 non-null   object 
 4   Disaster Subtype              5838 non-null   object 
 5   Total Events                  5838 non-null   int64  
 6   Total Affected                4692 non-null   float64
 7   Total Deaths                  4118 non-null   float64
 8   Total Damage (USD, original)  2032 non-null   float64
dtypes: float64(3), int64(2), object(4)
memory usage: 410.6+ KB
data['Total Affected'].isnull()
1       False
2       False
3       False
4       False
5       False
        ...  
5834    False
5835    False
5836    False
5837    False
5838     True
Name: Total Affected, Length: 5838, dtype: bool
data['Total Affected'].isnull().sum()
1146
data.isnull()
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 False False False False False False False False False
2 False False False False False False False False True
3 False False False False False False False False True
4 False False False False False False False False True
5 False False False False False False False False True
... ... ... ... ... ... ... ... ... ...
5834 False False False False False False False False True
5835 False False False False False False False False True
5836 False False False False False False False True True
5837 False False False False False False False True True
5838 False False False False False False True False True

5838 rows × 9 columns

data.isnull().sum()
Year                               0
Country                            0
Disaster Subroup                   0
Disaster Type                      0
Disaster Subtype                   0
Total Events                       0
Total Affected                  1146
Total Deaths                    1720
Total Damage (USD, original)    3806
dtype: int64

Möglichkeiten zum Umgang mit fehlenden Werte#

  • Zeilen mit fehlenden Werten entfernen

  • fehlende Werte ersetzen

Alle Zeilen mit NaN Werten entfernen#

dropna()

clean_data = data.dropna()
clean_data
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
6 2000 Angola Hydrological Flood Riverine flood 1 70000.00 31.00 10000000.00
9 2000 Argentina Hydrological Flood Flood (General) 3 31840.00 6.00 300000000.00
18 2000 Australia Meteorological Storm Tropical cyclone 3 1350.00 1.00 150000000.00
22 2000 Azerbaijan Geophysical Earthquake Ground movement 1 3294.00 31.00 10000000.00
... ... ... ... ... ... ... ... ... ...
5824 2023 United States of America Climatological Wildfire Wildfire (General) 2 8497.00 130.00 3000000000.00
5827 2023 United States of America Meteorological Extreme temperature Severe winter conditions 1 25.00 10.00 380000000.00
5828 2023 United States of America Meteorological Storm Severe weather 1 551.00 3.00 250000000.00
5829 2023 United States of America Meteorological Storm Storm (General) 3 912.00 37.00 7000000000.00
5830 2023 United States of America Meteorological Storm Tornado 5 14217.00 75.00 19560000000.00

1500 rows × 9 columns

Entferne nur Zeilen die in einer bestimmten Spalte NaN Werte haben

clean_data = data.dropna(subset=['Disaster Subtype', 'Total Affected'])
clean_data
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
2 2000 Algeria Hydrological Flood Flash flood 2 105.00 37.00 NaN
3 2000 Algeria Hydrological Flood Flood (General) 2 100.00 7.00 NaN
4 2000 Algeria Meteorological Storm Storm (General) 1 10.00 4.00 NaN
5 2000 Angola Hydrological Flood Flood (General) 3 9011.00 15.00 NaN
... ... ... ... ... ... ... ... ... ...
5833 2023 Viet Nam Hydrological Flood Flood (General) 1 463.00 4.00 NaN
5834 2023 Viet Nam Meteorological Storm Tropical cyclone 1 3.00 1.00 NaN
5835 2023 Yemen Hydrological Flood Flood (General) 1 169035.00 39.00 NaN
5836 2023 Zambia Hydrological Flood Flash flood 1 154608.00 NaN NaN
5837 2023 Zambia Hydrological Flood Flood (General) 1 22000.00 NaN NaN

4692 rows × 9 columns

Fehlende Werte entfernen#

Vorteile

  • einfache Handhabung

  • verständliche Methodik

Nachteile

  • ein großer Teil der Daten wird nicht ausgewertet

  • fehlende Werte können eine ganz andere Geschichte erzählen

  • die Konsistenz über unterschiedliche Fragestellungen hinweg muss sichergestellt sein

Fehlende Werte ersetzen#

Mit .fillna() können alle NaN Werte in einem DataFrame oder einer Series automatisch ersetzt werden.

Das Argument inplace=True wendet die Methode direkt auf den DataFrame an.

data.fillna(0)
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
2 2000 Algeria Hydrological Flood Flash flood 2 105.00 37.00 0.00
3 2000 Algeria Hydrological Flood Flood (General) 2 100.00 7.00 0.00
4 2000 Algeria Meteorological Storm Storm (General) 1 10.00 4.00 0.00
5 2000 Angola Hydrological Flood Flood (General) 3 9011.00 15.00 0.00
... ... ... ... ... ... ... ... ... ...
5834 2023 Viet Nam Meteorological Storm Tropical cyclone 1 3.00 1.00 0.00
5835 2023 Yemen Hydrological Flood Flood (General) 1 169035.00 39.00 0.00
5836 2023 Zambia Hydrological Flood Flash flood 1 154608.00 0.00 0.00
5837 2023 Zambia Hydrological Flood Flood (General) 1 22000.00 0.00 0.00
5838 2023 Zimbabwe Meteorological Storm Tropical cyclone 1 0.00 2.00 0.00

5838 rows × 9 columns

Welche Werte für Total Affected, Total Deaths oder Total Damage?

  • 0

  • arithmetisches Mittel

  • Median

  • Berechnung aus anderen Spalten

data["Total Deaths"].fillna(0, inplace=True)
data["Total Affected"].fillna(data['Total Deaths'], inplace=True)
data['Total Damage (USD, original)'].mean()
1639896721.4173229
data['Total Damage (USD, original)'] = data['Total Damage (USD, original)'].fillna(data['Total Damage (USD, original)'].mean())
data
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
2 2000 Algeria Hydrological Flood Flash flood 2 105.00 37.00 1639896721.42
3 2000 Algeria Hydrological Flood Flood (General) 2 100.00 7.00 1639896721.42
4 2000 Algeria Meteorological Storm Storm (General) 1 10.00 4.00 1639896721.42
5 2000 Angola Hydrological Flood Flood (General) 3 9011.00 15.00 1639896721.42
... ... ... ... ... ... ... ... ... ...
5834 2023 Viet Nam Meteorological Storm Tropical cyclone 1 3.00 1.00 1639896721.42
5835 2023 Yemen Hydrological Flood Flood (General) 1 169035.00 39.00 1639896721.42
5836 2023 Zambia Hydrological Flood Flash flood 1 154608.00 0.00 1639896721.42
5837 2023 Zambia Hydrological Flood Flood (General) 1 22000.00 0.00 1639896721.42
5838 2023 Zimbabwe Meteorological Storm Tropical cyclone 1 2.00 2.00 1639896721.42

5838 rows × 9 columns

Fehlende Werte ersetzen#

Vorteile

  • Verwendung möglichst aller Daten

  • genaue Analyse der fehlenden Werte kann zu neuen Erkenntnissen führen

Nachteile

  • keine Standard Methodik

  • eventuelle Verfälschung der Daten

Duplikate#

.duplicated() findet Zeilen mit genau den gleichen Werte.

Mit drop_duplicates() können diese entfernt werden.

data.duplicated().sum()
0
data.drop_duplicates()
Year Country Disaster Subroup Disaster Type Disaster Subtype Total Events Total Affected Total Deaths Total Damage (USD, original)
1 2000 Afghanistan Climatological Drought Drought 1 2580000.00 37.00 50000.00
2 2000 Algeria Hydrological Flood Flash flood 2 105.00 37.00 1639896721.42
3 2000 Algeria Hydrological Flood Flood (General) 2 100.00 7.00 1639896721.42
4 2000 Algeria Meteorological Storm Storm (General) 1 10.00 4.00 1639896721.42
5 2000 Angola Hydrological Flood Flood (General) 3 9011.00 15.00 1639896721.42
... ... ... ... ... ... ... ... ... ...
5834 2023 Viet Nam Meteorological Storm Tropical cyclone 1 3.00 1.00 1639896721.42
5835 2023 Yemen Hydrological Flood Flood (General) 1 169035.00 39.00 1639896721.42
5836 2023 Zambia Hydrological Flood Flash flood 1 154608.00 0.00 1639896721.42
5837 2023 Zambia Hydrological Flood Flood (General) 1 22000.00 0.00 1639896721.42
5838 2023 Zimbabwe Meteorological Storm Tropical cyclone 1 2.00 2.00 1639896721.42

5838 rows × 9 columns

Daten speichern & Neu laden#

data.to_csv('../../data/clean_emdat.csv')