Datenanalyse Teil II

Datenanalyse Teil II#

Natalie Widmann#

Wintersemester 2023 / 2024

Aggregated figures for Natural Disasters in EM-DAT#

Link: https://data.humdata.org/dataset/emdat-country-profiles

In 1988, the Centre for Research on the Epidemiology of Disasters (CRED) launched the Emergency Events Database (EM-DAT). EM-DAT was created with the initial support of the World Health Organisation (WHO) and the Belgian Government.

The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.

# Install a pip package im Jupyter Notebook
!pip3 install pandas
!pip3 install openpyxl

Requirement already satisfied: pandas in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (2.1.3)

Requirement already satisfied: numpy<2,>=1.23.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (1.26.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

Requirement already satisfied: openpyxl in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (3.1.2)
Requirement already satisfied: et-xmlfile in /home/natalie/Dokumente/Datenjournalismus in Python/Code/.venv/lib/python3.11/site-packages (from openpyxl) (1.1.0)

import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)

data = pd.read_excel('../../data/emdat.xlsx', engine="openpyxl")

data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
0	#date +occurred	#country +name	#country +code	#cause +group	#cause +subgroup	#cause +type	#cause +subtype	#frequency	#affected +ind	#affected +ind +killed	NaN	#value +usd	NaN
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.00	84975	58.84
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	2	105	37	NaN	NaN	58.84
3	2000	Algeria	DZA	Natural	Hydrological	Flood	Flood (General)	2	100	7	NaN	NaN	58.84
4	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	58.84
...	...	...	...	...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	1	3	1	NaN	NaN	NaN
5835	2023	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	1	169035	39	NaN	NaN	NaN
5836	2023	Zambia	ZMB	Natural	Hydrological	Flood	Flash flood	1	154608	NaN	NaN	NaN	NaN
5837	2023	Zambia	ZMB	Natural	Hydrological	Flood	Flood (General)	1	22000	NaN	NaN	NaN	NaN
5838	2023	Zimbabwe	ZWE	Natural	Meteorological	Storm	Tropical cyclone	1	NaN	2	NaN	NaN	NaN

5839 rows × 13 columns

Data Cleaning#

Entferne erste Zeile

data = data.drop(index=0)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   object 
 1   Country                       5838 non-null   object 
 2   ISO                           5838 non-null   object 
 3   Disaster Group                5838 non-null   object 
 4   Disaster Subroup              5838 non-null   object 
 5   Disaster Type                 5838 non-null   object 
 6   Disaster Subtype              5838 non-null   object 
 7   Total Events                  5838 non-null   object 
 8   Total Affected                4692 non-null   object 
 9   Total Deaths                  4118 non-null   object 
 10  Total Damage (USD, original)  2032 non-null   float64
 11  Total Damage (USD, adjusted)  1999 non-null   object 
 12  CPI                           5648 non-null   float64
dtypes: float64(2), object(11)
memory usage: 593.1+ KB

Änderung des Datentyps für integeger und float Spalten

cols = ['Year', 'Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, adjusted)']
for col in cols:
    data[col] = pd.to_numeric(data[col])

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   int64  
 1   Country                       5838 non-null   object 
 2   ISO                           5838 non-null   object 
 3   Disaster Group                5838 non-null   object 
 4   Disaster Subroup              5838 non-null   object 
 5   Disaster Type                 5838 non-null   object 
 6   Disaster Subtype              5838 non-null   object 
 7   Total Events                  5838 non-null   int64  
 8   Total Affected                4692 non-null   float64
 9   Total Deaths                  4118 non-null   float64
 10  Total Damage (USD, original)  2032 non-null   float64
 11  Total Damage (USD, adjusted)  1999 non-null   float64
 12  CPI                           5648 non-null   float64
dtypes: float64(5), int64(2), object(6)
memory usage: 593.1+ KB

Entferne irrelevante Spalten, hier: ISO, Disaster Group, Total Damage (USD, adjusted)

cols = ['ISO', 'Disaster Group', 'Total Damage (USD, adjusted)', 'CPI']
data.drop(cols, axis=1, inplace=True)

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	2	105.00	37.00	NaN
3	2000	Algeria	Hydrological	Flood	Flood (General)	2	100.00	7.00	NaN
4	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	NaN
5	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	NaN
...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	Meteorological	Storm	Tropical cyclone	1	3.00	1.00	NaN
5835	2023	Yemen	Hydrological	Flood	Flood (General)	1	169035.00	39.00	NaN
5836	2023	Zambia	Hydrological	Flood	Flash flood	1	154608.00	NaN	NaN
5837	2023	Zambia	Hydrological	Flood	Flood (General)	1	22000.00	NaN	NaN
5838	2023	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	NaN	2.00	NaN

5838 rows × 9 columns

Fehlende Werte - NaNs Values#

Wo fehlen Werte im Datensatz? Wie viele fehlen?

len(data)

Wie viele Werte pro Spalte sind NaN?#

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5838 entries, 1 to 5838
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          5838 non-null   int64  
 1   Country                       5838 non-null   object 
 2   Disaster Subroup              5838 non-null   object 
 3   Disaster Type                 5838 non-null   object 
 4   Disaster Subtype              5838 non-null   object 
 5   Total Events                  5838 non-null   int64  
 6   Total Affected                4692 non-null   float64
 7   Total Deaths                  4118 non-null   float64
 8   Total Damage (USD, original)  2032 non-null   float64
dtypes: float64(3), int64(2), object(4)
memory usage: 410.6+ KB

data['Total Affected'].isnull()

     False
     False
     False
     False
     False
        ...  
  False
  False
  False
  False
   True
Name: Total Affected, Length: 5838, dtype: bool

data['Total Affected'].isnull().sum()

data.isnull()

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	True
3	False	False	False	False	False	False	False	False	True
4	False	False	False	False	False	False	False	False	True
5	False	False	False	False	False	False	False	False	True
...	...	...	...	...	...	...	...	...	...
5834	False	False	False	False	False	False	False	False	True
5835	False	False	False	False	False	False	False	False	True
5836	False	False	False	False	False	False	False	True	True
5837	False	False	False	False	False	False	False	True	True
5838	False	False	False	False	False	False	True	False	True

5838 rows × 9 columns

data.isnull().sum()

Year                               0
Country                            0
Disaster Subroup                   0
Disaster Type                      0
Disaster Subtype                   0
Total Events                       0
Total Affected                  1146
Total Deaths                    1720
Total Damage (USD, original)    3806
dtype: int64

Möglichkeiten zum Umgang mit fehlenden Werte#

Zeilen mit fehlenden Werten entfernen
fehlende Werte ersetzen

Alle Zeilen mit NaN Werten entfernen#

dropna()

clean_data = data.dropna()
clean_data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
6	2000	Angola	Hydrological	Flood	Riverine flood	1	70000.00	31.00	10000000.00
9	2000	Argentina	Hydrological	Flood	Flood (General)	3	31840.00	6.00	300000000.00
18	2000	Australia	Meteorological	Storm	Tropical cyclone	3	1350.00	1.00	150000000.00
22	2000	Azerbaijan	Geophysical	Earthquake	Ground movement	1	3294.00	31.00	10000000.00
...	...	...	...	...	...	...	...	...	...
5824	2023	United States of America	Climatological	Wildfire	Wildfire (General)	2	8497.00	130.00	3000000000.00
5827	2023	United States of America	Meteorological	Extreme temperature	Severe winter conditions	1	25.00	10.00	380000000.00
5828	2023	United States of America	Meteorological	Storm	Severe weather	1	551.00	3.00	250000000.00
5829	2023	United States of America	Meteorological	Storm	Storm (General)	3	912.00	37.00	7000000000.00
5830	2023	United States of America	Meteorological	Storm	Tornado	5	14217.00	75.00	19560000000.00

1500 rows × 9 columns

Entferne nur Zeilen die in einer bestimmten Spalte NaN Werte haben

clean_data = data.dropna(subset=['Disaster Subtype', 'Total Affected'])
clean_data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	2	105.00	37.00	NaN
3	2000	Algeria	Hydrological	Flood	Flood (General)	2	100.00	7.00	NaN
4	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	NaN
5	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	NaN
...	...	...	...	...	...	...	...	...	...
5833	2023	Viet Nam	Hydrological	Flood	Flood (General)	1	463.00	4.00	NaN
5834	2023	Viet Nam	Meteorological	Storm	Tropical cyclone	1	3.00	1.00	NaN
5835	2023	Yemen	Hydrological	Flood	Flood (General)	1	169035.00	39.00	NaN
5836	2023	Zambia	Hydrological	Flood	Flash flood	1	154608.00	NaN	NaN
5837	2023	Zambia	Hydrological	Flood	Flood (General)	1	22000.00	NaN	NaN

4692 rows × 9 columns

Fehlende Werte entfernen#

Vorteile

einfache Handhabung
verständliche Methodik

Nachteile

ein großer Teil der Daten wird nicht ausgewertet
fehlende Werte können eine ganz andere Geschichte erzählen
die Konsistenz über unterschiedliche Fragestellungen hinweg muss sichergestellt sein

Fehlende Werte ersetzen#

Mit .fillna() können alle NaN Werte in einem DataFrame oder einer Series automatisch ersetzt werden.

Das Argument inplace=True wendet die Methode direkt auf den DataFrame an.

data.fillna(0)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	2	105.00	37.00	0.00
3	2000	Algeria	Hydrological	Flood	Flood (General)	2	100.00	7.00	0.00
4	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	0.00
5	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	0.00
...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	Meteorological	Storm	Tropical cyclone	1	3.00	1.00	0.00
5835	2023	Yemen	Hydrological	Flood	Flood (General)	1	169035.00	39.00	0.00
5836	2023	Zambia	Hydrological	Flood	Flash flood	1	154608.00	0.00	0.00
5837	2023	Zambia	Hydrological	Flood	Flood (General)	1	22000.00	0.00	0.00
5838	2023	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	0.00	2.00	0.00

5838 rows × 9 columns

Welche Werte für Total Affected, Total Deaths oder Total Damage?

0
arithmetisches Mittel
Median
Berechnung aus anderen Spalten

data["Total Deaths"].fillna(0, inplace=True)

data["Total Affected"].fillna(data['Total Deaths'], inplace=True)

data['Total Damage (USD, original)'].mean()

1639896721.4173229

data['Total Damage (USD, original)'] = data['Total Damage (USD, original)'].fillna(data['Total Damage (USD, original)'].mean())

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	2	105.00	37.00	1639896721.42
3	2000	Algeria	Hydrological	Flood	Flood (General)	2	100.00	7.00	1639896721.42
4	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	1639896721.42
5	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	1639896721.42
...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	Meteorological	Storm	Tropical cyclone	1	3.00	1.00	1639896721.42
5835	2023	Yemen	Hydrological	Flood	Flood (General)	1	169035.00	39.00	1639896721.42
5836	2023	Zambia	Hydrological	Flood	Flash flood	1	154608.00	0.00	1639896721.42
5837	2023	Zambia	Hydrological	Flood	Flood (General)	1	22000.00	0.00	1639896721.42
5838	2023	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	2.00	2.00	1639896721.42

5838 rows × 9 columns

Fehlende Werte ersetzen#

Vorteile

Verwendung möglichst aller Daten
genaue Analyse der fehlenden Werte kann zu neuen Erkenntnissen führen

Nachteile

keine Standard Methodik
eventuelle Verfälschung der Daten

Duplikate#

.duplicated() findet Zeilen mit genau den gleichen Werte.

Mit drop_duplicates() können diese entfernt werden.

data.duplicated().sum()

data.drop_duplicates()

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)
1	2000	Afghanistan	Climatological	Drought	Drought	1	2580000.00	37.00	50000.00
2	2000	Algeria	Hydrological	Flood	Flash flood	2	105.00	37.00	1639896721.42
3	2000	Algeria	Hydrological	Flood	Flood (General)	2	100.00	7.00	1639896721.42
4	2000	Algeria	Meteorological	Storm	Storm (General)	1	10.00	4.00	1639896721.42
5	2000	Angola	Hydrological	Flood	Flood (General)	3	9011.00	15.00	1639896721.42
...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	Meteorological	Storm	Tropical cyclone	1	3.00	1.00	1639896721.42
5835	2023	Yemen	Hydrological	Flood	Flood (General)	1	169035.00	39.00	1639896721.42
5836	2023	Zambia	Hydrological	Flood	Flash flood	1	154608.00	0.00	1639896721.42
5837	2023	Zambia	Hydrological	Flood	Flood (General)	1	22000.00	0.00	1639896721.42
5838	2023	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	2.00	2.00	1639896721.42

5838 rows × 9 columns

Daten speichern & Neu laden#

data.to_csv('../../data/clean_emdat.csv')

Datenanalyse Teil II

Contents

Datenanalyse Teil II#

Natalie Widmann#

Aggregated figures for Natural Disasters in EM-DAT#

Data Cleaning#

Fehlende Werte - NaNs Values#

Wie viele Werte pro Spalte sind NaN?#

Möglichkeiten zum Umgang mit fehlenden Werte#

Alle Zeilen mit NaN Werten entfernen#

Fehlende Werte entfernen#

Fehlende Werte ersetzen#

Fehlende Werte ersetzen#

Duplikate#

Daten speichern & Neu laden#