Explorative Datenanalyse

Explorative Datenanalyse#

Ziele#

Kennenlernen der unterschiedlichen Schritte bei der Datenanalyse
kritisches Hinterfragen der Datenqualität
unterschiedliche Datenformate (csv, json, excel, txt) einlesen und speichern
strukturierte Daten mit Pandas bearbeiten und analysieren
Erstellung einfacher Visualisierungen
Verwendung weiterer Python Packages

Was sind Daten?#

Strukturierte Daten#

Strukturierte Daten sind gut organisiert und so formattiert, dass es einfach ist sie zu durchsuchen, sie maschinell zu lesen oder zu verarbeiten. Das einfachste Beispiel ist eine Tabelle in der jede Spalte eine Kategorie festlegt.

Unstrukturierte Daten#

Im Gegensatz dazu sind unstrukturierte Daten nicht in einem bestimmten Format oder einer festgelegten Struktur verfügbar. Dazu zählen Texte, Bilder, Social Media Feeds, aber auch Audio Files, etc.

Semi-Strukturierte Daten#

Semi-strukturierte Daten bilden eine Mischform. Beispielsweise eine Tabelle mit E-Mail Daten, in der Empfänger, Betreff, Datum und Absender strukturierte Informationen enthalten, der eigentliche Text jedoch unstrukturiert ist.

Unterschiedliche Ebenen an Information in Daten#

Daten

Quelle: Hot Butter Studio Brandon Rossen and Karyn Lurie (2012).

Datenpipeline#

Datenpipeline

Explorative Datenanalyse: Weltweite Naturkatastrophen#

Idee, Daten finden & Verifikation#

Datenpipeline1

Aggregated figures for Natural Disasters in EM-DAT#

Link: https://data.humdata.org/dataset/emdat-country-profiles

In 1988, the Centre for Research on the Epidemiology of Disasters (CRED) launched the Emergency Events Database (EM-DAT). EM-DAT was created with the initial support of the World Health Organisation (WHO) and the Belgian Government.

The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalise decision making for disaster preparedness, as well as provide an objective base for vulnerability assessment and priority setting.

EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. The database is compiled from various sources, including UN agencies, non-governmental organisations, insurance companies, research institutes and press agencies.

import pandas as pd
data = pd.read_excel('../../data/emdat.xlsx', engine="openpyxl")

data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
0	#date +occurred	#country +name	#country +code	#cause +group	#cause +subgroup	#cause +type	#cause +subtype	#frequency	#affected +ind	#affected +ind +killed	NaN	#value +usd	NaN
1	2000	Afghanistan	AFG	Natural	Climatological	Drought	Drought	1	2580000	37	50000.0	84975	58.840648
2	2000	Algeria	DZA	Natural	Hydrological	Flood	Flash flood	2	105	37	NaN	NaN	58.840648
3	2000	Algeria	DZA	Natural	Hydrological	Flood	Flood (General)	2	100	7	NaN	NaN	58.840648
4	2000	Algeria	DZA	Natural	Meteorological	Storm	Storm (General)	1	10	4	NaN	NaN	58.840648
...	...	...	...	...	...	...	...	...	...	...	...	...	...
5834	2023	Viet Nam	VNM	Natural	Meteorological	Storm	Tropical cyclone	1	3	1	NaN	NaN	NaN
5835	2023	Yemen	YEM	Natural	Hydrological	Flood	Flood (General)	1	169035	39	NaN	NaN	NaN
5836	2023	Zambia	ZMB	Natural	Hydrological	Flood	Flash flood	1	154608	NaN	NaN	NaN	NaN
5837	2023	Zambia	ZMB	Natural	Hydrological	Flood	Flood (General)	1	22000	NaN	NaN	NaN	NaN
5838	2023	Zimbabwe	ZWE	Natural	Meteorological	Storm	Tropical cyclone	1	NaN	2	NaN	NaN	NaN

5839 rows × 13 columns

Datenexploration und -bereinigung#

Datenpipeline1

Überblick über die Daten#

# head() gibt die ersten 5 Zeilen aus
data.head()

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
0	#date +occurred	#country +name	#country +code	#cause +group	#cause +subgroup	#cause +type	#cause +subtype	#frequency	#affected +ind	#affected +ind +killed	NaN	#value +usd	NaN
1	1900	Cabo Verde	CPV	Natural	Climatological	Drought	Drought	1	NaN	11000	NaN	NaN	3.077091
2	1900	India	IND	Natural	Climatological	Drought	Drought	1	NaN	1250000	NaN	NaN	3.077091
3	1900	Jamaica	JAM	Natural	Hydrological	Flood	NaN	1	NaN	300	NaN	NaN	3.077091
4	1900	Japan	JPN	Natural	Geophysical	Volcanic activity	Ash fall	1	NaN	30	NaN	NaN	3.077091

Wie groß ist der Datensatz? Wie viele Zeilen und wie viele Spalten sind vorhanden?

data.shape

(10343, 13)

print(f'Anzahl an Zeilen: {data.shape[0]}')
print(f'Anzahl an Spalten: {data.shape[1]}')

Anzahl an Zeilen: 10343
Anzahl an Spalten: 13

Die Spaltennamen

print(data.columns)

Index(['Year', 'Country', 'ISO', 'Disaster Group', 'Disaster Subroup',
       'Disaster Type', 'Disaster Subtype', 'Total Events', 'Total Affected',
       'Total Deaths', 'Total Damage (USD, original)',
       'Total Damage (USD, adjusted)', 'CPI'],
      dtype='object')

info() für mehr Infos über die Spalten

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10343 entries, 0 to 10342
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          10343 non-null  object 
 1   Country                       10343 non-null  object 
 2   ISO                           10343 non-null  object 
 3   Disaster Group                10343 non-null  object 
 4   Disaster Subroup              10343 non-null  object 
 5   Disaster Type                 10343 non-null  object 
 6   Disaster Subtype              8226 non-null   object 
 7   Total Events                  10343 non-null  object 
 8   Total Affected                7507 non-null   object 
 9   Total Deaths                  7318 non-null   object 
 10  Total Damage (USD, original)  3796 non-null   float64
 11  Total Damage (USD, adjusted)  3767 non-null   object 
 12  CPI                           10149 non-null  float64
dtypes: float64(2), object(11)
memory usage: 1.0+ MB
None

describe() zeigt die grundlegenden statistischen Eigenschaften von Spalten mit numerischem Datentyp, also int und float.

Die Methode berechnet:

die Anzahl an fehlenden Werten
Durchschnitt
Standardabweichung
Zahlenrange
Media
0.25 und 0.75 Quartile

data.describe()

	Total Damage (USD, original)	CPI
count	3.796000e+03	10149.000000
mean	1.058358e+09	58.812266
std	6.396224e+09	27.308606
min	2.000000e+03	3.077091
25%	9.988750e+06	40.450409
50%	6.700000e+07	63.549547
75%	4.000000e+08	80.472285
max	2.100000e+11	100.000000

.unique() zeigt die unterschiedlichen Werte einer Spalte an

data['Year'].unique()

array(['#date +occurred', 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907,
       1908, 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918,
       1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929,
       1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940,
       1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951,
       1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962,
       1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973,
       1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984,
       1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995,
       1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
       2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
       2018, 2019, 2020, 2021, 2022], dtype=object)

Data Cleaning Teil I: erste Zeile im DataFrame entfernen#

data.index

RangeIndex(start=0, stop=10343, step=1)

data = data.drop(index=0)

data

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
1	1900	Cabo Verde	CPV	Natural	Climatological	Drought	Drought	1	NaN	11000	NaN	NaN	3.077091
2	1900	India	IND	Natural	Climatological	Drought	Drought	1	NaN	1250000	NaN	NaN	3.077091
3	1900	Jamaica	JAM	Natural	Hydrological	Flood	NaN	1	NaN	300	NaN	NaN	3.077091
4	1900	Japan	JPN	Natural	Geophysical	Volcanic activity	Ash fall	1	NaN	30	NaN	NaN	3.077091
5	1900	Turkey	TUR	Natural	Geophysical	Earthquake	Ground movement	1	NaN	140	NaN	NaN	3.077091
...	...	...	...	...	...	...	...	...	...	...	...	...	...
10338	2022	Yemen	YEM	Natural	Hydrological	Flood	Flash flood	1	3400	13	NaN	NaN	NaN
10339	2022	South Africa	ZAF	Natural	Hydrological	Flood	NaN	7	143119	562	3.164000e+09	NaN	NaN
10340	2022	Zambia	ZMB	Natural	Hydrological	Flood	NaN	1	15000	3	NaN	NaN	NaN
10341	2022	Zimbabwe	ZWE	Natural	Hydrological	Flood	NaN	1	NaN	NaN	NaN	NaN	NaN
10342	2022	Zimbabwe	ZWE	Natural	Meteorological	Storm	Tropical cyclone	1	3000	NaN	NaN	NaN	NaN

10342 rows × 13 columns

Datentypen abfragen und anpassen#

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10342 entries, 1 to 10342
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          10342 non-null  object 
 1   Country                       10342 non-null  object 
 2   ISO                           10342 non-null  object 
 3   Disaster Group                10342 non-null  object 
 4   Disaster Subroup              10342 non-null  object 
 5   Disaster Type                 10342 non-null  object 
 6   Disaster Subtype              8225 non-null   object 
 7   Total Events                  10342 non-null  object 
 8   Total Affected                7506 non-null   object 
 9   Total Deaths                  7317 non-null   object 
 10  Total Damage (USD, original)  3796 non-null   float64
 11  Total Damage (USD, adjusted)  3766 non-null   object 
 12  CPI                           10149 non-null  float64
dtypes: float64(2), object(11)
memory usage: 1.0+ MB

# Datentyp Abfrage mit dem Attribut
data['Year'].dtype

dtype('O')

Data Cleaning Teil II: Datentypen umwandeln#

# Umwandlung des Datentyp
data["Year"] = pd.to_numeric(data["Year"])
data['Year'].dtype

dtype('int64')

# Auf alle integer und float Spalten anwenden
cols = ['Total Events', 'Total Affected', 'Total Deaths', 'Total Damage (USD, adjusted)']
for col in cols:
    data[col] = pd.to_numeric(data[col])

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10342 entries, 1 to 10342
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          10342 non-null  int64  
 1   Country                       10342 non-null  object 
 2   ISO                           10342 non-null  object 
 3   Disaster Group                10342 non-null  object 
 4   Disaster Subroup              10342 non-null  object 
 5   Disaster Type                 10342 non-null  object 
 6   Disaster Subtype              8225 non-null   object 
 7   Total Events                  10342 non-null  int64  
 8   Total Affected                7506 non-null   float64
 9   Total Deaths                  7317 non-null   float64
 10  Total Damage (USD, original)  3796 non-null   float64
 11  Total Damage (USD, adjusted)  3766 non-null   float64
 12  CPI                           10149 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 1.0+ MB

Überblick über die numerischen Daten#

data.describe()

	Year	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
count	10342.000000	10342.000000	7.506000e+03	7.317000e+03	3.796000e+03	3.766000e+03	10149.000000
mean	1995.349836	1.444885	1.123573e+06	3.122315e+03	1.058358e+09	1.591526e+09	58.812266
std	21.975130	1.236398	9.804163e+06	7.284039e+04	6.396224e+09	8.361409e+09	27.308606
min	1900.000000	1.000000	1.000000e+00	1.000000e+00	2.000000e+03	2.287000e+03	3.077091
25%	1986.000000	1.000000	1.200000e+03	6.000000e+00	9.988750e+06	1.828097e+07	40.450409
50%	2000.000000	1.000000	1.141400e+04	2.300000e+01	6.700000e+07	1.360379e+08	63.549547
75%	2011.000000	1.000000	1.169720e+05	9.100000e+01	4.000000e+08	7.188042e+08	80.472285
max	2022.000000	20.000000	3.300000e+08	3.700000e+06	2.100000e+11	2.529734e+11	100.000000

Überblick über die Objekt Daten#

# Unterschiedliche Länder
countries = data['Country'].unique()
print(countries)

['Cabo Verde' 'India' 'Jamaica' 'Japan' 'Turkey'
 'United States of America (the)' 'Azerbaijan' 'China' 'Guatemala'
 'Myanmar' 'Martinique' 'Soviet Union' 'Saint Vincent and the Grenadines'
 'Canada' 'Comoros (the)' 'Iran (Islamic Republic of)' 'Israel'
 'Niger (the)' 'Bangladesh' 'Greece' 'Taiwan (Province of China)'
 'Albania' 'Italy' 'Philippines (the)' 'Belgium' 'Chile' 'Colombia'
 'Hong Kong' 'Romania' 'France' 'Haiti' 'Morocco' 'Pakistan' 'Portugal'
 'Burkina Faso' 'Costa Rica' 'Algeria' 'Gambia (the)' 'Guinea-Bissau'
 'Mali' 'Mauritania' 'Senegal' 'Chad' 'Kazakhstan' 'Mexico' 'Tajikistan'
 'Indonesia' 'Peru' 'Tokelau' 'Puerto Rico' 'Anguilla' 'Argentina'
 'Germany Fed Rep' 'Ecuador' 'Bahamas (the)' 'Cuba' 'Egypt' 'Jordan'
 'Bulgaria' 'Guadeloupe' 'Saint Kitts and Nevis' 'Montserrat' 'Poland'
 'New Zealand' 'Dominica' 'Dominican Republic (the)' 'Nicaragua' 'Armenia'
 'Belize' 'Fiji' 'Honduras' 'Solomon Islands' 'Trinidad and Tobago'
 'El Salvador' 'Korea (the Republic of)' 'Norway' 'Papua New Guinea'
 'Australia' 'Ghana' 'New Caledonia' 'Vanuatu' 'Spain' 'Yemen Arab Rep'
 'Cook Islands (the)' 'Yugoslavia' 'Libya' 'Azores Islands' 'Tonga'
 'Bermuda' 'Brazil' 'Réunion' 'Antigua and Barbuda'
 'Venezuela (Bolivarian Republic of)' 'Switzerland' 'Austria'
 'United Kingdom of Great Britain and Northern Ireland (the)'
 'South Africa' 'Cyprus' 'Netherlands (the)' 'Viet Nam' 'Afghanistan'
 'Netherlands Antilles' 'Iraq' 'Nepal' 'Barbados' 'Lebanon' 'Mozambique'
 'Sri Lanka' 'Mongolia' 'Canary Is' 'Tunisia' 'French Polynesia' 'Niue'
 'Saint Lucia' 'Mauritius' 'Ethiopia' 'Somalia' 'Guam' 'Thailand'
 'Grenada' 'Paraguay' 'Kenya' 'Panama' 'Saudi Arabia'
 'Tanzania, United Republic of' 'Samoa' 'Bolivia (Plurinational State of)'
 'Botswana' 'Malaysia' 'Wallis and Futuna' 'American Samoa'
 'Congo (the Democratic Republic of the)'
 "Lao People's Democratic Republic (the)" 'Togo' 'Uganda' 'Congo (the)'
 'Malawi' 'Syrian Arab Republic' 'Uruguay' 'Lesotho' 'Madagascar' 'Benin'
 'Suriname' 'Hungary' 'Cameroon' 'Guyana' 'Kiribati' 'Tuvalu'
 'Yemen P Dem Rep' 'Central African Republic' 'Iceland' 'Czechoslovakia'
 'Rwanda' 'Sudan (the)' 'Sierra Leone' 'Zimbabwe' 'Denmark' 'Sweden'
 'Djibouti' 'Oman' 'Nigeria' 'Zambia' 'Côte d’Ivoire' 'Guinea' 'Liberia'
 'Angola' 'Namibia' 'Eswatini' 'Ireland' 'Luxembourg'
 'Sao Tome and Principe' 'Germany Dem Rep'
 'Turks and Caicos Islands (the)' 'Micronesia (Federated States of)'
 'Cambodia' 'Maldives' "Korea (the Democratic People's Republic of)"
 'Gabon' 'Burundi' 'Virgin Island (U.S.)' 'Germany' 'Finland' 'Georgia'
 'Marshall Islands (the)' 'Yemen' 'Kyrgyzstan' 'Lithuania'
 'Russian Federation (the)' 'Serbia Montenegro' 'Ukraine' 'Uzbekistan'
 'Belarus' 'Eritrea' 'Macao' 'Macedonia (the former Yugoslav Republic of)'
 'Turkmenistan' 'Bhutan' 'Moldova (the Republic of)'
 'Czech Republic (the)' 'French Guiana' 'Croatia'
 'Virgin Island (British)' 'Kuwait' 'Slovakia' 'Seychelles'
 'Brunei Darussalam' 'Slovenia' 'Bosnia and Herzegovina' 'Latvia'
 'Cayman Islands (the)' 'Saint Helena, Ascension and Tristan da Cunha'
 'Timor-Leste' 'Northern Mariana Islands (the)' 'Estonia' 'Montenegro'
 'Serbia' 'Palestine, State of' 'South Sudan' 'Palau'
 'United Arab Emirates (the)' 'Saint Barthélemy'
 'Saint Martin (French Part)' 'Sint Maarten (Dutch part)' 'Qatar'
 'Isle of Man']

print(len(countries))

# Vorkommen von Ländern der Liste
'Germany' in countries

True

# Vorkommen von Deutschland
for country in countries:
    if 'german' in country.lower():
        print(country)

Germany Fed Rep
Germany Dem Rep
Germany

Data Cleaning Teil III: `Germany` vereinheitlichen#

Schritt: Schreibe eine Funktion die dies bereinigt.

Eingabe: beliebiger Ländername (bspw. Indonesia, Belgium, Germany Fed Rep, etc.)

Ausgabe: der Ländername oder Germany falls einer der oben gelisteten Namen eingegeben wurde

Schritt: Wende diese Funktion auf alle Ländernamen in der Spalte Country an

# Funktion
def clean_country(land):
    if 'german' in land.lower():
        return 'Germany'
    else:
        return land

# Test der Funktion mit unterschiedlichen Werten
text = 'Germany Fed Rep'
clean_text = clean_country(text)
print(clean_text)

text = 'Mexico'
clean_text = clean_country(text)
print(clean_text)

Germany
Mexico

Anwendung der Funktion auf alle Werte einer Spalte

mit apply() kann eine Funktion auf eine komplette Spalte oder Zeile des Dataframes angewendet werden

data['Country'] = data['Country'].apply(clean_country)

data[data['Country'] == 'Germany']

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
140	1920	Germany	DFR	Natural	Hydrological	Flood	NaN	1	NaN	NaN	2.000000e+07	2.699188e+08	7.409635
170	1925	Germany	DFR	Natural	Hydrological	Flood	NaN	1	NaN	NaN	1.900000e+07	2.929157e+08	6.486508
183	1926	Germany	DFR	Natural	Hydrological	Flood	NaN	1	NaN	NaN	3.100000e+07	4.734234e+08	6.548049
219	1929	Germany	DFR	Natural	Meteorological	Storm	Convective storm	1	NaN	NaN	5.500000e+07	8.659896e+08	6.351116
549	1953	Germany	DFR	Natural	Meteorological	Storm	Convective storm	1	NaN	NaN	2.500000e+07	2.526293e+08	9.895924
...	...	...	...	...	...	...	...	...	...	...	...	...	...
9469	2019	Germany	DEU	Natural	Meteorological	Storm	Convective storm	1	NaN	1.0	NaN	NaN	94.349092
9708	2020	Germany	DEU	Natural	Meteorological	Storm	Extra-tropical storm	1	33.0	NaN	NaN	NaN	95.512967
9952	2021	Germany	DEU	Natural	Hydrological	Flood	NaN	1	1000.0	197.0	4.000000e+10	4.000000e+10	100.000000
9953	2021	Germany	DEU	Natural	Meteorological	Storm	Convective storm	2	604.0	1.0	NaN	NaN	100.000000
10188	2022	Germany	DEU	Natural	Meteorological	Storm	Extra-tropical storm	3	2.0	7.0	NaN	NaN	NaN

87 rows × 13 columns

len(data['Country'].unique())

Weiter mit der Datenexploration#

 data['Disaster Group'].unique()

array(['Natural'], dtype=object)

–> alle Zeilen besitzen in der Spalte Disaster Group den Wert Natural. Das heißt für die weitere Analyse steckt keine relevante Information in dieser Spalte.

.value_counts() zeigt wie oft eine Spalte die unterschiedlichen Werte annimmt.

data['Disaster Subroup'].value_counts()

Hydrological      4442
Meteorological    3304
Geophysical       1355
Climatological    1148
Biological          93
Name: Disaster Subroup, dtype: int64

Mit dem Argument normalize=True wird das Vorkommen der Werte automatisch ins Verhältnis gesetzt.

data['Disaster Subroup'].value_counts(normalize=True)

Hydrological      0.429511
Meteorological    0.319474
Geophysical       0.131019
Climatological    0.111004
Biological        0.008992
Name: Disaster Subroup, dtype: float64

data['Disaster Type'].value_counts()

Flood                    3802
Storm                    2746
Earthquake               1078
Drought                   774
Landslide                 640
Extreme temperature       557
Wildfire                  372
Volcanic activity         230
Insect infestation         92
Mass movement (dry)        47
Glacial lake outburst       2
Fog                         1
Animal accident             1
Name: Disaster Type, dtype: int64

data['Disaster Type'].value_counts(normalize=True)

Flood                    0.367627
Storm                    0.265519
Earthquake               0.104235
Drought                  0.074840
Landslide                0.061884
Extreme temperature      0.053858
Wildfire                 0.035970
Volcanic activity        0.022239
Insect infestation       0.008896
Mass movement (dry)      0.004545
Glacial lake outburst    0.000193
Fog                      0.000097
Animal accident          0.000097
Name: Disaster Type, dtype: float64

Dataframes Sortieren#

Dataframes können anhand einer oder meherer Spalten sortiert werden.

data.sort_values(by="Total Affected")

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
7790	2011	Italy	ITA	Natural	Hydrological	Landslide	NaN	1	1.0	3.0	NaN	NaN	83.012674
6781	2007	China	CHN	Natural	Hydrological	Landslide	Landslide	1	1.0	33.0	NaN	NaN	76.518679
6300	2005	Spain	ESP	Natural	Climatological	Wildfire	Forest fire	1	1.0	11.0	2.050000e+09	2.844401e+09	72.071410
9866	2020	Taiwan (Province of China)	TWN	Natural	Meteorological	Storm	Tropical cyclone	1	1.0	1.0	NaN	NaN	95.512967
6767	2007	Barbados	BRB	Natural	Geophysical	Earthquake	Ground movement	1	1.0	NaN	NaN	NaN	76.518679
...	...	...	...	...	...	...	...	...	...	...	...	...	...
10293	2022	Puerto Rico	PRI	Natural	Meteorological	Storm	Tropical cyclone	1	NaN	25.0	NaN	NaN	NaN
10294	2022	Portugal	PRT	Natural	Climatological	Drought	Drought	1	NaN	NaN	NaN	NaN	NaN
10324	2022	Uruguay	URY	Natural	Hydrological	Flood	NaN	1	NaN	NaN	NaN	NaN	NaN
10325	2022	United States of America (the)	USA	Natural	Climatological	Drought	Drought	1	NaN	NaN	3.000000e+09	NaN	NaN
10341	2022	Zimbabwe	ZWE	Natural	Hydrological	Flood	NaN	1	NaN	NaN	NaN	NaN	NaN

10342 rows × 13 columns

# 10 schlimmsten Naturkatastrophen
data.sort_values(by="Total Affected", ascending=False).head(n=10)

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
8610	2015	India	IND	Natural	Climatological	Drought	Drought	1	330000000.0	NaN	3.000000e+09	3.429750e+09	87.469932
2721	1987	India	IND	Natural	Climatological	Drought	Drought	1	300000000.0	300.0	NaN	NaN	41.932736
5563	2002	India	IND	Natural	Climatological	Drought	Drought	1	300000000.0	NaN	9.107220e+08	1.371942e+09	66.381964
4482	1998	China	CHN	Natural	Hydrological	Flood	Riverine flood	3	241134300.0	3760.0	3.099870e+10	5.152931e+10	60.157411
3274	1991	China	CHN	Natural	Hydrological	Flood	Riverine flood	2	210232227.0	1835.0	8.030000e+09	1.597665e+10	50.260853
1319	1972	India	IND	Natural	Climatological	Drought	Drought	1	200000000.0	NaN	1.000000e+08	6.479944e+08	15.432233
5778	2003	China	CHN	Natural	Hydrological	Flood	Riverine flood	6	155924986.0	662.0	1.532964e+10	2.258048e+10	67.888896
4107	1996	China	CHN	Natural	Hydrological	Flood	Riverine flood	2	154634000.0	3975.0	1.891450e+10	3.267616e+10	57.884706
7490	2010	China	CHN	Natural	Hydrological	Flood	Riverine flood	5	140194136.0	1911.0	1.817100e+10	2.258044e+10	80.472285
3618	1993	India	IND	Natural	Hydrological	Flood	NaN	1	128000000.0	827.0	7.000000e+09	1.313035e+10	53.311620

# Mehrere Argumente zum Sortieren sind möglich
data.sort_values(by=["Disaster Type", "Total Affected"], ascending=[True, False]).head(n=10)

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
8426	2014	Niger (the)	NER	Natural	Biological	Animal accident	NaN	1	5.0	12.0	NaN	NaN	87.366298
8610	2015	India	IND	Natural	Climatological	Drought	Drought	1	330000000.0	NaN	3.000000e+09	3.429750e+09	87.469932
2721	1987	India	IND	Natural	Climatological	Drought	Drought	1	300000000.0	300.0	NaN	NaN	41.932736
5563	2002	India	IND	Natural	Climatological	Drought	Drought	1	300000000.0	NaN	9.107220e+08	1.371942e+09	66.381964
1319	1972	India	IND	Natural	Climatological	Drought	Drought	1	200000000.0	NaN	1.000000e+08	6.479944e+08	15.432233
913	1965	India	IND	Natural	Climatological	Drought	Drought	1	100000000.0	1500000.0	1.000000e+08	8.599938e+08	11.627990
2086	1982	India	IND	Natural	Climatological	Drought	Drought	1	100000000.0	NaN	NaN	NaN	35.612841
3753	1994	China	CHN	Natural	Climatological	Drought	Drought	2	88690000.0	NaN	1.375520e+10	2.514584e+10	54.701693
5485	2002	China	CHN	Natural	Climatological	Drought	Drought	3	64560000.0	NaN	1.210000e+09	1.822784e+09	66.381964
7271	2009	China	CHN	Natural	Climatological	Drought	Drought	2	60160000.0	NaN	3.600000e+09	4.546959e+09	79.173803

Indexing and Retriving Data#

Auf die Werte einer Spalte kann <dataframe>['<spaltenname>'] zugegriffen werden.

data['Year']

      1900
      1900
      1900
      1900
      1900
         ... 
  2022
  2022
  2022
  2022
  2022
Name: Year, Length: 10342, dtype: int64

Darauf können weitere Operationen oder Methoden angewendet werden:

data['Year'] + 10

      1910
      1910
      1910
      1910
      1910
         ... 
  2032
  2032
  2032
  2032
  2032
Name: Year, Length: 10342, dtype: int64

data['Year'].mean()

1995.3498356217367

Mehrere Spalten werden ausgewählt indem eine Liste von Spaltennamen übergeben wird

data[['Year', 'Country', 'Disaster Type', 'Total Affected']]

	Year	Country	Disaster Type	Total Affected
1	1900	Cabo Verde	Drought	NaN
2	1900	India	Drought	NaN
3	1900	Jamaica	Flood	NaN
4	1900	Japan	Volcanic activity	NaN
5	1900	Turkey	Earthquake	NaN
...	...	...	...	...
10338	2022	Yemen	Flood	3400.0
10339	2022	South Africa	Flood	143119.0
10340	2022	Zambia	Flood	15000.0
10341	2022	Zimbabwe	Flood	NaN
10342	2022	Zimbabwe	Storm	3000.0

10342 rows × 4 columns

Boolean Indexing#

Die ausgewählten Daten können auch gefilteret werden, in dem eine Bedingung mitgegeben wird.

data[data['Country'] == 'Germany']

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
3121	1990	Germany	DEU	Natural	Meteorological	Storm	NaN	6	NaN	64.0	4.440000e+09	9.208027e+09	48.218797
3289	1991	Germany	DEU	Natural	Meteorological	Storm	NaN	1	NaN	NaN	5.000000e+06	9.948100e+06	50.260853
3431	1992	Germany	DEU	Natural	Geophysical	Earthquake	Ground movement	1	1525.0	1.0	5.000000e+07	9.655648e+07	51.783162
3432	1992	Germany	DEU	Natural	Hydrological	Flood	NaN	1	NaN	NaN	3.010000e+07	5.812700e+07	51.783162
3582	1993	Germany	DEU	Natural	Hydrological	Flood	Riverine flood	1	100000.0	5.0	6.000000e+08	1.125458e+09	53.311620
...	...	...	...	...	...	...	...	...	...	...	...	...	...
9469	2019	Germany	DEU	Natural	Meteorological	Storm	Convective storm	1	NaN	1.0	NaN	NaN	94.349092
9708	2020	Germany	DEU	Natural	Meteorological	Storm	Extra-tropical storm	1	33.0	NaN	NaN	NaN	95.512967
9952	2021	Germany	DEU	Natural	Hydrological	Flood	NaN	1	1000.0	197.0	4.000000e+10	4.000000e+10	100.000000
9953	2021	Germany	DEU	Natural	Meteorological	Storm	Convective storm	2	604.0	1.0	NaN	NaN	100.000000
10188	2022	Germany	DEU	Natural	Meteorological	Storm	Extra-tropical storm	3	2.0	7.0	NaN	NaN	NaN

66 rows × 13 columns

data[data['Total Deaths'] >= 1000]

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
1	1900	Cabo Verde	CPV	Natural	Climatological	Drought	Drought	1	NaN	11000.0	NaN	NaN	3.077091
2	1900	India	IND	Natural	Climatological	Drought	Drought	1	NaN	1250000.0	NaN	NaN	3.077091
6	1900	United States of America (the)	USA	Natural	Meteorological	Storm	Tropical cyclone	1	NaN	6000.0	3.000000e+07	9.749468e+08	3.077091
9	1902	China	CHN	Natural	Geophysical	Earthquake	Ground movement	1	NaN	2500.0	NaN	NaN	3.200175
10	1902	Guatemala	GTM	Natural	Geophysical	Earthquake	Ground movement	1	NaN	2000.0	2.500000e+07	7.812074e+08	3.200175
...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996	2021	India	IND	Natural	Hydrological	Flood	NaN	8	1324439.0	1520.0	3.200000e+09	3.200000e+09	100.000000
10150	2022	Afghanistan	AFG	Natural	Geophysical	Earthquake	Ground movement	2	368627.0	1064.0	NaN	NaN	NaN
10226	2022	India	IND	Natural	Hydrological	Flood	NaN	3	2101260.0	2098.0	NaN	NaN	NaN
10278	2022	Pakistan	PAK	Natural	Hydrological	Flood	NaN	1	33012865.0	1730.0	NaN	NaN	NaN
10320	2022	Uganda	UGA	Natural	Climatological	Drought	Drought	1	518000.0	2465.0	NaN	NaN	NaN

378 rows × 13 columns

Wie viele Menschen sind im Schnitt pro Erdbeben betroffen?

data[data['Disaster Type'] == 'Earthquake']

	Year	Country	ISO	Disaster Group	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	Total Damage (USD, adjusted)	CPI
5	1900	Turkey	TUR	Natural	Geophysical	Earthquake	Ground movement	1	NaN	140.0	NaN	NaN	3.077091
7	1901	Japan	JPN	Natural	Geophysical	Earthquake	Tsunami	1	24.0	18.0	NaN	NaN	3.077091
8	1902	Azerbaijan	AZE	Natural	Geophysical	Earthquake	Ground movement	1	17540.0	86.0	NaN	NaN	3.200175
9	1902	China	CHN	Natural	Geophysical	Earthquake	Ground movement	1	NaN	2500.0	NaN	NaN	3.200175
10	1902	Guatemala	GTM	Natural	Geophysical	Earthquake	Ground movement	1	NaN	2000.0	25000000.0	781207389.0	3.200175
...	...	...	...	...	...	...	...	...	...	...	...	...	...
10276	2022	Pakistan	PAK	Natural	Geophysical	Earthquake	Ground movement	1	NaN	43.0	NaN	NaN	NaN
10281	2022	Peru	PER	Natural	Geophysical	Earthquake	Ground movement	2	602.0	NaN	NaN	NaN	NaN
10285	2022	Philippines (the)	PHL	Natural	Geophysical	Earthquake	Ground movement	2	650278.0	10.0	2272000.0	NaN	NaN
10291	2022	Papua New Guinea	PNG	Natural	Geophysical	Earthquake	Ground movement	1	1969.0	7.0	NaN	NaN	NaN
10317	2022	Taiwan (Province of China)	TWN	Natural	Geophysical	Earthquake	Ground movement	1	140.0	1.0	3700000.0	NaN	NaN

1078 rows × 13 columns

data[data['Disaster Type'] == 'Earthquake']['Total Affected']

           NaN
          24.0
       17540.0
           NaN
          NaN
           ...   
       NaN
     602.0
  650278.0
    1969.0
     140.0
Name: Total Affected, Length: 1078, dtype: float64

data[data['Disaster Type'] == 'Earthquake']['Total Affected'].mean()

233219.1548974943

Fehlende Werte - NaNs Values#

Wo fehlen Werte im Datensatz? Wie viele fehlen?

len(data)

Wie viele Werte pro Spalte sind NaN?#

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10342 entries, 1 to 10342
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Year                          10342 non-null  int64  
 1   Country                       10342 non-null  object 
 2   Disaster Subroup              10342 non-null  object 
 3   Disaster Type                 10342 non-null  object 
 4   Disaster Subtype              8225 non-null   object 
 5   Total Events                  10342 non-null  int64  
 6   Total Affected                7506 non-null   float64
 7   Total Deaths                  7317 non-null   float64
 8   Total Damage (USD, original)  3796 non-null   float64
 9   CPI                           10149 non-null  float64
dtypes: float64(4), int64(2), object(4)
memory usage: 808.1+ KB

data.isnull().sum()

Year                               0
Country                            0
Disaster Subroup                   0
Disaster Type                      0
Disaster Subtype                2117
Total Events                       0
Total Affected                  2836
Total Deaths                    3025
Total Damage (USD, original)    6546
CPI                              193
dtype: int64

data['Disaster Subtype'].isnull()

      False
      False
       True
      False
      False
         ...  
  False
   True
   True
   True
  False
Name: Disaster Subtype, Length: 10342, dtype: bool

data.isnull().sum()

Year                               0
Country                            0
Disaster Subroup                   0
Disaster Type                      0
Disaster Subtype                2117
Total Events                       0
Total Affected                  2836
Total Deaths                    3025
Total Damage (USD, original)    6546
CPI                              193
dtype: int64

Möglichkeiten zum Umgang mit fehlenden Werte#

Zeilen mit fehlenden Werten entfernen
fehlende Werte ersetzen

Alle Zeilen mit NaN Werten entfernen#

dropna()

clean_data = data.dropna()
clean_data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
45	1906	United States of America (the)	Geophysical	Earthquake	Ground movement	1	84564.0	700.0	5.240000e+08	3.323258
48	1907	Jamaica	Geophysical	Earthquake	Ground movement	1	90000.0	1200.0	3.000000e+07	3.446342
50	1908	Italy	Geophysical	Earthquake	Ground movement	1	150000.0	75000.0	1.160000e+08	3.323258
90	1912	Canada	Meteorological	Storm	Convective storm	1	2700.0	28.0	5.000000e+06	3.569425
146	1922	Canada	Climatological	Wildfire	Forest fire	1	11000.0	43.0	8.000000e+06	6.191107
...	...	...	...	...	...	...	...	...	...	...
10127	2021	United States of America (the)	Hydrological	Flood	Riverine flood	1	51.0	1.0	1.500000e+08	100.000000
10128	2021	United States of America (the)	Meteorological	Storm	Convective storm	18	1036.0	390.0	5.253000e+10	100.000000
10129	2021	United States of America (the)	Meteorological	Storm	Tropical cyclone	5	14200.0	104.0	6.905000e+10	100.000000
10131	2021	United States of America (the)	Climatological	Wildfire	Forest fire	2	3397.0	3.0	4.500000e+09	100.000000
10139	2021	Viet Nam	Meteorological	Storm	Tropical cyclone	4	83560.0	11.0	2.650000e+07	100.000000

2115 rows × 10 columns

Entferne nur Zeilen die in einer bestimmten Spalte NaN Werte haben

clean_data = data.dropna(subset=['Disaster Subtype', 'Total Affected'])
clean_data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
7	1901	Japan	Geophysical	Earthquake	Tsunami	1	24.0	18.0	NaN	3.077091
8	1902	Azerbaijan	Geophysical	Earthquake	Ground movement	1	17540.0	86.0	NaN	3.200175
15	1902	Soviet Union	Geophysical	Earthquake	Ground movement	1	125112.0	4562.0	NaN	3.200175
17	1903	Canada	Geophysical	Mass movement (dry)	Rockfall	1	23.0	76.0	NaN	3.323258
23	1903	Turkey	Geophysical	Earthquake	Ground movement	3	60000.0	6260.0	NaN	3.323258
...	...	...	...	...	...	...	...	...	...	...
10333	2022	Venezuela (Bolivarian Republic of)	Meteorological	Storm	Convective storm	1	30000.0	NaN	NaN	NaN
10335	2022	Viet Nam	Meteorological	Storm	Convective storm	2	54912.0	11.0	NaN	NaN
10336	2022	Viet Nam	Meteorological	Storm	Tropical cyclone	1	436978.0	19.0	NaN	NaN
10338	2022	Yemen	Hydrological	Flood	Flash flood	1	3400.0	13.0	NaN	NaN
10342	2022	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	3000.0	NaN	NaN	NaN

5929 rows × 10 columns

Fehlende Werte entfernen#

Vorteile

einfache Handhabung
verständliche Methodik

Nachteile

ein großer Teil der Daten wird nicht ausgewertet
fehlende Werte können eine ganz andere Geschichte erzählen
die Konsistenz über unterschiedliche Fragestellungen hinweg muss sichergestellt sein

Fehlende Werte ersetzen#

Mit .fillna() können alle NaN Werte in einem DataFrame oder einer Series automatisch ersetzt werden.

Das Argument inplace=True wendet die Methode direkt auf den DataFrame an.

data.fillna(0)

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
1	1900	Cabo Verde	Climatological	Drought	Drought	1	0.0	11000.0	0.000000e+00	3.077091
2	1900	India	Climatological	Drought	Drought	1	0.0	1250000.0	0.000000e+00	3.077091
3	1900	Jamaica	Hydrological	Flood	0	1	0.0	300.0	0.000000e+00	3.077091
4	1900	Japan	Geophysical	Volcanic activity	Ash fall	1	0.0	30.0	0.000000e+00	3.077091
5	1900	Turkey	Geophysical	Earthquake	Ground movement	1	0.0	140.0	0.000000e+00	3.077091
...	...	...	...	...	...	...	...	...	...	...
10338	2022	Yemen	Hydrological	Flood	Flash flood	1	3400.0	13.0	0.000000e+00	0.000000
10339	2022	South Africa	Hydrological	Flood	0	7	143119.0	562.0	3.164000e+09	0.000000
10340	2022	Zambia	Hydrological	Flood	0	1	15000.0	3.0	0.000000e+00	0.000000
10341	2022	Zimbabwe	Hydrological	Flood	0	1	0.0	0.0	0.000000e+00	0.000000
10342	2022	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	3000.0	0.0	0.000000e+00	0.000000

10342 rows × 10 columns

data["Disaster Subtype"].fillna("No Subtype", inplace=True)

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
1	1900	Cabo Verde	Climatological	Drought	Drought	1	NaN	11000.0	NaN	3.077091
2	1900	India	Climatological	Drought	Drought	1	NaN	1250000.0	NaN	3.077091
3	1900	Jamaica	Hydrological	Flood	No Subtype	1	NaN	300.0	NaN	3.077091
4	1900	Japan	Geophysical	Volcanic activity	Ash fall	1	NaN	30.0	NaN	3.077091
5	1900	Turkey	Geophysical	Earthquake	Ground movement	1	NaN	140.0	NaN	3.077091
...	...	...	...	...	...	...	...	...	...	...
10338	2022	Yemen	Hydrological	Flood	Flash flood	1	3400.0	13.0	NaN	NaN
10339	2022	South Africa	Hydrological	Flood	No Subtype	7	143119.0	562.0	3.164000e+09	NaN
10340	2022	Zambia	Hydrological	Flood	No Subtype	1	15000.0	3.0	NaN	NaN
10341	2022	Zimbabwe	Hydrological	Flood	No Subtype	1	NaN	NaN	NaN	NaN
10342	2022	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	3000.0	NaN	NaN	NaN

10342 rows × 10 columns

Welche Werte für Total Affected, Total Deaths oder Total Damage?

0
arithmetisches Mittel
Median
Berechnung aus anderen Spalten

data["Total Deaths"].fillna(0, inplace=True)

data["Total Affected"].fillna(data['Total Deaths'], inplace=True)

data

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
1	1900	Cabo Verde	Climatological	Drought	Drought	1	11000.0	11000.0	NaN	3.077091
2	1900	India	Climatological	Drought	Drought	1	1250000.0	1250000.0	NaN	3.077091
3	1900	Jamaica	Hydrological	Flood	No Subtype	1	300.0	300.0	NaN	3.077091
4	1900	Japan	Geophysical	Volcanic activity	Ash fall	1	30.0	30.0	NaN	3.077091
5	1900	Turkey	Geophysical	Earthquake	Ground movement	1	140.0	140.0	NaN	3.077091
...	...	...	...	...	...	...	...	...	...	...
10338	2022	Yemen	Hydrological	Flood	Flash flood	1	3400.0	13.0	NaN	NaN
10339	2022	South Africa	Hydrological	Flood	No Subtype	7	143119.0	562.0	3.164000e+09	NaN
10340	2022	Zambia	Hydrological	Flood	No Subtype	1	15000.0	3.0	NaN	NaN
10341	2022	Zimbabwe	Hydrological	Flood	No Subtype	1	0.0	0.0	NaN	NaN
10342	2022	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	3000.0	0.0	NaN	NaN

10342 rows × 10 columns

Fehlende Werte ersetzen#

Vorteile

Verwendung möglichst aller Daten
genaue Analyse der fehlenden Werte kann zu neuen Erkenntnissen führen

Nachteile

keine Standard Methodik
eventuelle Verfälschung der Daten

Duplikate#

.duplicated() findet Zeilen mit genau den gleichen Werte.

Mit drop_duplicates() können diese entfernt werden.

data.duplicated().sum()

data.drop_duplicates()

	Year	Country	Disaster Subroup	Disaster Type	Disaster Subtype	Total Events	Total Affected	Total Deaths	Total Damage (USD, original)	CPI
1	1900	Cabo Verde	Climatological	Drought	Drought	1	11000.0	11000.0	NaN	3.077091
2	1900	India	Climatological	Drought	Drought	1	1250000.0	1250000.0	NaN	3.077091
3	1900	Jamaica	Hydrological	Flood	No Subtype	1	300.0	300.0	NaN	3.077091
4	1900	Japan	Geophysical	Volcanic activity	Ash fall	1	30.0	30.0	NaN	3.077091
5	1900	Turkey	Geophysical	Earthquake	Ground movement	1	140.0	140.0	NaN	3.077091
...	...	...	...	...	...	...	...	...	...	...
10338	2022	Yemen	Hydrological	Flood	Flash flood	1	3400.0	13.0	NaN	NaN
10339	2022	South Africa	Hydrological	Flood	No Subtype	7	143119.0	562.0	3.164000e+09	NaN
10340	2022	Zambia	Hydrological	Flood	No Subtype	1	15000.0	3.0	NaN	NaN
10341	2022	Zimbabwe	Hydrological	Flood	No Subtype	1	0.0	0.0	NaN	NaN
10342	2022	Zimbabwe	Meteorological	Storm	Tropical cyclone	1	3000.0	0.0	NaN	NaN

10342 rows × 10 columns

Daten speichern & Neu laden#

data.to_csv('../../data/clean_emdat.csv')

Explorative Datenanalyse

Contents

Explorative Datenanalyse#

Ziele#

Was sind Daten?#

Strukturierte Daten#

Unstrukturierte Daten#

Semi-Strukturierte Daten#

Unterschiedliche Ebenen an Information in Daten#

Datenpipeline#

Explorative Datenanalyse: Weltweite Naturkatastrophen#

Idee, Daten finden & Verifikation#

Aggregated figures for Natural Disasters in EM-DAT#

Datenexploration und -bereinigung#

Überblick über die Daten#

Data Cleaning Teil I: erste Zeile im DataFrame entfernen#

Datentypen abfragen und anpassen#

Data Cleaning Teil II: Datentypen umwandeln#

Überblick über die numerischen Daten#

Überblick über die Objekt Daten#

Data Cleaning Teil III: Germany vereinheitlichen#

Weiter mit der Datenexploration#

Dataframes Sortieren#

Indexing and Retriving Data#

Boolean Indexing#

Fehlende Werte - NaNs Values#

Wie viele Werte pro Spalte sind NaN?#

Möglichkeiten zum Umgang mit fehlenden Werte#

Alle Zeilen mit NaN Werten entfernen#

Fehlende Werte entfernen#

Fehlende Werte ersetzen#

Fehlende Werte ersetzen#

Duplikate#

Daten speichern & Neu laden#

Data Cleaning Teil III: `Germany` vereinheitlichen#