Skip to main content

SNCF Open Data: found items - 2

· 8 min read
Hugo Le Moine

Map of found items

Following a previous article analyzing monthly evolution as well as concentration of found items, here is another perspective on these data made available by the SNCF. Obviously, the higher the number of passengers, the higher the number of found items. Now, let's look into the details!

On this map are represented the number of found items per 100k passengers between 2015 and 2018. Only stations with more than 10 found items are shown.

What can we conclude from this visualization ?

Clearly, there is a significant difference between the area of Paris/north of France, and anywhere else in the country. Nonetheless, one should avoid easy conclusion: are people more careful about their belongings or less prone to hand found items back ? The two options are possibly and even likely connected, but additional data are required to investigate. Perhaps in another part :)

Data

Libraries

The following libraries are imported:

  • pandas and numpy for data processing
  • plotly.colors to use a specific colorscale
  • plotly.graph_object for data visualization
import pandas as pd						
import numpy as np
import plotly.colors
import plotly.graph_objects as go

Processing

1. Reading csv files

df_frequentation = pd.read_csv('data/frequentation-gares.csv', sep=';')
df_gares = pd.read_csv('data/referentiel-gares-voyageurs.csv', sep=';')
df_items = pd.read_csv('data/objets-trouves-restitution.csv', sep=';')

These two dataframes contain contain the following data:

  • df_frequentation: train station attendance data

  • df_gares: train station data, including latitude and longitude.

  • df_items: found item data, including date and location.

Sample data from df_frequentation

Nom de la gareCode UIC completCode postalSegmentation DRG 2018Total Voyageurs 2018Total Voyageurs + Non voyageurs 2018Total Voyageurs 2017Total Voyageurs + Non voyageurs 2017Total Voyageurs 2016Total Voyageurs + Non voyageurs 2016Total Voyageurs 2015Total Voyageurs + Non voyageurs 2015
0Abancourt8731375960220c402284022843760437604109641096.5516143972039720
1Agay8775755983530c150931509314154141541924019240.5143701912119121
2Agde8778127834300a588297735372697091871364660656825820.929253662516828146
3Agonac8759515724460c149214921583158311341134.69999611271127
4Aigrefeuille Le Thou8748519317290c18670186701451314513266266.15714400

Sample data from df_gares

Code plate-formeIntitulé gareIntitulé fronton de gareGare DRGGare étrangèreAgence gareRégion SNCFUnité gareUTNbre plateformes...Longitude WGS84Latitude WGS84Code UICTVSSegment DRGNiveau de serviceSOPRGDate fin validité plateformeWGS 84
000007-1Bourg-MadameBourg-MadameTrueFalseAgence Grand SudREGION LANGUEDOC-ROUSSILLONUG Languedoc RoussillonBOURG MADAME GARE1...1.94867042.43240787784876BMDc1.0NaNGARES C LANGUEDOC ROUSSILLONNaN42.4324069,1.9486704
100014-1Bolquère - EyneBolquère - EyneTrueFalseAgence Grand SudREGION LANGUEDOC-ROUSSILLONUG Languedoc RoussillonBOLQUERE EYNE GARE1...2.08755942.49787387784801BQEc1.0NaNGARES C LANGUEDOC ROUSSILLONNaN42.4978734,2.0875591
200015-1Mont-Louis - La CabanasseMont-Louis - La CabanasseTrueFalseAgence Grand SudREGION LANGUEDOC-ROUSSILLONUG Languedoc RoussillonMONT LOUIS LA CABANASSE GARE1...2.11313842.50209087784793MTCc1.0NaNGARES C LANGUEDOC ROUSSILLONNaN42.5020902,2.1131379

Sampel data from df_items

DateDate et heure de restitutionGareCode UICNature d'objetsType d'objetsType d'enregistrement
02014-03-09T14:25:29+01:00NaNParis Montparnasse87391003.0Manteau, veste, blazer, parka, blouson, capeVêtements, chaussuresObjet trouvé
12018-01-23T15:07:32+01:00NaNSaint-Étienne Châteaucreux87726000.0MontreBijoux, montresObjet trouvé
22018-02-06T15:35:49+01:00NaNRennes87471003.0Clés, porte-clésClés, porte-clés, badge magnétiqueObjet trouvé

2. Merging dataframes

The three dataframes have a column identifier: the UIC code ("Union Internationale des Chemins de fer", International Union of Railways in french). We can merge them based on this unique value for each station.

df = df_gares.merge(
right=df_frequentation,
left_on='Code UIC',
right_on='Code UIC complet',
how='inner')

df = df.merge(df_items, on='Code UIC', how='right')

3. Conversion to datetime

For now, the DataFrame has a Date column, but was not assigned any particular format.

df.Date.dtype
>> dtype('O')

In order to have a standardized datetime, pandas.to_datetime can be applied to get a datetime, and then tz_convert to make it french time.

df['Date'] = pd.to_datetime(df.Date, utc=True).dt.tz_convert('Europe/Paris')

4. Filtering dates

As I want to analyze the number of found items according to the number of passengers, I must filter out data which is not in the period 2015-2018 as these are the only years for which the number of passengers is available.

df = df.loc[(df.Date.dt.year >= 2015) & (df.Date.dt.year <= 2018)]

5. Total number of passengers per station

df['Passengers'] =(df['Total Voyageurs 2018'] 
+ df['Total Voyageurs 2017']
+ df['Total Voyageurs 2016']
+ df['Total Voyageurs 2015'])

6. Grouping data by train station

Data is grouped by station name, geographical coordinates are kept, as well as the count of found items, and the number of passengers.

df = (df.groupby(by="Intitulé gare")
.agg({'Longitude WGS84': 'first',
'Latitude WGS84': 'first',
'Nature d\'objets': len,
'Passengers': 'first'})
.rename(columns={'Nature d\'objets': 'item_count'})
)

7. Count per 100k

In this blog post, the focus is put on the ratio item/pax. A straightforward division would lead to number hard to imagine. Therefore, using a ratio per 100k passengers is more interesting. I restricted the selection to stations having at least 10 found items .

df['count_per_100kpax'] = df.item_count / df.Passengers * 1e5
df = df.sort_values(by='count_per_100kpax', ascending=False)
df = df[df.item_count >= 10]

8. Creation of categories

To group data by category on the map, we need to bin them. I chose a standard quantile binning with 20% of data in each bin.

df['group'] = pd.qcut(df['count_per_100kpax'], q=np.linspace(0, 1, 6))

9. What we have so far

Intitulé gareLongitude WGS84Latitude WGS84item_countPassengerscount_per_100kpaxcategory
Versailles Rive Droite2.13475248.8096533921309519.00.183017(0.182, 3.623]
Poissy2.04136848.9329018043120958.00.185525(0.182, 3.623]
.....................
Hendaye-1.78172443.35313228731422622.0201.951045(24.664, 323.36]
Le Croisic-2.50744247.2898361358419966.0323.359510(24.664, 323.36]

Visualization

Now that all the processing part is completed, we can proceed to the visualization.

fig = go.Figure()
colors = plotly.colors.sequential.Inferno # Custom colormap

# One scattermapbox per group
for i, group in enumerate(df.group.cat.categories):
df_sub = df[df.group == group]
fig.add_trace(go.Scattermapbox(
lat=df_sub['Latitude WGS84'],
lon=df_sub['Longitude WGS84'],
text=df_sub.index,
marker=dict(
color=colors[2*i],
size=df_sub['count_per_100kpax'],
sizemin=3,
sizeref=.35,
sizemode='area',
opacity=.8,
),
meta=df_sub['item_count'],
hovertemplate="%{text}" + "<br>"
+ "Found items: %{meta}" + "<br>"
+ "Per 100kPax: " + "%{marker.size:.1f}",
name=f'> {cat.left:.0f} per 100kPax',
))
# Defining map style, margins, and original position
fig.update_layout(
mapbox_style="open-street-map",
#title='Passengers per french train station in 2018',
margin={'l': 0, 'r': 0, 't': 0, 'b': 0},
mapbox=dict(
center={'lon': 2.39, 'lat': 47.09},
zoom=4
),
)
# Legend layout
fig.update_layout(legend={'orientation': 'h', 'y': 0})

Link to the Jupyter notebook.