
· 4 min read
Hugo Le Moine

I opened my Twitter account @hugo_le_moine_ back in 2012, during my high school years, and have mostly used it to retweet and like content, treating it like a bookmarking tool. Over time, I accumulated a couple hundred tweets, and I certainly don't remember more than 95% of them.

It's now time for a deep cleaning!

But... Twitter does not let you delete more than one tweet at a time. So how can this be automated?


Twitter API

Leveraging the Twitter API is the first option I explored. Twitter has built quite a comprehensive set of endpoints enabling developers to:

  • Post and delete a tweet / retweet / like / quote / bookmark

  • Search tweets

  • Follow / unfollow

Applying for a developer account

Using the Twitter API requires a developer account. The "Essential" level was enough for this tiny project, but as I plan to try other functionalities later on, I got the "Elevated" level.

Creating an app and getting your keys and tokens

Before we can dive into the API and play with the endpoints, we need some more configuration: creating an app.

Note: the generated keys cannot be displayed again. If you forget to write them down somewhere, you can still regenerate them.

In the "Settings" tab, the "OAuth 1.0a" must be activated, with "Read and write" permissions. The callback URI / Redirect URL can be your own github page.

In the Keys and tokens tab, you can now get your consumer keys and generate authentication tokens for your personal account.


Tweepy

APIs can be used from most programming languages, and as Python is the one I am most comfortable with, I looked for a library that wraps the Twitter API for convenience. I came across tweepy.

Credentials

Your credentials can be stored in a YAML file to avoid hard-coding them in your Python files (do not commit your keys!). For example, in delete-my-tweets.yaml:

delete-my-tweets-dev:
  consumer_key: ***********
  consumer_secret: **************
  access_token: *************************
  access_token_secret: **********************
  bearer_token: ********************************************

Then, you can load them using pyyaml:

import yaml

# https://stackoverflow.com/a/1774043/13765085
with open("delete-my-tweets.yaml", "r") as stream:
    try:
        parameters = yaml.safe_load(stream)['delete-my-tweets-dev']
    except yaml.YAMLError as exc:
        print(exc)
Initialize client
import tweepy

client = tweepy.Client(
    bearer_token=parameters['bearer_token'],
    consumer_key=parameters['consumer_key'],
    consumer_secret=parameters['consumer_secret'],
    access_token=parameters['access_token'],
    access_token_secret=parameters['access_token_secret'],
    wait_on_rate_limit=True,  # to automatically wait when the rate limit is exceeded
)
Retrieve user's tweets

To get the list of tweets from a user, you need to provide the id associated with the account. You can obtain it with the get_user method, passing your username.

user = client.get_user(username='hugo_le_moine_')

We can now retrieve the account's tweets. The get_users_tweets method collects tweets 100 at a time using pagination: each call returns at most max_results tweets, along with a next_token that can be passed as an argument to the next call to collect the following tweets.

Note that this method can only retrieve the 3,200 most recent tweets.

We can write a recursive function that keeps digging until no next_token is provided, and returns all the tweets as a list.

def collect_tweet_ids(next_token=None):
    # timeline contains tweets and metadata about the user
    timeline = client.get_users_tweets(
        id=user.data.id,
        max_results=100,
        pagination_token=next_token
    )

    # get the list of tweets
    tweet_list = timeline.data

    # if a next page exists, append its results to the list of tweets
    if 'next_token' in timeline.meta:
        tweet_list += collect_tweet_ids(timeline.meta['next_token'])
    return tweet_list

We can now collect the tweet history:

tweets = collect_tweet_ids()
Delete tweets

Finally, we can use the tweet ids to delete them using the delete_tweet method.

Warning: Twitter rate-limits tweet deletion to 50 requests per 15-minute window. Deleting all tweets can take a while: more than 8 hours in my case.
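With that cap in mind, we can roughly estimate how long the cleanup will take (a back-of-the-envelope sketch, based on the 50-deletions-per-15-minutes limit mentioned above):

n_tweets = len(tweets)
minutes = n_tweets / 50 * 15   # one 15-minute window per batch of 50 deletions
print(f"Deleting {n_tweets} tweets will take roughly {minutes / 60:.1f} hours")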

I use tqdm to display a progress bar.

from tqdm import tqdm

for tweet in tqdm(tweets):
    client.delete_tweet(tweet.id)
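If a deletion fails partway through (for example on a tweet that was already removed), the loop above stops. A slightly more defensive variant, assuming tweepy 4.x where tweepy.TweepyException is the base exception class, skips failures and keeps going:

import tweepy
from tqdm import tqdm

for tweet in tqdm(tweets):
    try:
        client.delete_tweet(tweet.id)
    except tweepy.TweepyException as exc:
        # log the failure and move on to the next tweet
        print(f"Could not delete tweet {tweet.id}: {exc}")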

That's it! I can now resume using Twitter with a seemingly blank account, except that I get to keep all my settings and followings. I learnt to use the Twitter API and its Python wrapper tweepy, not in depth, but I will quite likely get back to it in the future!

Has it been useful to you? Did you find a faster way? Any feedback will be appreciated: hit me up on Twitter @hugo_le_moine_!

· 8 min read
Hugo Le Moine

Map of found items

Following a previous article analyzing the monthly evolution and the concentration of found items, here is another perspective on this data made available by the SNCF. Obviously, the higher the number of passengers, the higher the number of found items. Now, let's look into the details!

This map shows the number of found items per 100k passengers between 2015 and 2018. Only stations with at least 10 found items are shown.

What can we conclude from this visualization?

Clearly, there is a significant difference between the Paris area and the north of France on one side, and the rest of the country on the other. Nonetheless, one should avoid easy conclusions: are people elsewhere more careful about their belongings, or less prone to handing found items back? The two explanations are possibly, and even likely, connected, but additional data are required to investigate. Perhaps in another part :)

Data

Libraries

The following libraries are imported:

  • pandas and numpy for data processing
  • plotly.colors to use a specific colorscale
  • plotly.graph_objects for data visualization
import pandas as pd
import numpy as np
import plotly.colors
import plotly.graph_objects as go

Processing

1. Reading csv files

df_frequentation = pd.read_csv('data/frequentation-gares.csv', sep=';')
df_gares = pd.read_csv('data/referentiel-gares-voyageurs.csv', sep=';')
df_items = pd.read_csv('data/objets-trouves-restitution.csv', sep=';')

These three dataframes contain the following data:

  • df_frequentation: train station attendance data

  • df_gares: train station data, including latitude and longitude.

  • df_items: found item data, including date and location.

Sample data from df_frequentation

Its columns are: Nom de la gare, Code UIC complet, Code postal, Segmentation DRG 2018, and the yearly totals Total Voyageurs / Total Voyageurs + Non voyageurs for 2015 through 2018. The sample rows cover the stations Abancourt, Agay, Agde, Agonac and Aigrefeuille Le Thou.

Sample data from df_gares

Its columns include: Code plate-forme, Intitulé gare, Intitulé fronton de gare, Gare DRG, Gare étrangère, Agence gare, Région SNCF, Unité gare, ..., Longitude WGS84, Latitude WGS84, Code UIC, TVS, Segment DRG, Niveau de service, and WGS 84. The sample rows cover the stations Bourg-Madame, Bolquère - Eyne and Mont-Louis - La Cabanasse.

Sample data from df_items

Date | Date et heure de restitution | Gare | Code UIC | Nature d'objets | Type d'objets | Type d'enregistrement
2014-03-09T14:25:29+01:00 | NaN | Paris Montparnasse | 87391003.0 | Manteau, veste, blazer, parka, blouson, cape | Vêtements, chaussures | Objet trouvé
2018-01-23T15:07:32+01:00 | NaN | Saint-Étienne Châteaucreux | 87726000.0 | Montre | Bijoux, montres | Objet trouvé
2018-02-06T15:35:49+01:00 | NaN | Rennes | 87471003.0 | Clés, porte-clés | Clés, porte-clés, badge magnétique | Objet trouvé

2. Merging dataframes

The three dataframes share an identifier column: the UIC code ("Union Internationale des Chemins de fer", International Union of Railways in French). We can merge them on this value, which is unique for each station.

df = df_gares.merge(
    right=df_frequentation,
    left_on='Code UIC',
    right_on='Code UIC complet',
    how='inner')

df = df.merge(df_items, on='Code UIC', how='right')

3. Conversion to datetime

For now, the DataFrame has a Date column, but it has not been assigned any particular format.

df.Date.dtype
>> dtype('O')

To get a standardized datetime, pandas.to_datetime can be applied, followed by tz_convert to convert it to French time.

df['Date'] = pd.to_datetime(df.Date, utc=True).dt.tz_convert('Europe/Paris')
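We can check that the column is now timezone-aware (the exact repr may differ slightly across pandas versions):

df.Date.dtype
>> datetime64[ns, Europe/Paris]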

4. Filtering dates

As I want to analyze the number of found items according to the number of passengers, I must filter out data which is not in the period 2015-2018 as these are the only years for which the number of passengers is available.

df = df.loc[(df.Date.dt.year >= 2015) & (df.Date.dt.year <= 2018)]

5. Total number of passengers per station

df['Passengers'] = (df['Total Voyageurs 2018']
                    + df['Total Voyageurs 2017']
                    + df['Total Voyageurs 2016']
                    + df['Total Voyageurs 2015'])

6. Grouping data by train station

Data is grouped by station name, keeping the geographical coordinates, the count of found items, and the number of passengers.

df = (df.groupby(by="Intitulé gare")
      .agg({'Longitude WGS84': 'first',
            'Latitude WGS84': 'first',
            'Nature d\'objets': len,
            'Passengers': 'first'})
      .rename(columns={'Nature d\'objets': 'item_count'})
      )

7. Count per 100k

In this blog post, the focus is on the item/passenger ratio. A straightforward division would yield numbers that are hard to grasp, so a ratio per 100k passengers is more meaningful. I restricted the selection to stations with at least 10 found items.

df['count_per_100kpax'] = df.item_count / df.Passengers * 1e5
df = df.sort_values(by='count_per_100kpax', ascending=False)
df = df[df.item_count >= 10]

8. Creation of categories

To group data by category on the map, we need to bin it. I chose standard quantile binning, with 20% of the data in each bin.

df['group'] = pd.qcut(df['count_per_100kpax'], q=np.linspace(0, 1, 6))

9. What we have so far

Intitulé gare | Longitude WGS84 | Latitude WGS84 | item_count | Passengers | count_per_100kpax | category
Versailles Rive Droite | 2.134752 | 48.809653 | 39 | 21309519.0 | 0.183017 | (0.182, 3.623]
Poissy | 2.041368 | 48.932901 | 80 | 43120958.0 | 0.185525 | (0.182, 3.623]
... | ... | ... | ... | ... | ... | ...
Hendaye | -1.781724 | 43.353132 | 2873 | 1422622.0 | 201.951045 | (24.664, 323.36]
Le Croisic | -2.507442 | 47.289836 | 1358 | 419966.0 | 323.359510 | (24.664, 323.36]

Visualization

Now that the processing is complete, we can proceed to the visualization.

fig = go.Figure()
colors = plotly.colors.sequential.Inferno  # custom colormap

# One scattermapbox per group
for i, group in enumerate(df.group.cat.categories):
    df_sub = df[df.group == group]
    fig.add_trace(go.Scattermapbox(
        lat=df_sub['Latitude WGS84'],
        lon=df_sub['Longitude WGS84'],
        text=df_sub.index,
        marker=dict(
            color=colors[2*i],
            size=df_sub['count_per_100kpax'],
            sizemin=3,
            sizeref=.35,
            sizemode='area',
            opacity=.8,
        ),
        meta=df_sub['item_count'],
        hovertemplate="%{text}" + "<br>"
                      + "Found items: %{meta}" + "<br>"
                      + "Per 100kPax: " + "%{marker.size:.1f}",
        name=f'> {group.left:.0f} per 100kPax',
    ))

# Defining map style, margins, and initial position
fig.update_layout(
    mapbox_style="open-street-map",
    # title='Passengers per french train station in 2018',
    margin={'l': 0, 'r': 0, 't': 0, 'b': 0},
    mapbox=dict(
        center={'lon': 2.39, 'lat': 47.09},
        zoom=4
    ),
)

# Legend layout
fig.update_layout(legend={'orientation': 'h', 'y': 0})

Link to the Jupyter notebook.

· One min read
Hugo Le Moine

Based on user searches, Google Flu Trends tried to estimate flu prevalence in populations. In this simple visualization, I try a similar, yet simplified, approach. Below, reported USA COVID-19 cases are compared against Google Trends popularity for "Coronavirus" and "Worldometer" (a website used worldwide to track the pandemic).
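As an illustration, the Google Trends side of the comparison could be fetched programmatically with the pytrends package (not used in the original post; the keyword list, timeframe and geo below are illustrative):

from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=360)
pytrends.build_payload(
    kw_list=['Coronavirus', 'Worldometer'],
    timeframe='2020-01-01 2020-07-01',  # illustrative date range
    geo='US',
)
trends = pytrends.interest_over_time()  # relative popularity, scaled 0-100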

Timeline

  • January 2020: quick spread in China
  • Mid-February: spread to Europe
  • "Coronavirus" popularity peaks on 15th March, when the USA case growth rate is maximal.
  • "Worldometer" popularity peaks two weeks later.
  • 11th April: the USA 7-day moving average peaks.
  • 31st May: the USA 7-day moving average bottoms at a 2-month low.
  • Since June: cases are hitting new daily records, but the popularity of "Coronavirus" remains low.

Data

Link to the Jupyter notebook.

· 7 min read
Hugo Le Moine

Monthly found items in french train stations

We can view found items as a proxy for passenger traffic. On this first chart, we can clearly identify a few traffic disruptions:

  • April 2018 to June 2018: intermittent strikes against the project to reform the SNCF.
  • December 2019: strike against the pension reform.
  • March 2020 to June 2020: Covid-19 restrictions.

We can also observe a peak in July each year, corresponding to summer vacations for most people.

18% of stations represent 80% of found items

Data

Libraries

The following libraries are imported:

  • pandas and numpy for data processing
  • plotly.colors to use a specific colorscale
  • plotly.graph_objects for data visualization
import pandas as pd
import numpy as np
import plotly.colors
import plotly.graph_objects as go

Processing

1. Reading csv files

df_gares = pd.read_csv('data/referentiel-gares-voyageurs.csv', sep=';')
df_items = pd.read_csv('data/objets-trouves-restitution.csv', sep=';')

These two dataframes contain the following data:

  • df_gares: train station data, including latitude and longitude.
  • df_items: found item data, including date and location.

Sample data from df_gares

Its columns include: Code plate-forme, Intitulé gare, Intitulé fronton de gare, Gare DRG, Gare étrangère, Agence gare, Région SNCF, Unité gare, ..., Longitude WGS84, Latitude WGS84, Code UIC, TVS, Segment DRG, Niveau de service, and WGS 84. The sample rows cover the stations Bourg-Madame, Bolquère - Eyne and Mont-Louis - La Cabanasse.

Sample data from df_items

Date | Date et heure de restitution | Gare | Code UIC | Nature d'objets | Type d'objets | Type d'enregistrement
2014-03-09T14:25:29+01:00 | NaN | Paris Montparnasse | 87391003.0 | Manteau, veste, blazer, parka, blouson, cape | Vêtements, chaussures | Objet trouvé
2018-01-23T15:07:32+01:00 | NaN | Saint-Étienne Châteaucreux | 87726000.0 | Montre | Bijoux, montres | Objet trouvé
2018-02-06T15:35:49+01:00 | NaN | Rennes | 87471003.0 | Clés, porte-clés | Clés, porte-clés, badge magnétique | Objet trouvé

2. Conversion to datetime

For now, the DataFrame has a Date column, but it has not been assigned any particular format.

df.Date.dtype
>> dtype('O')

To get a standardized datetime, pandas.to_datetime can be applied, followed by tz_convert to convert it to French time.

df['Date'] = pd.to_datetime(df.Date, utc=True).dt.tz_convert('Europe/Paris')

3. Filtering dates

To know when data collection was actually put in place, let's analyse the dates contained in the DataFrame.

df.Date.dt.year.value_counts().sort_index()
Date     Count
2013     15698
2014    102793
2015    111706
2016    107595
2017    107662
2018    116308
2019    122637
2020     26143

For consistency, I decided to drop the 2013 data: collection may not have started everywhere, and had definitely not started in January 2013. If we want to dig deeper into the data and check correlations with passenger counts, this would be hard to achieve with incomplete data. Therefore, only data collected from 2014 onwards will be kept:

df = df[df.Date.dt.year >= 2014]

5. Grouping data by year and month

In the first chart, I wanted to plot found item count for each month in the dataset. To do so, we need to group by year, then month.

df_months = df.groupby(by=[df.Date.dt.year, df.Date.dt.month]).Date.count()

The result is the following:

2014  1     5788
      2     5815
      3     7012
      4     8543
      5     8657
            ...
2020  2     9271
      3     4473
      4       72
      5     1386
      6     3649

Now we need to get the index back to datetime, combining years and months.

df_months.index = df_months.index.map(lambda t: pd.to_datetime('-'.join(map(str, t))))
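For instance, for the first index entry (2014, 1), the lambda builds the string '2014-1', which pandas interprets as January 2014:

'-'.join(map(str, (2014, 1)))  # -> '2014-1'
pd.to_datetime('2014-1')       # -> Timestamp('2014-01-01 00:00:00')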

6. Grouping data by train station

The second chart was made after grouping by train station. It can be achieved by the following instructions:

df_grouped = (df
              .groupby(by='Gare')
              .count()
              .Date
              .rename('Count')
              .sort_values(ascending=False)
              )

This first groups by train station name, then applies a count aggregation, keeps the Date column renamed to Count, and sorts the values from highest to lowest count.

7. Pareto

The df_grouped DataFrame allows us to demonstrate an application of the Pareto principle. The following line does three things:

  • compute the cumulative found item count of df_grouped
  • compute the sum of all items in df_grouped
  • flag the cumulative counts below 80% of the total number of items, and count them
pareto = (df_grouped.cumsum() < df_grouped.sum() * 0.8).sum()
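For readability, the same computation can be broken into intermediate steps (equivalent to the one-liner above):

cumulative = df_grouped.cumsum()         # running total of found items, station by station
threshold = df_grouped.sum() * 0.8       # 80% of all found items
pareto = (cumulative < threshold).sum()  # number of stations whose cumulative count stays below the threshold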

The df_grouped DataFrame contains 181 train stations, and 32 of them (18%) account for 80% of the total count of found items.

8. Filtering grouped data

Using the Pareto principle described above, I filtered df_grouped to keep the first 32 rows for visualization purposes.

df_grouped = df_grouped.iloc[:pareto]

Visualization

Now that the processing is complete, we can proceed to the visualization.

1. Monthly found items

The first chart is simply a bar chart of the total count of found items per month. x will therefore be the index of the dataframe df_months, and y will be the count of found items for the associated month.

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=df_months.index,
        y=df_months,
        hoverinfo="x+y",
        marker_color="#007bff",
    ))

Then, a y-axis title is added, as well as margins, to improve the layout.

fig.update_layout(
    yaxis=dict(title="Found items in train stations", titlefont=dict(size=16)),
    margin={'l': 30, 'r': 30, 't': 50, 'b': 0},
)

2. Found items per train station

The second chart is also a bar chart. The data plotted are the counts of found items in each train station. As mentioned in the processing part, only the first 32 stations are displayed, ordered by number of found items.

fig.add_trace(
    go.Bar(
        x=np.arange(1, len(df_grouped) + 1),
        y=df_grouped,
        text=df_grouped.index,
        meta=df_grouped,
        hoverinfo="text+y",
        name="Found items",
        marker_color="#007bff"
    ))

The final step is adjusting the layout: margins, axis titles and log scale for the y axis.


fig.update_layout(
    margin={'l': 20, 'r': 0, 't': 0, 'b': 250},
    xaxis=dict(tickangle=-90,
               tickfont={'size': 14}),
    yaxis=dict(title="Found items per station",
               type="log",
               titlefont=dict(size=16))
)

Link to the Jupyter notebook.

· 5 min read
Hugo Le Moine

Passengers per french train station in 2018

Data

Libraries

The following libraries are imported:

  • pandas and numpy for data processing
  • plotly.colors to use a specific color scale
  • plotly.graph_objects for data visualization
import pandas as pd
import numpy as np
import plotly.colors
import plotly.graph_objects as go

Processing

1. Reading csv files

df_frequentation = pd.read_csv('data/frequentation-gares.csv', sep=';')
df_gares = pd.read_csv('data/referentiel-gares-voyageurs.csv', sep=';')

Sample data from df_frequentation

Its columns are: Nom de la gare, Code UIC complet, Code postal, Segmentation DRG 2018, and the yearly totals Total Voyageurs / Total Voyageurs + Non voyageurs for 2015 through 2018. The sample rows cover the stations Abancourt, Agay, Agde, Agonac and Aigrefeuille Le Thou.

Sample data from df_gares

Its columns include: Code plate-forme, Intitulé gare, Intitulé fronton de gare, Gare DRG, Gare étrangère, Agence gare, Région SNCF, Unité gare, ..., Longitude WGS84, Latitude WGS84, Code UIC, TVS, Segment DRG, Niveau de service, and WGS 84. The sample rows cover the stations Bourg-Madame, Bolquère - Eyne, Mont-Louis - La Cabanasse and Thuès les Bains.

2. Merging dataframes

The UIC code is a unique ID for train stations. However, the column names differ between the two files, so it's necessary to specify the left_on and right_on arguments.

df = df_gares.merge(
    right=df_frequentation,
    left_on='Code UIC',
    right_on='Code UIC complet',
    how='inner')

3. Filtering

To avoid keeping small train stations, I chose to filter out stations with fewer than 1000 passengers in 2018. For visualization purposes, I added a column holding the square root of the number of passengers per station:

df = df[df['Total Voyageurs 2018'] > 1000]
df['Total Voyageurs 2018 sqrt'] = np.sqrt(df['Total Voyageurs 2018'])  # square-root column used later to size the markers

4. Adding a category column

Using pandas.cut, the data can be split into categories according to the total number of passengers. This will allow plotting each category with a different color.

df['category'] = pd.cut(df['Total Voyageurs 2018'], bins=[1e4, 1e5, 1e6, 1e7, np.inf])
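As a quick sanity check (not in the original post), we can count how many stations fall into each category:

df['category'].value_counts(sort=False)  # number of stations in each passenger band, in category order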

Visualization

Plotly is a handy tool for creating interactive graphs and plots that you can embed in other websites.

1. Scatter Mapbox

The data contain latitude and longitude: these will be used to plot the train stations on the map. The size of the bubbles depends on the square root of the number of passengers in 2018. A different trace is added for each of the categories defined above. Finally, the information shown on hover is defined using hovertemplate.

fig = go.Figure()
colors = plotly.colors.sequential.Viridis

for i, cat in enumerate(df.category.cat.categories):
    df_sub = df[df.category == cat]
    fig.add_trace(go.Scattermapbox(
        lat=df_sub['Latitude WGS84'],
        lon=df_sub['Longitude WGS84'],
        text=df_sub['Intitulé gare'],
        marker=dict(
            color=colors[2*i+1],
            size=np.sqrt(df_sub['Total Voyageurs 2018 sqrt']),
            sizemin=1,
            sizeref=15,
            sizemode='area',
            opacity=.8,
        ),
        meta=df_sub['Total Voyageurs 2018'],
        hovertemplate="%{text}" + "<br>" + "Passengers: %{meta}",
        name=f'> {cat.left:1.0e} passengers',
    ))

2. Layout

The last step is adding the background map, the title, margins around the plot, and the initial position & zoom.

fig.update_layout(
    mapbox_style="open-street-map",
    title='Passengers per french train station in 2018',
    margin={'l': 0, 'r': 0, 't': 50, 'b': 0},
    mapbox=dict(
        center={'lon': 2.39, 'lat': 47.09},
        zoom=4
    ),
)
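To embed the resulting interactive map in another page, the figure can also be exported as a standalone HTML file (a minimal sketch; the file name is illustrative):

fig.write_html("passengers-per-station.html")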

Link to the Jupyter notebook.