I Tried Getting My Data Out of Facebook Before Quitting. I Even Wrote Code. It Didn't Go Well.

Depending on which social circles you run in, you may believe that the exodus from social media has begun, or, on the contrary, that such an exodus is impossible. One can't deny, however, that the past year has seen a tidal shift in the way all of us perceive, value, and participate in online social networks.

From numerous privacy and propaganda scandals to the recent day-long downtime, Facebook has been giving people reasons to question, if not downright reject, being on its platform. The most recent revelation that the social networking company was storing the passwords to millions of Facebook and Instagram user accounts in clear text further eroded confidence in Facebook's security practices. Perhaps necessitated by the closure of its failed social platform Google+, Google has become the first major tech player to volunteer a path for users to take their data out of a social platform in an interoperable format. An entire movement known as the decentralized social web has emerged, and the World Wide Web's creator Tim Berners-Lee is advocating for a new approach to personal data ownership and interoperability called Solid.

Despite the growing landscape of alternatives to the dominant, closed social platforms (see part 3 of this series: Self-Hosted Personal Data is Key to Four Promising Facebook Alternatives), the problem remains: how can an average user actually leave these platforms with their data not only intact but also useful? For example, useful in an alternative social network?

Perhaps giving its users a false sense of security, Facebook advertises this so-called benefit under the "Download Your Information" section of its user settings menu. At the time this article was published, the text of that section clearly suggests (see screenshot below) that you can download your data and take it to another service:

"You can download a copy of your Facebook information at any time. You can download all of it at once, or you can select only the types of information and date ranges you want. You can choose to receive your information in an HTML format that is easy to view, or a JSON format, which could allow another service to more easily import it."

The capability to download a copy of your data and import it into another service is explicitly advertised in Facebook's user interface.

When it comes to Facebook, existing articles and "how to" guides on the subject primarily focus on the behavioral or cultural aspect of leaving, reassuring you that there is life after the endless scroll and omnipresent Like buttons. Unfortunately, this "quit and forget" approach isn't especially helpful for those of us who want to keep our data as if it were a scrapbook of memories, or who aren't ready to give up on social networks as a whole and simply wish to move elsewhere.

As with Google+, the various social networks are likely to give you a way to download your personal information as a final gesture before sunsetting your account. However, while it's one thing to be able to download your data for personal safekeeping or to store it in an alternative service that probably doesn't exist yet, it's an entirely different challenge to maintain the logical interconnectedness of that data as you experienced it as a user on Facebook, Twitter, or LinkedIn: in other words, the context behind, and relationships between, your photos, events, status updates, links, contacts, groups, and more. Among social network experts, this labyrinth of interconnected social data is referred to as your "social graph."

An example of the challenge that lies before us: if you're able to download a Facebook photo that's tagged with the names of friends who are in the photo, how can that social graph of data be preserved in a way that's useful in some other non-Facebook context (on your PC, smartphone, tablet, or in some other online service)?

At ProgrammableWeb, we wanted to figure out what it takes from a technical perspective to pack up and leave the most closed platform, Facebook, and take your social graph somewhere else. In other words, extract your most precious data (your friends, your photos, your history, etc.) in a way that's wholly interoperable with some other Facebook substitute (perhaps one of the emerging decentralized social networks). Or, for that matter, with anything else.

No Easy Way Out

In our attempt to successfully migrate our data out of a Facebook account to something else, we considered three possible solutions in order of obvious preference for the general consumer:

  1. Get our data from Facebook and then upload the downloaded files to a different social network or data store.
  2. Short of option 1, get our data from Facebook and hope there was a third-party application or cloud service that could migrate our data to a different social network or data store for us.
  3. Write our own code to migrate the data.

The third option was the least desirable: it would be the most work, and it would take the most resources to make it ready or useful to anyone outside of our team. Before going that route, we investigated what is already possible and what others have already built.

Every Facebook user has the option to request the data they've created from their account settings. As can be seen in the dropdown menu in the screenshot below, they can select either HTML or JSON as the output format, for any given date range, and with Low, Medium, or High media quality. Facebook says it can take up to three days to aggregate this data and email it to you as a ZIP file. In our tests, however, it took less than one. If you selected HTML as the output format, the ZIP file contains HTML files that you can open in your web browser, including an index.htm file that serves as a table of contents for the other files. These pages simply look like a much pared-down version of Facebook itself. The ZIP file also contains all of your images, which, along with the list of ad topics Facebook has assigned to you, are probably the only interesting or useful contents of the file.

image 02
When requesting a download of their data, Facebook users can choose between the HTML and JSON formats

The fundamental problem with the emailed ZIP file, whether HTML or JSON, is the lack of links or context. Not only is this file marginally useful for people who want to quit but not forget, it's entirely useless for anyone who wants to take their data elsewhere. With the HTML download, even something as simple as uploading those photos along with their descriptions to a Google Photos album is nigh on impossible. The following screenshot illustrates what sort of "data" the ZIP file of HTML files contains about your timeline activity:

image 03
While Facebook allows you to download your data, the downloads are missing some important context

If this is all you get, the loss of important context is breathtaking. What are 45 people interested in? Why is one item's content simply "Aaron and Desiree"? What has been shared to Boulder Hubbers? Where are the related links and pictures? This isn't even useful for a trip down memory lane, much less moving to another platform.

Alternatively, if you selected JSON as the output format, the ZIP file contains JSON files organized into several folders. There is no index.json or other table of contents, the JSON data is not linked together, and it contains no useful metadata such as privacy settings (terribly important context that is, however, available through Facebook's API). Take, for example, this comment item from my comments.json file, which contains an array of all the comments I have ever left on my own posts, on other people's posts, or in groups:

{
  "timestamp": 1553285055,
  "data": [
    {
      "comment": {
        "timestamp": 1553285055,
        "comment": "Ah! All of that beer on the left is so much more expensive in the States!",
        "author": "Shelby Switzer"
      }
    }
  ],
  "title": "Shelby Switzer commented on Ira Smith's photo."
}

There are no IDs or links in this JSON. I, therefore, have no straightforward way of finding the photo that I left this comment on. I could probably write some code to parse the friend's name out of the title, which would help me find the friend who posted the photo, and maybe I could use the timestamps to piece together context using other posts for reference. But the trail stops there. In the posts.json file, it's possible to piece together certain information such as others' comments on your own status updates because this information is presented as deeply nested JSON structures. Beyond that, links within the data or to the data's location on a Facebook webpage are missing.
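
To give a sense of what that piecing-together looks like in practice, here is a rough Ruby sketch. It is purely illustrative, and assumes the export's comments.json holds an array of items shaped like the one above; it simply parses the friend's name out of each comment's human-readable title:

require 'json'
require 'time'

# Illustrative sketch only: recover what little context comments.json offers
# by parsing the friend's name out of each item's human-readable title.
comments = JSON.parse(File.read('comments/comments.json'))

comments.each do |item|
  title   = item['title'].to_s
  # Titles look like "Shelby Switzer commented on Ira Smith's photo."
  friend  = title[/commented on (.+?)'s /, 1]
  comment = item.dig('data', 0, 'comment', 'comment')
  time    = Time.at(item['timestamp']).utc

  puts "#{time.iso8601} | #{friend || 'unknown'} | #{comment}"
end

The timestamps could then, in theory, be matched against posts.json to guess which photo or post each comment belongs to, but without IDs or links there is no way to be sure.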

If we look outside of Facebook to see what other tech giants are doing, we see similar offerings. Google's Takeout tool gives you the option to download your Google+ social data as CSV or JSON. With Twitter, users are unable to select an output format, and the only information Twitter provides about what will be sent to you is that it's the data "we believe is most relevant and useful to you." It turns out to be HTML data along with a CSV of all of your tweets.

image 04
Google's Takeout Tool

image 05
Twitter's Utility for downloading your data

Interestingly, both Google and Facebook, along with Twitter and Microsoft, announced last year an open partnership called the Data Transfer Project to support interoperability between platforms. The Data Transfer Project is still in its early stages: its architecture and implementation are still being defined, and the only functionality worked out so far for Facebook is related to exporting photos. This is no surprise, given Facebook's long history of data protectionism. Short of any US or European-backed government regulation that forces Facebook's hand, Facebook has no real interest in helping users leave or even make sense of the data they can download.

Apart from the JSON and HTML download options, the last remaining Facebook-supported possibility for extracting something remotely useful is through Facebook's Graph API (technically, a web scraper might work, but Facebook's Terms of Service prohibits their use). However, the API option by itself is entirely unsuitable for everyday Facebook users. Like all APIs, Facebook's Graph API is intended for use with your own custom-written software (or such an offering from a third-party software developer). And while the downloadable ZIP file contains some incredibly interesting metadata, the data that you'd really want to port over to another service is mainly accessible via this API.

We searched the web to see if any such tools already existed, and while we found some open source projects similar to what we were trying to achieve, they either haven't been updated in years (such as one WordPress plugin that imports some social media data into a single WordPress site), or they can't be used for Facebook data. Some companies and open source projects are beginning to surface to help with this problem, such as Cozy, an established open source personal data cloud. At the time of publication, however, Cozy did not integrate with Facebook. The Solid project, which also aims to help users store data in their own personal data clouds, known as Solid PODs, does not yet have any social data support. The only Solid tool available so far is a Google Takeout importer, which does not appear to be functional yet. Another service we found, Stream.io, appears to want to integrate with Facebook. But, similarly, it was not yet launched at the time of publication of this article.

image 06
An interesting but coy message on stream.io's home page

Ultimately, we couldn't use any of these tools to get our data out of Facebook and make it useable.

Left with no other choice, we decided to build our own.

Our Approach

First, we defined the minimum viable product: a command line tool that, using Facebook's Graph API, pulls a core set of data from Facebook, stores it locally in a useable format (JSON), and transforms that data into an interoperable format using currently available data standards. We decided that our core data set should include a user's profile information, posts, activity feed, albums, and photos.
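
In rough terms, the pipeline is fetch, persist the raw JSON, then transform. The sketch below illustrates only the first two stages. It is a simplified illustration rather than the actual Salvager source; the endpoints shown (me, me/posts, me/feed, me/albums, me/photos) are standard Graph API edges, and the version string in the URL reflects what was current at the time of writing:

require 'net/http'
require 'json'
require 'uri'

# Simplified illustration of the fetch-and-persist stages; not the actual
# Salvager source. Requires an access token from a Facebook developer account.
def graph_get(path, token)
  uri = URI("https://graph.facebook.com/v3.2/#{path}")
  uri.query = URI.encode_www_form(access_token: token)
  JSON.parse(Net::HTTP.get(uri))
end

token = ENV.fetch('FACEBOOK_ACCESS_TOKEN')

raw = {
  'profile' => graph_get('me', token),
  'posts'   => graph_get('me/posts', token),
  'feed'    => graph_get('me/feed', token),
  'albums'  => graph_get('me/albums', token),
  'photos'  => graph_get('me/photos', token)
}

# Keep an unmodified local copy before any transformation happens.
File.write('facebook_raw.json', JSON.pretty_generate(raw))

The third stage, converting each feed item into ActivityStreams 2.0, is sketched in the Interoperability section below.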

Stretch goals for the project included hosting it as a web app, pulling other data, launching a turnkey Facebook-like app that anyone can use, and connecting the app for data upload to alternative social platforms like Mastodon.

The resulting minimum viable product (MVP), called Salvager, is built in Ruby but can run in a Docker container. You, therefore, don't need to know or install Ruby to use it. All you need to use the app currently is a Facebook developer portal account and the ability to run Docker. We have open sourced the code under GPL 3.0, and you are free to download either the source code or the Docker container from the Salvager repo on ProgrammableWeb's GitHub (be sure to let us know about any forks!).

Interoperability

To make the data interoperable in as non-proprietary a way as possible, we looked to the W3C Social Web Protocols, developed and published by the W3C Social Web Working Group between June 2014 and December 2017. While the official working group is no longer active, a W3C community group has carried the work forward, and some of the new decentralized social network projects like Mastodon support these protocols in their APIs for the purpose of importing and exporting data. This gave us a glimmer of hope that we could export our data out of Facebook via API, translate it to the W3C format, and then import it somewhere else, like Mastodon.

Of the W3C's Social Web Protocols, the core protocol is ActivityPub, which is a client-server API protocol for social networks using the ActivityStreams 2.0 data format. ActivityStreams is essentially JSON-LD with a vocabulary for defining social activities. The LD in JSON-LD stands for "linked data," and this JSON format enables you to chain individual data items to each other in a lightweight, standards-based way. The result is a data graph like the aforementioned social graph. For example, within your social graph, your name is connected to a photo which in turn is connected to "Likes" which in turn are connected to the friends that clicked the like button on that photo and so on. JSON-LD, which is also a W3C standard, makes it easy for a machine to start with any data item and find the rest by traversing links, just as we might find related information by clicking around Facebook.

The most relevant components of an ActivityStreams object are the actor, the activity, and the object. The vocabulary defines actor types ("Person," "Organization," etc.), activity types, which are expressed as verbs ("Add," "Create," "Like," "Delete," etc.), and object types ("Note," "Video," "Event," etc.). The specification is very flexible: if you'd like to use a type that is not defined by the spec, such as an existing model defined by json-schema.org, you can link to that schema documentation in the object's @context property.

To illustrate how we can describe Facebook activities using ActivityStreams, take the following two JSON objects. The first is a status update from my Facebook timeline, retrieved from the Facebook Graph API. The second is that same status update, converted to ActivityStreams:

{
  "id": "10157170274089000_10157149090549111",
  "from": {
    "name": "Shelby Switzer",
    "id": "10157170274089000",
    "link": "https://www.facebook.com/app_scoped_user_id/MNNpZADpBWEdmN2Q5ZA1h"
  },
  "message": "Excited for spring!",
  "created_time": "2019-02-26T17:58:43+0000",
  "privacy": {
    "value": "ALL_FRIENDS",
    "description": "Your friends",
    "friends": "",
    "allow": "",
    "deny": ""
  },
  "permalink_url": "https://www.facebook.com/10157170274089000/posts/10157149090549111",
  "status_type": "mobile_status_update",
  "type": "status",
  "updated_time": "2019-02-28T02:04:02+0000"
}

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "summary": "Shelby Switzer shared a mobile status update",
  "type": "Create",
  "published": "2019-02-26T17:58:43+0000",
  "actor": {
    "id": "https://www.facebook.com/app_scoped_user_id/MNNpZADpBWEdmN2Q5ZA1h",
    "facebookID": "10157170274089000",
    "name": "Shelby Switzer"
  },
  "audience": {
    "name": "ALL_FRIENDS",
    "description": "Your friends"
  },
  "object": {
    "id": "https://www.facebook.com/10157170274089000/posts/10157149090549111",
    "facebookID": "10157170274089000_10157149090549111",
    "type": "Status",
    "content": "Excited for spring!",
    "updated": "2019-02-28T02:04:02+0000"
  }
}

When converting Facebook feed data to ActivityStreams, for every item in the feed, our Salvager project:

  • sets the Activity type to "Create",
  • generates a summary using the item's "from" property and its "status_type",
  • sets the "actor" from the item's "from" property,
  • creates an "audience" object using the item's privacy information,
  • creates an "object" with the item's content, such as the Facebook photo URL and caption, the status update text, or the link URL, and
  • sets the object's type to the item's type as defined by Facebook, which can be one of the following: "status", "photo", "link", or "video".

The ID properties all correspond to the related "link" property provided by Facebook, since ActivityPub prefers IDs to be URLs. I decided to persist the actual ID values as well, as the field "facebookID," because they may be useful later when generating new URLs or IDs.
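
Put together, that per-item mapping and the Collection wrapping might look roughly like the following in Ruby. This is an illustrative sketch with simplified field handling, not the exact Salvager code:

# Illustrative sketch of the feed-item mapping, not the exact Salvager code.
def to_activity_stream(item)
  {
    '@context'  => 'https://www.w3.org/ns/activitystreams',
    'summary'   => "#{item.dig('from', 'name')} shared a " \
                   "#{item['status_type'].to_s.tr('_', ' ')}",
    'type'      => 'Create',
    'published' => item['created_time'],
    'actor'     => {
      'id'         => item.dig('from', 'link'),
      'facebookID' => item.dig('from', 'id'),
      'name'       => item.dig('from', 'name')
    },
    'audience'  => {
      'name'        => item.dig('privacy', 'value'),
      'description' => item.dig('privacy', 'description')
    },
    'object'    => {
      'id'         => item['permalink_url'],
      'facebookID' => item['id'],
      'type'       => item['type'].to_s.capitalize, # "status" -> "Status", "photo" -> "Photo"
      'content'    => item['message']
    }
  }
end

# The transformed items are wrapped in a single ActivityStreams Collection.
def to_collection(items)
  {
    '@context'   => 'https://www.w3.org/ns/activitystreams',
    'type'       => 'Collection',
    'summary'    => 'Facebook Activity Feed',
    'totalItems' => items.length,
    'items'      => items.map { |item| to_activity_stream(item) }
  }
end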

The final output of the app's feed transformation is a JSON document that contains an ActivityStreams Collection object with the summary "Facebook Activity Feed" and an items array containing the transformed feed items.

Assumptions

Going into this, we had a few assumptions about how easy or difficult some parts of the process would be.

Assumption: Facebook will make it very difficult to get the data you're interested in.
Reality: True

We knew this was going to be difficult, especially getting data such as your friends' email addresses or phone numbers. The main value Facebook sells to users is the friend network, so naturally, it won't share that network readily. With the Graph API, you can't get any information about your friends: not their names, not their contact info, and not even the posts they've made on your wall unless those posts are public.

The only way you can get your current friend list is via the ZIP file discussed above, and even that list is just a list of names with no contact information. There is a workaround: provided an application like our Salvager app is approved by Facebook (ours wasn't), and all of your friends approve that same app, then the app can access their data. However, as can be seen from the screenshot below, Salvager is in an unapproved state. It has been in that limbo for weeks despite a clearly estimated resolution time of 5 days and a long-since lapsed promise directly from Facebook to address the request within 24 hours. Our assumption is that Facebook is balking at our application's main purpose: to help users leave Facebook and take their data with them. So, mass adoption by your friends is unlikely.


Weeks have elapsed since ProgrammableWeb's Facebook Data Salvager app was set to the 5 day status shown above

Assumption: Understanding and traversing the graph will be straightforward
Reality: False

Graphs are complicated and not easily discoverable via the REST API that Facebook makes available to public developers. The API that Facebook uses internally for this purpose, based on the very non-RESTful GraphQL, is not available to the public. Within Facebook's REST API, the relationship between resource models is not expressed in the data, and there are no links hinting at what related objects or fields might be available for you to query. And if straightforward documentation explaining how models relate to each other exists within Facebook's developer portal, we were not able to find it.

As some consolation, figuring out how to get the data I wanted from the API would have taken five times as long without Facebook's Graph Explorer, a developer tool that allows you to set different permissions and test API calls from the browser. The Explorer enabled me to get OAuth tokens for my real user account and gave me a UI for searching resource models and fields that I could add to my query. However, even the Explorer doesn't expose all functionality and fields, such as querying specific fields within the "from" property of a Post. Getting to the MVP of Salvager took some exploration using cURL on the command line as well.
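
For example, the Explorer wouldn't build a query for sub-fields of a post's "from" object, but the Graph API's field-expansion syntax supports requesting them directly. A minimal Ruby sketch of such a call, assuming a token generated via the Graph Explorer (the version string again reflects what was current at the time):

require 'net/http'
require 'json'
require 'uri'

# Request feed items and expand sub-fields of "from" in the same call,
# using the Graph API's field-expansion syntax: from{name,id,link}.
token  = ENV.fetch('FACEBOOK_ACCESS_TOKEN')
fields = 'id,message,created_time,updated_time,type,status_type,' \
         'permalink_url,privacy,from{name,id,link}'

uri = URI('https://graph.facebook.com/v3.2/me/feed')
uri.query = URI.encode_www_form(fields: fields, access_token: token)

feed = JSON.parse(Net::HTTP.get(uri))
feed.fetch('data', []).each { |item| puts item['permalink_url'] }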

Assumption: I might need to learn RDF to plug my data into the decentralized web.
Reality: False

When reading the Solid spec and realizing that the decentralized social web community heavily overlaps with the linked data and semantic web community, I began to fear that I'd have to spend a lot of time learning RDF or other semantic web protocols that I have very little familiarity with as a web developer.

Luckily I didn't have to dive down that rabbit hole. JSON-LD, which is technically a JSON-based serialization of the RDF data model (whose best-known serialization is XML-based), proved to be very accessible, and while ActivityPub's actor, activity (as a verb), and object concepts are reminiscent of RDF's Subject-Predicate-Object triple concept, they were much easier to grok and implement.

Challenges

Understanding permissions across the graph is hard

With OAuth access tokens, the developer must specify which permissions they want to request of a user, and Facebook has 36 different permissions you can request. As with the relationship between resource models, there is little explanation or discoverability of the relationship between permission types and resource models.

I still have no idea why some fields I requested were disallowed or "empty" (see screenshots below). The Graph Explorer in the dev portal doesn't give any information, and when attempting requests on the command line, I received OAuth errors letting me know I had requested a field I didn't have access to. But the error did not tell me which field it was.

This is especially frustrating when explicitly requesting over 20 fields across the graph in one call; in order to isolate the problematic fields, I had to make a call for each field individually, one by one.

image 07

image 08
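
That one-by-one isolation amounted to a short brute-force script, something like the simplified sketch below. The field list is abbreviated and illustrative, and the error parsing assumes the Graph API's standard error envelope:

require 'net/http'
require 'json'
require 'uri'

# Simplified sketch: request each field on its own to learn which ones the
# OAuth error is actually complaining about, since the error doesn't say.
FIELDS = %w[id message story created_time type status_type permalink_url
            privacy from picture description name link].freeze

token = ENV.fetch('FACEBOOK_ACCESS_TOKEN')

FIELDS.each do |field|
  uri = URI('https://graph.facebook.com/v3.2/me/feed')
  uri.query = URI.encode_www_form(fields: field, access_token: token)
  response = Net::HTTP.get_response(uri)

  if response.is_a?(Net::HTTPSuccess)
    puts "ok        #{field}"
  else
    message = JSON.parse(response.body).dig('error', 'message')
    puts "rejected  #{field} (#{message})"
  end
end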

Mapping Facebook activity to ActivityStreams activities is not one to one

Perhaps highlighting a significant deficiency in the ActivityStreams specification, activity types core to Facebook such as "post," "story," and "status" are not default object types defined by ActivityStreams, nor could I find appropriate schemas on json-schema.org to use instead. Therefore, when transforming Facebook activity feed items to ActivityStreams, I simply used Facebook's terms as Object types and carried most of the associated attributes over to the ActivityStreams object one-to-one. For example, a data item from Facebook representing a new photo might have these attributes:

{
  "id": "10157170274089000_10157131857649222",
  "from": {
    "name": "Shelby Switzer",
    "id": "10157170274089000",
    "link": "https://www.facebook.com/app_scoped_user_id/MNNpZADpBWEdmN2Q5ZA1h"
  },
  "name": "Shelby Switzer",
  "description": "Bad hair day",
  "picture": "https://scontent.xx.fbcdn.net/v/t1.0-0/p130x130/52486516_10157131857624000_6208825175731863222_n.jpg?_nc_cat=108&_nc_ht=scontent.xx&oh=1f2632c4192683ee97350fac059c786e&oe=5D0FE2BB",
  "permalink_url": "https://www.facebook.com/10157170274089000/posts/10157131857649222",
  "status_type": "added_photos",
  "type": "photo"
}

I've carried over Facebook-specific attributes such as name and description, and I've used the status_type, type, and from fields to represent the same information in the summary and object type in this ActivityStreams version of the data:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "summary": "Shelby Switzer shared an added photos update",
  "type": "Create",
  "object": {
    "id": "https://www.facebook.com/10157170274089000/posts/10157131857649222",
    "facebookID": "10157170274089000_10157131857649222",
    "type": "Photo",
    "name": "Shelby Switzer",
    "description": "Bad hair day",
    "picture": "https://scontent.xx.fbcdn.net/v/t1.0-0/p130x130/52486516_10157131857624000_6208825175731863222_n.jpg?_nc_cat=108&_nc_ht=scontent.xx&oh=1f2632c4192683ee97350fac059c786e&oe=5D0FE2BB"
  }
}

Privacy is not defined by ActivityPub

As discussed above, the good news is that Facebook attaches some privacy information to its objects, and that data can be retrieved via the API. The bad news is that the ActivityPub protocol expects every implementation to handle privacy in its own way and to respect the privacy or visibility settings of its users. So, very oddly, there's no provision in the ActivityStreams specification for describing privacy or importing the privacy data available through Facebook's API. The closest I could find in the ActivityStreams spec was the "audience" property, which I used to describe the Facebook privacy setting associated with each activity feed item. I mapped privacy values from Facebook directly to audience names in ActivityStreams: for example, if a post's privacy value from Facebook is "ALL_FRIENDS", then the corresponding ActivityStreams activity has an audience object with the name "ALL_FRIENDS." Given the sensitivity around privacy issues today, I would have felt extremely uncomfortable building a data tool that didn't persist that information in the final result in some way.

Next Steps

After three weeks of wrangling the Facebook Graph API and deciphering decentralized social web protocols, we have built a prototype application that pulls your Facebook data from the Graph API, saves it as raw JSON, and converts it into flat files of ActivityStreams 2.0 JSON. We've discovered that Facebook imposes very real constraints on the data you can access, from the obfuscation of permissions and data relationships, intentional or not, to limiting access to your friends' information. If Facebook is truly committed to the Data Transfer Project, then perhaps we'll see some loosening of these restrictions. Until then, if we want to include posts from friends in our data salvaging, we'll have to build an app that Facebook will approve so that friends can grant those permissions.

From an interoperability perspective, there remain several challenges with our use of ActivityStreams and related protocols. There are some data points from Facebook that we haven't tried to transform yet, such as location and event data. We had to make assumptions and take liberties with attribute mapping, including critical privacy information. Applying the ActivityStreams vocabulary to Facebook models such as Post, Status, or Story was difficult. ActivityStreams' "Note" object type seems to be the closest option for any of these but doesn't feel like a natural replacement. It's possible that more widely recognized schema specifications exist that are better suited to describing these models, but we haven't found them yet. We may need to create them ourselves in order to take this project further.

Now that we have the data in a standards-based format, the next step is to plug it into another platform. In our next article, we dive into what the social web landscape currently looks like and whether any viable Facebook alternatives exist that we can migrate our social graph to, preferably by way of API.

Be sure to read the next Social article: Self-Hosted Personal Data is Key to Four Promising Facebook Alternatives

 

Comments (1)

donaldbmcintosh

Really enjoyed the article. It touches on many themes that I, and many others, are concerned about. I am a huge fan of the concept of a private social graph that is a living, breathing scrapbook of your life, experiences, and interactions with others.

Some time ago I wrote my own software to allow me to build my own graph and render it as a website where people can explore my graph. Nodes are secured, incidentally, so I can control who sees what.

I never joined Facebook, so I don't have a legacy graph to import. There is so much more I want to do and add... But here is the current state of play. The application is called triki - a (semantic) triples wiki.

https://www.donaldmcintosh.net/triki

It runs my website.