Streaming Salesforce Objects into Google BigQuery

My company uses a variety of enterprise-level tools, and one of the most central of them all is Salesforce. Our analytics stack centers around BigQuery, and we use Fivetran, an excellent integration service, to pipe our Salesforce data into BigQuery. Our problem is that Fivetran is a batch ETL, though, so there is a delay (often of hours) before our salespeople can see things reflected in our analytics tool.

Thus, our salespeople (the most important people in the company) don’t use our analytics tool, and just use Salesforce reporting, which means they miss out on a ton of enriched analysis we could be serving them.

So, my mission for the last few weeks has been to stream Salesforce into BigQuery. This is how I did it.

Picking a method

Salesforce is large and complicated, and over time has added a few possible methods for data replication.

The one that seemed most obvious when I started was something called PushTopics, which allow you to define a SOQL query centered around an object you are interested in streaming. If something changes in that object, the PushTopic will push a payload to your waiting CometD client (setting this up involved a whole separate blog post of discovery), which you can then relay to BigQuery.

The problem with PushTopics, for me, were twofold. First, you cannot stream LongText-type fields. We use those, so that’s a problem. Second, there is a limit to the length of your query, and SOQL will not allow Select * statements, so you have to list out all columns. Sure, you can set up multiple PushTopics for a single object and glue the columns together later in Python, but this is kind of insane.

My next best guess was platform messages. Platform messages take you into Apex, Salesforce’s Java-inspired programming language, which is not something I necessarily wanted to do. However, in exchange for this, you get to send pretty arbitrary payloads, and you lose the restriction on query size. All you have to do is set up something called an Apex Trigger, have it focus on the object you want to stream, and have that trigger send a message to the platform message bus, which would hit my eager CometD client.

Sounds pretty good, and once I got it working, I thought this might be our future. I wasn’t happy about having to use CometD, which requires I have some sort of server set up and running at all times, a massive potential point of failure, but life has hardships. I was about to turn my attention downstream when I saw something in the Salesforce Apex Trigger documentation that made me do a double take, and ended up being our ultimate solution.

That solution is callouts. Callouts are arbitrary HTTP requests – bypassing message buses and CometD clients completely. I set up a Google Cloud Function webhook endpoint that would save payloads to Google Cloud Storage and then I started experimenting. I was happy to find that, although I will never quite understand Apex datatypes, the results were fairly simple.

From what I can tell, the only real downside of callouts is a lack of replay ability. This would bother me more if it weren’t for two important factors:

  1. First, I’m using Google PubSub to intake messages, and uptime is virtually 100% (which is much higher than if I were running a linux cloud server myself).
  2. Second, we are keeping our Fivetran batch ETL integration, so I can occasionally combine and de-dupe two parallel versions of the same table (the streaming version and the batch version) in case any messages were missed.

So I overall feel pretty good about escaping arbitrary restrictions placed on us by Salesforce’s other solutions and going with this more bake-it-yourself strategy.


It’s nice that there aren’t many moving parts here – the basic recipe is an Apex Trigger (that sets the whole thing in motion when a data change occurs), an Apex “CalloutPubSub” class (to handle callouts to my Google PubSub topic’s endpoint), and a few entries in the Remote Site Settings page.

My prototype object for streaming has been Account, so everything here will center around that, but these lessons should be broadly applicable to other objects.

Apex Trigger

Apex Triggers can be defined in the developer console (click on your name in the top-right of Salesforce and, if you have the right permissions, you should see it in that menu). Once you’re in the developer console, go to File -> New -> Apex Trigger.

This is my first time writing in Apex so I’m sure this is awful, but my trigger is as follows:

trigger Account_message on Account (after insert, after update, before delete) {
 String query = 'SELECT Id,IsDeleted,MasterRecordId,Name,Type,RecordTypeId,ParentId,BillingStreet,BillingCity,BillingState,BillingPostalCode,BillingCountry,BillingStateCode,BillingCountryCode,BillingLatitude,BillingLongitude,BillingGeocodeAccuracy,ShippingStreet,ShippingCity,ShippingState,ShippingPostalCode,ShippingCountry,ShippingStateCode,ShippingCountryCode,ShippingLatitude,ShippingLongitude,ShippingGeocodeAccuracy,Phone,Fax,AccountNumber,Website,PhotoUrl,Sic,Industry,AnnualRevenue,NumberOfEmployees,Ownership,TickerSymbol,Description,Rating,Site,CurrencyIsoCode,OwnerId,CreatedDate,CreatedById,LastModifiedDate,LastModifiedById,SystemModstamp,LastActivityDate,LastViewedDate,LastReferencedDate,IsPartner,Jigsaw,JigsawCompanyId,AccountSource,SicDesc,AVA_SFCORE__Billing_Last_Validated__c,AVA_SFCORE__ExemptEntityType__c,AVA_SFCORE__Shipping_Last_Validated__c,Bubble_Up_By__c,Bubble_Up_Date__c,Business_Description__c,Completed_Activity_Count__c,Customer_ID__c,Do_Not_Sell_Reason__c,Do_Not_Sell__c,Escalated_By__c,Facebook__c,Google__c,Last_Activity_Date__c,Last_Activity_Notes__c,Last_Activity_Type__c,Last_Lead_Source_Name__c,Lead_Dev_Locked__c,Lead_Source_Name__c,Lead_Source__c,LinkedIn__c,Marketo_DB_URL__c,Marketo_ID__c,Most_Recent_Opportunity__c,Most_Recent_Profile__c,NetSuite_Customer_ID__c,NetSuite_Partner_ID__c,New_Lead_Date__c,Owner_Priority__c,Partner_Affiliate_Code__c,Partner_Affiliate_Password__c,Partner_Code__c,Partner_Commission_Percent__c,Partner_ID__c,Partner_Lead__c,Partner_Program__c,Partner_Progression_Status__c,Partner_Sales_Owner__c,Partner_Sales_Territory__c,Partner_Start_Date__c,Partner__c,Pinterest__c,Primary_Contact__c,Primary_IS_App__c,Qualification_Activity_Count__c,Qualification_Status__c,Reason__c,Sales_Rep__c,Skype__c,Twitter__c,Unique_Key_Code__c,You_Tube__c,ByPass_Edit_Validation__c,Customer_Referral_Date_From_MA__c,Lead_Reason__c,Make_Callout_Insert__c,Make_Callout_Update__c,NetSuite_Prospect_ID__c,Partner_Affiliate_Code_From_MA__c,Partner_Last_Referred_Date__c,Partner_Referred_Date_from_MA__c,Referral_Type__c,Referring_Customer_App_Name__c,Referring_Customer_Email__c,Referring_Customer_Name__c,Surpress_Callout__c,SyncToNetSuite__c,TriggerAccountUpdate__c,Partner_Business_Type__c,Referral_Notes__c,Purchased_Lead_Notes__c,Marketing_Qualified__c,Most_Recent_Referral_for_Owner__c,Default_Partner_Campaign__c,Parent_Primary_Contact__c,Partner_Type__c,Contract_App_name__c,Contract_App_status__c,Date_customer_signed_up__c,Edition_of_Infusionsoft_app__c,Most_Recent_Asset__c,Partner_Recruiter__c,Partner_Sales_YTD__c,NAICS_Code__c,Campaign__c,Partner_from_MA__c,Tracking_Link__c,Event_Pricing__c,Name_of_Event__c,Speaker__c,Time_Zone__c,Total_Assets__c,AVA_SFCORE__Billing_County__c,AVA_SFCORE__Shipping_County__c,AdvocateHub__Referral_Source__c,Referral_ID__c,Annual_Revenue_bp__c,Business_Description_bp__c,Business_Type_bp__c,Channels_Points_of_Sale_bp__c,How_was_list_built_bp__c,Industry_bp__c,Key_Products_Services_bp__c,Marketing_Automation_System_bp__c,Marketing_List_Size_bp__c,Monthly_Marketing_Budget_bp__c,Stage_Number_Value_bp__c,Sub_Industry_bp__c,Who_Manages_Website_bp__c,Years_in_Business_bp__c,of_Employees_bp__c,CRM_System__c,Challenges_Needs__c,Ecommerce_System__c,Recalculate_Sharing__c,Link_Posted_By__c,MBR_Has_Active_Contract__c,MBR_Is_Partner_Customer__c,MBR_Most_Recent_App_Edition__c,MBR_Most_Recent_Contract_Start__c,MBR_Most_Recent_Kickstart_Completion__c,MBR_Not_A_Free_App__c,Partners_Success_Coach__c,Value_Added_Reseller_Track_Level__c,Service_Track_Level__c,App_Developer_Track_Level__c,Original_App__c,Aggregate_CHS__c,Partner_Service_CHS__c,Partners1_ID__c,partner_overview_record__c,NAICS_Name__c,SIC_Code__c,SIC_Name__c,UniqueEntry__Account_Dupes_Ignored__c,rrpu__Alert_Message__c,Datanyze_Address__c,Datanyze_Alexa_Rank__c,Datanyze_City__c,Datanyze_Country__c,Datanyze_Description__c,Datanyze_Employees__c,Datanyze_Funding__c,Datanyze_Industry__c,Datanyze_Monthly_Tech_Spend__c,Datanyze_Phone__c,Datanyze_Public_Text__c,Datanyze_Revenue_Text__c,Datanyze_State__c,Datanyze_Tags__c,Datanyze_Twitter__c,Datanyze_Year_Founded__c,Datanyze_Zip__c,Datanyze_Technologies__c,Datanyze_Technology_History__c,Datanyze_Predict_Score__c,Datanyze_Predict_Rating__c,Customer_Success_Guide__c,Datanyze_Sync__c,Wepay_Discussed__c,Partner_Serviced_Customer__c,Service_Partner__c FROM Account where id in (';
 String endpoint = '';
 Map<String, Object> data_list = new Map<String, Object>();
 Map<String, String> metadata = new Map<String, String>();
 if (Trigger.isDelete) {
 List<Account> list_of_changes = Trigger.old;
 List<String> list_of_changed_ids = new List<String>();
 for (Account o : list_of_changes) {
 list_of_changed_ids.add('\'' + String.valueOf(o.get('Id')) + '\'');
 query = query + String.join(list_of_changed_ids, ', ') + ')';
 List<sObject> new_data_list = Database.query(query);
 metadata.put('Type', 'Delete');
 data_list.put('metadata', metadata);
 data_list.put('payload', new_data_list);
 CalloutPubSubClass.makeCallout(JSON.serialize(data_list), endpoint);
 } else {
 // Create an instance of the event and store it in the payloadUpdate variable
 List<Account> list_of_changes =;
 List<String> list_of_changed_ids = new List<String>();
 for (Account o : list_of_changes) {
 list_of_changed_ids.add('\'' + String.valueOf(o.get('Id')) + '\'');
 query = query + String.join(list_of_changed_ids, ', ') + ')';
 List<sObject> new_data_list = Database.query(query);
 metadata.put('Type', 'Update');
 data_list.put('metadata', metadata);
 data_list.put('payload', new_data_list);
 CalloutPubSubClass.makeCallout(JSON.serialize(data_list), endpoint);

The script here is split into delete and update/insert actions, with update and insert being treated the same. Notice in the trigger declaration I use after insert and after update but I use before delete – this means that I want to run this trigger after insertions or updates (which makes sense – want the new stuff) but before a deletion occurs. This doesn’t matter terribly much since the Trigger object has Trigger.old and attributes, but I like to keep everything aligned regardless.

Another thing to notice is my extremely specific query. This came from Fivetran – in our salesforce BigQuery dataset where Fivetran loads data, there is a salesforce.fivetran_query table that lists the specific query that Fivetran runs when getting data from the Salesforce API. My job is to replicate Salesforce’s work, so I’m running the exact same query, and then using a loop and string concatenation to tack on IDs that have changed in the WHERE clause of the query.

Beyond that, things are fairly simple. I gather all of my changed IDs, add them to my query, run said query, and then take that data and put it into a payload. I add some metadata so that I know downstream if this is a delete or update/insert (which I just label as an update) and then I send it to the CalloutPubSubClass, which I explain next.

Apex CalloutPubSub Class

I edit my Apex Classes outside of Developer Console, for some reason – I think some guide sent me this direction? Anyways, in Salesforce I click on Setup, and then type in the Quick Find box “apex class” and it comes up. But yeah, you can also do this in Developer Console.

My garbage can of Apex code:

public class CalloutPubSubClass {
    public static String getAccessToken() {
        String url = '';
        String client_id =<REDACTED>;
        String client_secret = <REDACTED>;
        String refresh_token =<REDACTED>;
        String body = 'client_id=' + client_id + '&client_secret=' + client_secret + '&refresh_token=' + refresh_token + '&grant_type=refresh_token';
        HttpRequest req = new HttpRequest();
        Http http = new Http();
        HTTPResponse response = http.send(req);
        String token_response = response.getBody();
        Map<String, String> token_json = (Map<String, String>)JSON.deserialize(token_response, Map<String,String>.class);
        String access_token = token_json.get('access_token');
        return access_token;

    public static void makeCallout(String payload, String endpoint_raw) {
        String access_token = getAccessToken();
        Blob myBlob = Blob.valueof(payload);
        String payload64 = EncodingUtil.base64Encode(myBlob);
        Map<String, Object> payload_json = new Map<String, Object>();
        Set<Object> messages = new Set<Object>();
        Map<String, String> data = new Map<String, String>();
        data.put('data', payload64);
        payload_json.put('messages', messages);
        String endpoint = endpoint_raw + access_token;
        HttpRequest request = new HttpRequest();
        request.setHeader('Content-Type', 'application/json;charset=UTF-8');
        if (!Test.isRunningTest()) {
            HttpResponse response = new HTTP().send(request);

Much of what goes on here is specific to dealing with Google PubSub – for instance, the whole getAccessToken() method is part of that. If you’re trying to do that, then this code block will be exceptionally useful, but if you’re not then just ignore it.

One line, @future(callout=true), is exceptionally important – I think? I think it makes this function asynchronous, which is great for situations in which a bulk change occurs and the trigger chunks it into 50 different payloads. A nice man on the internet told me to put it there so I did.

Also note the base64 encoding step – this is not necessary for most people but will be necessary for sending payloads to Google PubSub. It took me approximately fifty tries to get this right, since Google somehow turns the biggest collection of engineering talent on earth into the worst collection of engineering documentation on earth, but whatever. I wouldn’t hold a grudge.

None of this will work if you implement, though, because there is one more step that I forgot every time I had to change endpoints.

Remote Site Settings

Salesforce tries to maintain a pretty secure environment, and part of that is not letting Apex code within Salesforce make random callouts with who-knows-what kind of data leaving the site. So, you need to whitelist your endpoint, and also (if applicable) whitelist the endpoint for obtaining your access token. Just go to Setup and type “remote site” into the QuickFind box and you’ll see the page you need. I left the Disable Protocol Security box unchecked because I would reliably be using HTTPS.

Further down the line

The majority point of writing this is to have something to show our Salesforce Admins to convince them to take my code out of sandbox and put it into production, so what happens downstream isn’t a big concern. I’ll give a broad sketch and save the details for another post.

Data is chunked into 200 records at a time by the trigger, then put through our specific query, and then sent to PubSub. Thankfully, Salesforce does not include nulls in their query results, so the payload sizes are very manageable and easy to process.

My PubSub topic’s only subscriber right now is a Google Cloud Function, which takes the payload and determines if it is a delete or update object. If it is delete, then it changes last_modified to the current UTC time and changes is_deleted to True, then streams it into BigQuery. If it is an update then it can skip those transforms and go straight to BQ. A copy of the original payload is also sent to Google Cloud Storage, which I may turn off once everything is stable.

And that’s it! I can’t believe that took me two weeks, but at least it’s done now. I’ll probably be making posts about CometD and my Google Cloud Project pipeline for this soon.

Taking notes while debugging and The Fifteen Minute Rule™

I recently had an issue at work that seemed simple but took me two days to solve. The problem was a before-and-after discrepancy in a Looker dashboard whose historical values had changed after doing some migrations on our data warehouse.

Our Looker instance works off of a waterfall of derived tables, and this Looker dashboard was part of this. The particular look that was causing problems pulled data from a table called revenue_retention_and_contraction, which in turn pulled from several antecedent tables. I took the logic for revenue_retention_and_contraction‘s pre-migration and post-migration versions, and changed table references in the post-migration version to pre-migration table references until I got the two to align. This would tell me what the problematic antecedent table was, and then I could move on to that table and repeat this process.

This sounds simple, and it likely should have been, but I completely failed to keep track of my steps as I worked. It’s tremendously tempting to want to just jump in and start hacking away when a problem with real business consequences arises, and sometimes this is the proper way to go – problems that can be resolved in fifteen minutes or so, for instance, hardly need notes.

Past the fifteen-minute-barrier, though, one’s connection to reality starts to become tenuous. I couldn’t remember what I had tried and hadn’t tried and I couldn’t keep my dozen BigQuery tabs straight. I felt crushed after my first day of troubleshooting.

The next morning, I opened Notes on my Mac and laid out the situation. I wrote out each piece of knowledge I had developed, even pasting in the hyperlinks to the queries I ran to establish my knowledge. It was simple, but it improved my focus on the one thing that ended up mattering: two missing columns in an obscure (and gigantic) table called netsuite_transactions__base_with_prepay_spread. I found the row count identical in the pre- and post-migration tables, but using a Jupyter notebook to pull samples from each table through the BigQuery API, I noticed that the pre-migration version had two more columns. These columns, it turned out, produced a small fan-out that affected our aggregates and thus were the cause of our problems.

From now on, I’m operating on a Fifteen Minute Rule: if I can’t fix an issue in fifteen minutes, I’ll open up a fresh Note and jot down what I have learned so far. Even if it is a problem that I might be able to solve in twenty minutes, reiterating to myself what I have learned is invaluable for understanding the context of a problem and knowing for sure that I have quashed our bug.

Cryptocurrencies and the fundamentals

I bet it’s awkward taking this sign down every time transaction fees spike.

All currencies have inherent value, including fiats (ability to pay taxes) and gold (industrial and aesthetic applications). Bitcoin changed the world with its inherent values of anonymity, speed, and accessibility. However, it now lags behind its competitors in these features, and unlike gold, bitcoin has no other practical use.

Understanding why this is true, and why it matters, requires a review of some fundamentals.

Why Money?

Money exists to alleviate the problem of needing a ‘coincidence of wants’ in order to exchange – a fundamental limitation of barter systems. For example, if a programmer wanted to purchase a tank of gas, in the absence of money, they would need to find a gas station that needed programming services so that they could barter a tank of gas’ worth of programming. It is very costly for the programmer to run into every gas station he sees to ask if they need programming services, and fairly unlikely that he will find a station that needs it.

With money, however, the programmer can provide programming services for a totally unrelated company, which then pays him money, which he can then give to the gas station in exchange for gas. There is no need for a coincidence of wants, so he is enabled to make these more complicated exchanges that undergird modern market economies.

Money is also a store of value, meaning it can transfer value across time. In a barter system, our example programmer would need to provide his programming services at the same time he received gas – otherwise, the neither party in the transaction would have a way of trusting that the other party would come through on their end of the bargain. This store of value can be for a few days, or for fifty years – money allows the programmer to save for retirement, ‘cashing in’ during his later years the value that he created earlier in his life.

Money as a Product

People like to say that fiat currencies have no inherent value, and that their worth is just a “shared illusion”, but this is completely wrong. The inherent value of fiat currencies is that they allow you to pay taxes, something you must do if you do not want to go to prison. So, fiat currencies save you from going to prison. This makes fiat currencies necessary for every person to acquire, which in turn makes them a convenient unit for all exchanges (not just taxation) within a country.

With such advantages, it’s no wonder that currency competition before cryptos was minimal! Nobody producing their own “money” could possibly compete with the advantages of dealing in dollars.

Historically, gold has had similar inherent value. Its rarity, resistance to corrosion, and fairly simple testability of its purity made it a useful way for untrusting trade partners (i.e. states) to trade with one another. Each state produced their own gold coins for international trade, which in turn were often used for taxation within the state. Again, if you do not pay your taxes, you will go to prison, and prisons in the year 1200 were even less pleasant than today. This was the inherent value of gold for states and for commoners.

Gold’s exchange value has dropped dramatically as governments have moved to dealing in fiat currencies of their own creation. Other inherent values of the metal, mainly industrial in nature, give it value, along with a popular affinity for gold as a material in jewelry and other fine products. This can make gold a useful product for investment, or a ‘store of value.’ However, gold is no longer a currency, as it is not a useful product in that way.

Bitcoin’s Value

Bitcoin launched in 2009 with a unique sales pitch: a currency that would allow instant transactions outside of banks and financial institutions, with minimal fees, and high anonymity.

It’s easy to see the inherent value to all of this. If one was a drug dealer or other anti-state actor, bitcoin provided you with the ability to exchange value over digital mediums. Previously, criminal enterprises, which are locked out of the financial system, had to perform large transactions with extremely risky and inconvenient physical cash transfers. Now, one could simply send bitcoin to a trading partner in a matter of moments. Online crime markets that centered around bitcoin sprang up quickly.

Likewise, other anti-state actors could perform transactions. This included persons living in repressive regimes, those dealing with border currency controls, persons seeking to avoid taxes, or those who are generally concerned about financial privacy.

Again, these are real and inherent benefits that gave bitcoin very meaningful value. Explaining this value did not require talking about the price trajectory of bitcoin or what financial institution might start trading in it. These are on-the-ground realities that made bitcoin useful.

Alternative Coins

Today, there are many competitors to bitcoin, all of which share bitcoin’s benefits. In fact, almost every cryptocurrency has, at minimum, bitcoin’s fundamental features.

Of course, network effects are strong, so it is not good enough to just be as good as bitcoin in fundamental features. If a cryptocurrency does not provide significant benefits over bitcoin, then the strength of bitcoin’s network – the development infrastructure, the miners, the organizations which accept bitcoin, etc – will overshadow the newcomer. Toppling bitcoin requires a currency that is significantly better in fundamental features than bitcoin.

An example of such a challenger is dash, which was launched in 2014 and offers significant fundamental benefits that bitcoin does not.

One of these benefits is completely anonymous transactions. Bitcoin’s transactions are publicly viewable through the blockchain, something that even an amateur cryptocurrency enthusiast like myself can view through easy-to-use websites. Dash’s completely anonymous and invisible transactions provide fundamental value to anti-state actors, a major market for cryptocurrencies – regardless of one’s feelings about these users.

Another benefit is the ability to send dash instantly. Bitcoin’s transaction times have slowed to a crawl in recent years, sometimes taking as long as an hour to clear a single transaction. This significantly reduces the exchange value of bitcoin. Furthermore, transaction fees have been on the rise, something that isn’t important to million-dollar drug deals but is crucial to using bitcoin to buy a cup of coffee. The only reason a coffee shop, which must pay taxes in dollars, will realistically accept cryptocurrencies is to avoid the high fees of credit and debit cards while retaining the digital advantages both provide. Once the cryptocurrency itself starts charging a transaction fee, however, accepting it becomes truly pointless.

Network effects will keep competitors such as dash at the perimeter for the time being, but fundamental advantages can’t be argued away by orthodoxy for long. Over time, more anti-state actors will begin using more highly-anonymized and rapidly-transacting currencies like dash, as it protects their businesses and their lives. As more economic actors are equipped with dash, it will make more sense to accept it for smaller and smaller transactions, from property purchases, to cars, to the eventual cups of coffee. The fast transaction times and low transaction fees will ensure this gradual acceptance for smaller and smaller transactions.

Bitcoin’s future

So, is bitcoin then the “gold of cryptocurrencies”? Will it remain as a legacy product, providing a store of value for investors who want a safe harbor?

Probably not.

Unlike gold, there are no industrial applications for bitcoin. Bitcoins are not resistant to acids or especially useful for conducting electricity. The code that makes up bitcoin’s blockchain is not specifically more resistant to physical corrosion than any other code.

Likewise, bitcoin has no (apparent) aesthetic value. Nobody one hundred years from now will be wearing jewelry made out of bitcoin, and they won’t be eating from bitcoin-gilded silverware and plates.

Gold is valuable because of its inherent properties and its aesthetic beauty. Its major inherent property was displaced when governments changed their preference for taxable currencies, but its other inherent properties remain.

Bitcoin was able to best fiat currencies in many ways, a technological revolution in money that was extremely exciting to witness. However, technology is a double-edged sword, and it is now cutting against bitcoin in the same way bitcoin cut against fiats. Everything bitcoin has done, Dash can do better, and what’s worse, there are no remaining inherent properties for bitcoin.

The mania surrounding bitcoin at the moment cannot last, because the fundamentals are not there. Articles in the Wall Street Journal and acceptance from the Chicago Mercantile Exchange are not fundamentals, and they will not affect, in the long-rung, the currency that people on the ground use in order to exchange. From illegal arms deals to coffee, dash beats bitcoin in every way.

Cleaning up garbage files in an AWS S3 bucket

I did not ask for this.

I manage a very large project called State RegData for my employer, the Mercatus Center. This will eventually involve one hundred repositories, two for each state, and a ginormous amount of data. We back up the data in an s3 bucket, and sometimes the sync process with the AWS CLI is… slow. I accidentally messed up one of these very slow syncs, and since then my bucket has been 99% of what you see above – in addition to the 13 neat directories that actually should be in there.

I’ve been putting off cleaning this up until now. I found some pretty good documentation and an even better blog post to guide me.

The main problem here is that the AWS CLI doesn’t allow for wildcards in file paths, so you can’t just say aws s3 rm s3://state-regdata/s3-state*. However, the --include and --exclude flags allow you to exclude everything (--exclude '*') and then include what you want (--include 's3-state*'), which allows you to do the same exact things.

At first I ran the command aws s3 rm s3://state-regdata --exclude '*' --include 's3-state*' and my command line sat there for a while, pretending to work, and then finished. Turns out, it did nothing, so I compared notes with the blog post linked above and noticed that he ended his bucket address with a / and used the --recursive flag.

I tried that and after streaming through 600,000 garbage file names, I am back in business.

Do guns kill people? Or do gun owners?

I’ve worked hard to get this fat, pal.

So I thought this image was interesting because of one of its key comparisons is actually quantifiable and comparable. Unfortunately, I can’t tell you how many misspelled words pencils cause, or how many pounds of fat we can blame spoons for, but I can definitely talk about what cars are costing us in terms of human life.

In 2015, roughly 38,000 Americans died in motor vehicle accidents (s). Meanwhile in that year, there were 264 million registered cars in the US (s). That means there was one death for every 6,950 cars.

The ratio for guns is actually not so different. In 2013 (the most recent year I could find) there were roughly 300 million guns in the US (s). Meanwhile, there were 33,636 gun deaths (s). That means there was one death for every 8,920 guns.

So the average car is about 1.3 times as deadly as the average gun. I would have thought car danger would have blown gun danger out of the water, given how useful cars are and how sort-of-not-useful guns are (spoken like a city person).

This is also a bit surprising given how concentrated gun ownership is, which makes me wonder who is more dangerous, the average car owner, or the average gun owner?

In 2013, only 22 percent of US adults owned guns (s), and since there were 243 million adults in the US (s), this means that there were 53.5 million gun owners.

Cars seem much more widely distributed. While it’s tough to find information on the number of car owners, we can look at the number of licensed drivers, which stood at 218 million in 2015 (s).

Comparing these numbers to the death numbers given above, we can see that there was one car death per every 5.7 million drivers. For every gun death, meanwhile, there is only 1.4 million gun owners.

The average gun owner is four times as dangerous as the average driver!

tf-idf vectorizer fit and transform

I experienced a bit of a breakthrough messing around with the scikit-learn’s tf-idf vectorizer under the guidance of my Springboard mentor, Hobson Lane. In this example I am working with data that I scraped from the Mercatus Center website. I originally scraped a corpus of over 5000 documents containing all published output, excluding charts, including multi-label duplicates. However, in this example I worked with a smaller subset of 27 documents to avoid having to wait 10 minutes for something to finish running.

Repository in case you are interested in running this yourself.

Setting up

The first cell of my jupyter notebook is imports and setting matplotlib inline:

from sklearn.datasets import load_files
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

%matplotlib inline

Then, I used the load_files function to load my multi-label corpus, which is just files placed in directories whose names are the label names:

trainer = load_files('data_short')

You can use the following two attributes of trainer to get an idea of what you have:


Finally, we should initialize a dataframe and add a few useful columns to it.

df = pd.DataFrame()
df['filename'] = trainer.filenames
df['text'] = [open(f, 'r').read() for f in trainer.filenames]
df['label'] = [trainer.target_names[x] for x in]
df['author'] = [x.split('__')[0].split('--')[0].split('/')[-1] for x in df['filename']]

Fitting the tf-idf vectorizer to your data

Term frequency – inverse document frequency, or tf-idf, is a neat little statistic that tells you how important a particular word is for a document, relative to all the other documents in a corpus. As the name suggests, the formula is approximately the number of times the word appears in the document, divided by the number of times it appears in the entire corpus (the real math is here but you don’t need it). Words like “and” will clearly have very low tf-idf scores, then, while a word like “finance” will have a very high tf-idf score in a finance article if the rest of the corpus is about cooking. This is nice, as it is a quantified way of telling us that there is something very special about this one article’s topic, compared with the rest of the articles!

The first order of business is to initialize a tf-idf vectorizer, which we can then use to attack our corpus.

tfidf = TfidfVectorizer(min_df=3, max_df=0.9)

min_df indicates the minimum number of documents a word must be in to count – this is a way to avoid counting proper nouns and other words that do not tell us much about a document’s topic. max_df indicates the maximum proportion of documents a word can be in before it counts – this is how we avoid counting words like “and”, which would slow down our processing times and contribute little to understanding individual documents.

Now we are ready to “fit” the vectorizer to our list of document texts, as stored in df['text']. Fitting is the process of converting every word in the corpus to a number, and storing these word-number pairs in a Python dictionary.

tfidf =['text'])

This was a bit of a breakthrough for me, as I never understood exactly what “fit” meant, despite probably reading a definition a dozen times. You can see your new dictionary by typing in the following:


Pretty cool stuff!

Using tf-idf transform

My goal here is simply educational, so I haven’t done anything too fancy yet. At the advice of my Springboard mentor, I created some “topic” lists with some key terms that could define each topic, then calculated each document’s cumulative tf-idf score for those topic words. You can then put these list of scores in the dataframe we created and do some cool stuff!

First, creating the topic lists:

finance_words = ['finance', 'economics', 'financial', 'stocks', 'dollars', 'equity', 'bond', 'bonds', 'commodity']
tech_words = ['technology','tech','machine','computer','software','internet','hardware']

These are pretty good guesses of what might be in the average Mercatus paper. I completely made these lists up, so don’t take this as a literal topic mapping!

Next, we need to loop through a transform() of the text column in our dataframe, df['text']. Each x in the tfidf.transform(df['text']) object below represents a document in our corpus, repesented as a matrix mapping numbers in our tfidf.vocabulary_ to tf-idf scores. In this way, each x represents a tf-idf score for each word in a document. Pretty cool!

topic = []
for x in tfidf.transform(df['text']):
 fin =[x[(0, tfidf.vocabulary_[w])] for w in tfidf.vocabulary_ if w in finance_words])
 tec =[x[(0, tfidf.vocabulary_[w])] for w in tfidf.vocabulary_ if w in tech_words])
 topic.append([fin, tec])
df_topic = pd.DataFrame(topic, columns=['finance','tech'])

The body of this for loop defines a fin and tec variable, each of which are the sum of tf-idf scores for financial and tech words that appear in that particular document. We then take these two numbers and append them as a pair to the topic array that we initialized in the beginning of this code block. Once the for loop runs through every document in our transform, we can make a dataframe with this list of pairs as two columns, 'finance' and 'tech'.

Plotting the results

What we have now is a dataframe called df_topic that has two columns with some very cool data: scores for how related each document is to the topics of finance and technology, as we defined them in our finance_words and tech_words lists above.

This is perfect for a scatterplot! As usual, pandas makes the task very simple:

df_topic.plot.scatter(x='finance', y='tech', xlim=(0,0.31), ylim=(0,0.31))

Here we had to define what the x-axis and y-axis should be, along with some limits on those axes. The result is pretty cool:

As we can see, every paper seems to have at least a little something to say about technology, but many papers have absolutely nothing to do with finance. However, there is definitely a skew towards finance-related papers, with one paper having a large focus on the topic.


Obviously this is only modestly useful in application, but the exercise gave me a sense of the nuts and bolts of tf-idf fits and transforms work. For more insight, I suggest creating a transform() and storing it in a variable so that you can play around with the results. Something like this:

tf_transform = tfidf.transform(df['text'])

will allow you to peak into the x‘s that we iterated through earlier:

for x in tf_transform[26]:

Good luck and I hope you enjoy playing around with this stuff as much as I do!

Making your WordPress file writable using Linux command line

I do not want to read a goddamn codex.

I ran into this problem while trying to rob WordPress of their proper recognition while setting up this site. I’m hosting on AWS, so I have an SSH connection open that allows me to interact with files through the command prompt. If you are not familiar with what I just said then you should find a different post about this.

Anyways, it’s fairly simple, but if you’re only vaguely familiar with Unix permissions then this might be a useful tidbit. I changed directory until I was in the theme’s subdirectory and ran this:

chmod 746 style.css

If you are dealing with a different file than style.css, you can adjust this command accordingly. This opens things up just enough to be workable, as outlined by the list in this section.