Scrape Contributor Emails From Any Git Repository

In a previous post I wrote about how it’s possible to scrape emails from GitHub repositories using their API. I even wrote up a Ruby script to do this. After discovering the git shortlog command, I now realize that was a needlessly complicated way to go about it.

With git shortlog you can list all contributor emails for any git repository, not just GitHub repos.


Disclaimer: I am writing about this to make others aware of this form of scraping. I am simply exposing a privacy issue with git. I do not plan on doing anything with emails from git repos and you shouldn’t either.


Extracting Emails With git shortlog

Run this command within any git repo to extract all contributor emails:

git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'

Command Breakdown

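The git shortlog -sea part of the command is meant as shorthand for git shortlog --summary --email --all. It outputs the number of commits each user has made, along with their name and email, across all branches. Note that the git documentation lists -s and -e as short options but gives --all only in its long form, so git shortlog -se --all is the safer spelling if you want to be sure every branch is covered.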

$ git shortlog -sea

    54  First Last <FirstLast@example.com>
   385  Another User <Anotheruser@example.com>
     2  user1 <user1@example.com>
    31  first last <firstlast@example.com>
    10  Someone Else <1234567+someoneelse@users.noreply.github.com>

The next command, grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b", extracts emails from each line using a regular expression.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"

FirstLast@example.com
Anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com
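
To try the extraction step without a repository, here is a minimal sketch that runs the same regular expression over two sample lines. The function name extract_emails is hypothetical, not something from git or grep. One thing worth knowing about this pattern: the {2,6} bound on the last label means an address under a longer top-level domain would not match at all.

```shell
# Hypothetical helper: read text on stdin and print each
# email-like token on its own line, using the post's regex.
extract_emails() {
  grep -E -o '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
}

# Two sample lines in the shape that git shortlog prints them.
printf '%s\n' \
  '    54  First Last <FirstLast@example.com>' \
  '    10  Someone Else <1234567+someoneelse@users.noreply.github.com>' |
  extract_emails
```

If longer top-level domains matter for your repos, widening the upper bound of {2,6} is one option, at the cost of a looser match.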

The output from the previous command is piped into awk '{print tolower($0)}', which lowercases all the emails. Sometimes emails are typed in with capital letters. Lowercasing all characters will help with sorting and finding unique emails later.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}'

firstlast@example.com
anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com
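
To see why the lowercasing step matters, the sketch below counts distinct lines before and after it. The helper names lowercase and count_unique are made up for this illustration, and the byte-wise C locale is forced only so the comparison behaves the same on every machine.

```shell
# Hypothetical helper: lowercase every line read from stdin.
lowercase() {
  awk '{ print tolower($0) }'
}

# Hypothetical helper: count distinct lines on stdin.
# Runs in a subshell so the locale change does not leak out.
count_unique() (
  LC_ALL=C
  export LC_ALL
  sort | uniq | wc -l | tr -d ' '
)

# The same address typed with two different capitalizations.
sample='FirstLast@example.com
firstlast@example.com'

printf '%s\n' "$sample" | count_unique               # counted as two addresses
printf '%s\n' "$sample" | lowercase | count_unique   # counted as one address
```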

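After that, the output is piped into sort and uniq. sort orders the emails alphabetically, which puts duplicates on adjacent lines, and uniq then collapses each run of identical adjacent lines into one. The order matters: uniq on unsorted input would miss duplicates that are not next to each other.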

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq

1234567+someoneelse@users.noreply.github.com
anotheruser@example.com
firstlast@example.com
user1@example.com
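
As a small aside, sort can do the de-duplication itself with its -u flag. The sketch below puts both spellings side by side; the function names dedupe and dedupe_short are hypothetical and exist only for the comparison.

```shell
# The two-step form used in this post.
dedupe() {
  sort | uniq
}

# An equivalent one-step form: sort -u drops duplicate lines itself.
dedupe_short() {
  sort -u
}

# Usage: both print each distinct address once, in sorted order.
printf '%s\n' 'b@example.com' 'a@example.com' 'b@example.com' | dedupe_short
```

Either form works here. sort | uniq is still useful when you want uniq -c to count how often each address appears.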

That should suffice for a lot of git repos, but I also added grep -wv 'users.noreply.github.com' to the end of the command to exclude noreply emails associated with GitHub.

$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'

anotheruser@example.com
firstlast@example.com
user1@example.com
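
If you expect to run this more than once, the filtering stages can be wrapped in a shell function. This is only a sketch: clean_emails is a hypothetical name, and it reads shortlog-style text on stdin so it can be tried on sample data without a repository.

```shell
# Hypothetical wrapper around the filtering stages of the pipeline.
# Reads shortlog-style text on stdin, prints cleaned addresses.
clean_emails() {
  grep -E -o '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b' |
    awk '{ print tolower($0) }' |   # normalize case before de-duplicating
    sort |
    uniq |
    grep -wv 'users.noreply.github.com'   # drop GitHub noreply addresses
}

# Usage inside a repository:
#   git shortlog -sea | clean_emails
```

One caveat when calling git shortlog from a script or cron job: according to its documentation, it reads the log from standard input when no revision is given and standard input is not a terminal. Passing a revision explicitly, for example HEAD or --all, avoids that.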

Extracting Emails With git log

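It’s possible to do something similar with the git log --pretty="%ce" command. However, I noticed that this command does not show as many emails as git shortlog. I didn’t look too much into it, but two things likely explain the gap. First, git log without --all only walks commits reachable from the current branch, rather than all branches like git shortlog --all. Second, %ce prints the committer email, whereas git shortlog groups by author by default; the author email placeholder is %ae.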
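
For comparison, here is a hedged sketch of a git log variant that should get closer to the shortlog output. It walks every ref with --all and prints both the author email (%ae) and the committer email (%ce) for each commit. The function name all_branch_emails and its optional path argument are my own; I have not compared its output against git shortlog on a large repository.

```shell
# Hypothetical helper: list every author and committer email
# reachable from any ref in the repository at $1 (default: current dir).
all_branch_emails() {
  git -C "${1:-.}" log --all --format='%ae%n%ce' |
    tr '[:upper:]' '[:lower:]' |   # normalize case
    sort -u                        # sort and de-duplicate in one step
}

# Usage:
#   all_branch_emails                  # current repository
#   all_branch_emails /path/to/repo    # some other clone
```

Because this prints one address per line already, no regular expression is needed to pull the emails out.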

References

I learned about git shortlog from this Stack Overflow question:

I got the email regex from here: