In a previous post I wrote about how it’s possible to scrape emails from GitHub repositories using their API. I even wrote up a Ruby script to do this.
After discovering the git shortlog command, I now realize that was a very complicated way to go about it.
With git shortlog you can list all contributor emails for any git repository, not just GitHub repos.
You can run this command within any git repo to extract all contributor emails:
git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'
Command Breakdown
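The git shortlog -sea part of the command is meant as shorthand for git shortlog --summary --email --all. The git-shortlog documentation lists -s and -e as short options but no -a, so if you want to be sure every branch is covered, spell out --all. With --all, the command outputs the number of commits each author has made, along with their name and email, across all branches.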
$ git shortlog -sea
54 First Last <FirstLast@example.com>
385 Another User <Anotheruser@example.com>
2 user1 <user1@example.com>
31 first last <firstlast@example.com>
10 Someone Else <1234567+someoneelse@users.noreply.github.com>
The next command, grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b", extracts emails from each line using a regular expression.
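As a rough sketch of this extraction step, the function below wraps the same grep call so it can be tried on sample lines without a repository. The name extract_emails is made up for illustration. One thing to keep in mind is that the {2,6} quantifier only accepts top-level domains of two to six letters, so an address at a longer TLD would be skipped. The second function, emails_in_brackets (also hypothetical), is one possible alternative: it prints whatever sits between the angle brackets that git shortlog puts around each email when -e is given.

```shell
#!/bin/sh
# Hypothetical helper: same regex as in the command above, reading lines from stdin.
# -E enables extended regular expressions, -o prints only the matched text.
extract_emails() {
  grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
}

# Hypothetical alternative: print the text between the last "<" and ">" on each line.
emails_in_brackets() {
  sed -n 's/.*<\([^<>]*\)>.*/\1/p'
}

# Usage (inside a git repository):
#   git shortlog --summary --email --all | extract_emails
#   git shortlog --summary --email --all | emails_in_brackets
```

The bracket-based version trusts git's output format rather than guessing what a valid address looks like, which is a trade-off: it will pass through anything a contributor configured as their email, valid or not.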
$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"
FirstLast@example.com
Anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com
The output from the previous command is piped into awk '{print tolower($0)}', which lowercases all the emails. Sometimes emails are typed in with capital letters. Lowercasing all characters will help with sorting and finding unique emails later.
$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}'
firstlast@example.com
anotheruser@example.com
user1@example.com
firstlast@example.com
1234567+someoneelse@users.noreply.github.com
After that, the output is piped into sort and uniq. These commands are straightforward. The emails are sorted alphabetically, then duplicates are excluded from the output.
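To make the lowercasing and de-duplication steps concrete, here is a small sketch that can be run on sample input. The function name unique_lowercase is hypothetical. It uses the same awk call as above together with sort -u, which should behave the same as sort | uniq here. Sorting has to come first either way, because uniq only collapses duplicate lines that are adjacent.

```shell
#!/bin/sh
# Hypothetical helper: lowercase every line from stdin, then sort and drop duplicates.
unique_lowercase() {
  awk '{print tolower($0)}' | sort -u
}

# Usage:
#   printf '%s\n' 'B@example.com' 'a@example.com' 'b@example.com' | unique_lowercase
```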
$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq
1234567+someoneelse@users.noreply.github.com
anotheruser@example.com
firstlast@example.com
user1@example.com
That should suffice for a lot of git repos, but I also added grep -wv 'users.noreply.github.com' to the end of the command to exclude noreply emails associated with GitHub.
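A possible variation on this filter is sketched below, with a made-up function name (drop_noreply). In a grep pattern an unescaped dot matches any character, so adding -F makes grep treat users.noreply.github.com as a literal string. Also note that grep -v exits with status 1 when it prints no lines, which can matter if the pipeline runs inside a script that stops on errors.

```shell
#!/bin/sh
# Hypothetical helper: drop GitHub noreply addresses from a list of emails on stdin.
# -w matches whole words, -v inverts the match, -F treats the pattern as a fixed string.
drop_noreply() {
  grep -wv -F 'users.noreply.github.com'
}

# Usage:
#   printf '%s\n' 'someone@users.noreply.github.com' 'user1@example.com' | drop_noreply
```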
$ git shortlog -sea | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" | awk '{print tolower($0)}' | sort | uniq | grep -wv 'users.noreply.github.com'
anotheruser@example.com
firstlast@example.com
user1@example.com
Extracting Emails With git log
It’s possible to do something similar with the git log --pretty="%ce" command. However, I noticed that this command does not show as many emails as git shortlog. I didn’t look too much into it, but two things likely contribute: git log only walks the current branch unless you pass --all, and %ce prints the committer email, whereas git shortlog groups commits by author by default (the author email placeholder is %ae).
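For comparison, here is a minimal sketch of a git log version that may get closer to what git shortlog reports. It relies on two documented behaviors of git log: it only walks the current branch unless given --all, and %ae is the author email while %ce is the committer email. The function name list_author_emails is hypothetical, and the noreply filtering step is left out.

```shell
#!/bin/sh
# Hypothetical helper: print unique, lowercased author emails from every ref
# of the repository at the given path (defaults to the current directory).
list_author_emails() {
  git -C "${1:-.}" log --all --format='%ae' | awk '{print tolower($0)}' | sort -u
}

# Usage:
#   list_author_emails /path/to/repo
```

I have not compared this against git shortlog on a large repository, so treat any remaining difference in the two lists as something to investigate rather than as a bug in either command.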
References
I learned about git shortlog from this Stack Overflow question:
I got the email regex from here: