In a previous post I wrote about how it’s possible to scrape emails from GitHub repositories using their API. I even wrote up a Ruby script to do this. After discovering the git shortlog command, I now realize that was a very complicated way to go about it.
With git shortlog you can list all contributor emails for any git repository, not just GitHub repos.
Disclaimer: I am writing about this to make others aware of this form of scraping, and it is purely for educational purposes. I do not plan on doing anything with emails from git repos, and you shouldn’t either.
TL;DR
You can run this command within any git repo to extract all contributor emails:
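Pieced together from the stages described in the breakdown below, the whole pipeline can be sketched roughly like this. The function name contributor_emails and its optional path argument are illustrative additions, and --all is written out in full rather than bundled into the short flags.

```shell
# Sketch: print unique, lowercased contributor emails for a git repo.
# The function name and the optional path argument are illustrative.
contributor_emails() {
  git -C "${1:-.}" shortlog -se --all |
    grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" |
    awk '{print tolower($0)}' |
    sort | uniq |
    grep -wv 'users.noreply.github.com'
}

# Run it against the repository in the current directory:
# contributor_emails .
```

Wrapping the pipeline in a function is only a convenience. The body works just as well pasted directly into a terminal inside a repo.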
Command Breakdown
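The git shortlog -sea part of the command is meant as shorthand for git shortlog --summary --email --all. The -s and -e flags are the documented short forms of --summary and --email, but the git documentation lists no short form for --all, so writing git shortlog -se --all is the safer spelling. This command outputs the number of commits each author has made, along with their name and email, across all branches and other refs.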
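To see what --all changes, here is a small experiment on a throwaway repository. The repo, the branch name, and both authors are made up for the demonstration. It assumes git is installed and mktemp is available.

```shell
# Build a throwaway repo with two branches and two made-up authors.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git -c user.name="Jane Doe" -c user.email="jane@example.com" \
    commit -q --allow-empty -m "commit on the default branch"
git checkout -q -b feature
git -c user.name="John Roe" -c user.email="john@example.org" \
    commit -q --allow-empty -m "commit on a feature branch"
git checkout -q -

# Only commits reachable from the current branch. HEAD is passed
# explicitly so the command also behaves inside scripts.
head_only=$(git shortlog --summary --email HEAD)

# Commits reachable from every ref, including the feature branch.
all_refs=$(git shortlog --summary --email --all)

printf 'HEAD only:\n%s\n\nAll refs:\n%s\n' "$head_only" "$all_refs"
```

The first listing should contain only the author of the default-branch commit, while the second should contain both authors.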
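The next command, grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b", extracts emails from each line using a regular expression. The -E flag turns on extended regular expressions, and -o prints only the part of the line that matched.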
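Here is the grep stage on its own, fed one made-up line in the shape that git shortlog -se prints. Two caveats: \b is an extension that GNU grep supports rather than part of POSIX, and the {2,6} quantifier caps the last label at six letters, so an address under a longer top-level domain would be skipped.

```shell
# One made-up line in the shape that git shortlog -se prints.
line='    12  Jane Doe <Jane.Doe@Example.com>'

# -E enables extended regex syntax; -o prints only the matching part.
email=$(printf '%s\n' "$line" |
  grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b")

echo "$email"   # Jane.Doe@Example.com
```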
The output from the previous command is piped into awk '{print tolower($0)}', which lowercases all the emails. Emails are sometimes typed with capital letters, and lowercasing everything makes it easier to sort them and find the unique ones later.
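A quick way to see the effect is to feed awk two made-up spellings of the same address that differ only in case:

```shell
# Two made-up spellings of the same address, differing only in case.
lowered=$(printf '%s\n' 'Jane.Doe@Example.com' 'jane.doe@example.com' |
  awk '{print tolower($0)}')

echo "$lowered"
# jane.doe@example.com
# jane.doe@example.com
```

Both lines come out identical, which is what lets the later stages treat them as one email.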
After that, the output is piped into sort and then uniq. The emails are sorted alphabetically first, because uniq only drops duplicate lines that sit next to each other.
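The sort and uniq stage can be tried in isolation on a few made-up addresses. The comparison with sort -u at the end is my own aside, not part of the original command.

```shell
# Made-up input with one duplicate that is not adjacent to its twin.
emails='bob@example.org
alice@example.com
bob@example.org'

# uniq only collapses adjacent duplicates, so the sort has to come first.
unique=$(printf '%s\n' "$emails" | sort | uniq)
echo "$unique"
# alice@example.com
# bob@example.org

# sort -u does both steps in one command and gives the same result.
[ "$unique" = "$(printf '%s\n' "$emails" | sort -u)" ] && echo "same result"
```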
That should suffice for a lot of git repos, but I also added grep -wv 'users.noreply.github.com' to the end of the command to exclude noreply emails associated with GitHub.
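As a small check of that last filter, here it is applied to two made-up addresses, one of which uses the GitHub noreply domain. The -v flag inverts the match so matching lines are dropped, and -w requires the pattern to match as a whole word.

```shell
# Made-up addresses; the second one uses the GitHub noreply domain.
emails='alice@example.com
12345+alice@users.noreply.github.com'

# -v drops matching lines; -w requires a whole-word match.
kept=$(printf '%s\n' "$emails" | grep -wv 'users.noreply.github.com')

echo "$kept"   # alice@example.com
```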
Extracting Emails With git log
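It’s possible to do something similar with the git log --pretty="%ce" command. However, I noticed that this command does not show as many emails as git shortlog. I didn’t look too much into it, but two things likely contribute. Without any revision arguments, git log only walks the commits reachable from HEAD, so it covers one branch unless you add --all. Also, %ce prints the committer email, while git shortlog groups by author by default; the author email placeholder is %ae.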
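A git log variant that should come closer to the shortlog results might look like the sketch below. The function name author_emails_via_log and its optional path argument are illustrative, and swapping %ce for %ae plus adding --all are my adjustments rather than something I tested against the original command.

```shell
# Sketch: author emails from every ref, using git log instead of shortlog.
# %ae is the author email placeholder (%ce would print the committer email).
author_emails_via_log() {
  git -C "${1:-.}" log --all --pretty="%ae" |
    awk '{print tolower($0)}' |
    sort -u
}

# Run it against the repository in the current directory:
# author_emails_via_log .
```

Because git log prints one bare email per line, no grep stage is needed to pull the address out of the surrounding text.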
References
I learned about git shortlog from this Stack Overflow question:
I got the email regex from here: