A research project by a North Carolina State University team indicates that public GitHub repositories leak API tokens and cryptographic keys in alarming numbers. The team described its process and results in the summary report "How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories." For nearly six months, the researchers scanned public repositories, covering about 13% of the open-source repositories on GitHub.
During the scans, the team found that more than 100,000 repositories had leaked secret information and that thousands of new leaks occur each day. The report summarizes the findings as follows:
"This work shows that secret leakage on public repository platforms is rampant and far from a solved problem, placing developers and services at persistent risk of compromise and abuse."
Leaks are not only routine; many also went undetected by the data owners for quite some time. Only 6% of leaked API and cryptographic keys were removed within an hour of the leak, and around 12% within a day. After 16 days, about 19% had been removed. The remaining 81% of the leaks the research team discovered were not removed at all.
Although these numbers look bad, GitHub suggests that many of the leaked tokens are likely no longer valid. Its practice includes notifying service providers within seconds of a leak becoming public, so that the providers can verify and revoke the affected tokens.
The researchers shared their findings with GitHub. GitHub started a similar project as the NC State team neared completion of its own, and had already been combating the problem through practices such as Token Scanning when the research was published. To date, GitHub has notified service providers about more than 100 million potential token matches for verification and revocation.
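To make the idea of a "potential token match" concrete, scanning of this kind can be pictured as pattern matching over file contents. The Python sketch below is only an illustration, not GitHub's or the researchers' implementation. The function name find_potential_secrets and the two patterns are assumptions chosen for the example: the commonly cited shape of an AWS access key ID ("AKIA" followed by 16 uppercase letters or digits) and the header line of a PEM private key.

```python
import re

# Illustrative patterns only (assumptions for this sketch). A real scanner
# would rely on many more provider-specific rules.
TOKEN_PATTERNS = {
    # Commonly cited shape of an AWS access key ID.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Header line of a PEM-encoded private key.
    "pem_private_key": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_potential_secrets(text):
    """Return (line_number, pattern_name) pairs for lines matching a pattern."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in TOKEN_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = (
        'region = "us-east-1"\n'
        'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
        "-----BEGIN RSA PRIVATE KEY-----\n"
    )
    for line_number, name in find_potential_secrets(sample):
        print(f"line {line_number}: possible {name}")
```

A match here is only a candidate. As the article notes, potential matches are passed to the service provider for verification before any revocation, since a string can fit a pattern without being a live credential.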