Recursively Downloading the NCSU Campus Directory

Posted on Feb 6, 2012

NCSU provides a Campus Directory that allows you to search for a student or faculty member, and will return certain information as long as they have not opted to be removed from the directory. The directory is very useful for finding a fellow student’s information, but can be publicly viewed by anyone without limitation.

I have written a small recursive script in PHP to go through the entire directory and save each student’s year in school, major, and email. This demonstrates how easy it would be for a business to create a list and send students targeted advertising spam. Or even worse: how easy it would be for a hacker to create a list and send students personalized, targeted phishing emails…

The Lists

Here are a few lists that I have compiled from my database in order to get my point across. These lists provide partial email addresses, so don’t expect to email the undergraduate body…

By Year in School (Undergraduate)

By Major

The Problem

Anyone with basic programming knowledge can recursively search the NCSU Campus Directory and save all public user information, including name, year in school, major, and email address. They can then use this to send targeted emails to NCSU students. I have received unsolicited emails at my NCSU account from textbook websites, student loan companies, and other college-related businesses. I have even received an email from a student, Ian Hill, with the subject “VOTE NOW! Ian Hill for Leader of the PACK”. I don’t believe that his targeted advertising ended up winning him the scholarship, but it was a notable attempt.

Now imagine this: a hacker sends a personalized email to everyone in your major, and it addresses you by name. The email references some new website for students in your major that lets them network with employers (or makes some other enticing offer). When you click on the link, you are taken to a fake NCSU WRAP login page. If you do not check the domain, you may try to log in to the fake website and give the hacker your Unity ID and password.
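To make "check the domain" concrete, here is a minimal sketch of the check you would otherwise do by eye. The function name is_ncsu_link() is hypothetical, and requiring HTTPS is my own assumption about what a real login page should use. Note that it compares the host name, not the text of the link, which is exactly what a phishing email tries to blur.

```php
<?php
// Hypothetical helper: returns true only when the link uses HTTPS (an
// assumption) and its host is ncsu.edu or a subdomain of ncsu.edu.
function is_ncsu_link(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    if (strtolower($parts['scheme']) !== 'https') {
        return false;
    }
    $host = strtolower($parts['host']);
    // Match the exact domain or a dot-separated subdomain, so that
    // "ncsu.edu.example.com" and "fake-ncsu.edu" are both rejected.
    return $host === 'ncsu.edu' || substr($host, -9) === '.ncsu.edu';
}

var_dump(is_ncsu_link('https://www.ncsu.edu/directory/'));    // bool(true)
var_dump(is_ncsu_link('https://ncsu.edu.example.com/login')); // bool(false)
```

The second URL is the kind of address a fake login page might use: it starts with "ncsu.edu" but actually belongs to example.com.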

The Solution

NCSU could easily put some controls on their Campus Directory search to make recursively downloading it much more difficult. The most obvious control would be to limit the campus directory search to a certain number of requests per IP address per day. They could then expand upon this policy by letting students and faculty log in through WRAP to get more requests than the “guest” limit. Another possibility would be to require a CAPTCHA in order to execute a search.
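As a rough sketch of what such a limit could look like, the snippet below counts searches per IP address per day and gives logged-in users a higher quota. Everything in it is an assumption for illustration: the function name allow_search(), the two limits, and the in-memory array that stands in for whatever database or cache a real directory would use.

```php
<?php
// Hypothetical limits; this post does not propose specific numbers.
const GUEST_DAILY_LIMIT = 25;
const MEMBER_DAILY_LIMIT = 200;

/**
 * Records one search for $ip and reports whether it is allowed.
 * $counts maps "YYYY-MM-DD|ip" to the number of searches made so far.
 */
function allow_search(array &$counts, string $ip, bool $logged_in, string $today): bool
{
    $limit = $logged_in ? MEMBER_DAILY_LIMIT : GUEST_DAILY_LIMIT;
    $key = $today . '|' . $ip;
    $used = $counts[$key] ?? 0;
    if ($used >= $limit) {
        return false; // over quota: show an error or a CAPTCHA instead
    }
    $counts[$key] = $used + 1;
    return true;
}

// Simulate one guest sending 30 searches on the same day.
$counts = array();
$today = date('Y-m-d');
$allowed = 0;
for ($i = 0; $i < 30; $i++) {
    if (allow_search($counts, '203.0.113.7', false, $today)) {
        $allowed++;
    }
}
echo $allowed . " of 30 guest searches allowed\n"; // 25 of 30
```

Because the date is part of the key, the counters reset on their own when the day changes. Even a generous quota changes the picture: a script that needs thousands of searches to walk the whole directory would take weeks from a single address instead of an afternoon.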

Limiting requests was not invented yesterday: many popular online services limit the number of requests you can pump through their systems. A combination of CAPTCHAs to prevent robot use, a guest access policy, and a registered user access policy effectively keeps the services that we love most running on the internet.

The Code

I wasn’t just going to end the post without giving you some code! Here is the set of functions that I used. Note that you would run this script over and over again with cron until the directory has been completely downloaded. Each run saves the last query it tried, so the next run picks up where the previous one stopped. Doing it all in one shot would probably end with PHP running out of memory before completing.

<?php

$requests = 0;

set_time_limit(0);
$query=retrieve_query();
if ($query!=-1)
{
	recurse_directory('',$query);
	//We don't reach the next statement until the directory has been completely
	//downloaded.  See the exit(); statement in recurse_directory()
	save_query('-1');
}
else
{
	echo 'Directory DB Complete';
}

function recurse_directory($query,$load='')
{
	//$requests is defined at file scope, so it has to be imported here
	global $requests;
	if (empty($load))
	{
		$load='a';
	}
	for ($letter=ord(substr($load,0,1));$letter<=122;$letter++)
	{
		//Restart the recursion from the query saved in non-volatile memory
		if (strlen($load)>1)
		{
			recurse_directory($query.substr($load,0,1),substr($load,1));
			$load='';
		}
		else
		{
			$next_query=$query.chr($letter);
			save_query($next_query);
			//Kill the current script if the number of requests has
			//gotten too high so we don't run out of memory
			if ($requests>=50)
			{
				exit('Exiting...');
			}
			echo 'Query: '.$next_query.'<br />';
			$result=get_url($next_query);
			$requests++;
			
			if ($result['next']=='recurse')
			{
				recurse_directory($next_query);
			}
			else if ($result['next']=='continue')
			{
				continue;
			}
			else if ($result['next']=='insert')
			{
				continue;
			}
		}
	}
}

function save_query($query)
{
	/*************************************
	* Insert Code to save $query
	* in non-volatile memory
	* (text file, database, etc...)
	*************************************/
}

function retrieve_query()
{
	/*************************************
	* Insert Code to read query
	* from non-volatile memory
	* (text file, database, etc...)
	*************************************/
}

function get_url($query)
{
	/*************************************
	* Uses Simple HTML Dom
	* http://simplehtmldom.sourceforge.net/
	* Add require_once code for simplehtmldom
	*************************************/
	$html=file_get_html('http://www.ncsu.edu/directory/?lastnametype=starts&lastname=&firstnametype=starts&firstname=&emailaddresstype=starts&emailaddress='.$query.'&addresstype=contains&address=&phonenumbertype=ends&phonenumber=&departmenttype=contains&department=&titletype=contains&title=&searchtype=both&matchnicks=on&includevcard=on&matchprevlast=on&order=mixed&style=normal&search=Search');
	$results=$html->find('.resultsummary .leftjustify', 0);
	echo $results->plaintext.'<br />';
	$return=array();
	if (strpos($results->plaintext,'1 - 50')!==FALSE)
	{
		//A full page of 50 results: narrow the search with another letter
		$return['next']='recurse';
	}
	else if (strpos($results->plaintext,'0 - 0')!==FALSE)
	{
		//No results: move on to the next letter
		$return['next']='continue';
	}
	else
	{
		/*************************************
		* Insert Code to scrape the fields
		* you want off the directory webpage
		* and save them in your DB
		*************************************/
		$return['next']='insert';
	}
	//Tell recurse_directory() what to do next
	return $return;
}
	
?>
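The save_query() and retrieve_query() placeholders above leave the storage open. As one possible way to fill them in, here is a minimal text-file sketch. The helper names save_checkpoint() and load_checkpoint() and the file location are hypothetical, chosen so they don't clash with the functions in the script.

```php
<?php
// Hypothetical checkpoint helpers; the file path is an assumption.
function save_checkpoint(string $path, string $query): void
{
    // LOCK_EX keeps two overlapping cron runs from interleaving writes.
    file_put_contents($path, $query, LOCK_EX);
}

function load_checkpoint(string $path): string
{
    if (!is_file($path)) {
        return ''; // first run: nothing saved yet, start from the beginning
    }
    return trim((string) file_get_contents($path));
}

$path = sys_get_temp_dir() . '/directory_checkpoint.txt';
save_checkpoint($path, 'abc');
echo load_checkpoint($path) . "\n"; // prints "abc"
save_checkpoint($path, '-1');       // the script's "finished" marker
```

An empty string on the first run lines up with recurse_directory(), which falls back to 'a' when its $load argument is empty.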