Saturday, February 15, 2014

MEMEX: a DARPA Project (DARPA-BAA-14-21)

MEMEX: a good idea, but it mostly depends on who will use it, and how...
It's always the same story. Humans can be good and bad at the same time: nuclear research findings helped us fight diseases, but also gave us massive destruction capabilities.

MEMEX: what is it all about?


PART II: FULL TEXT OF ANNOUNCEMENT 
I. FUNDING OPPORTUNITY DESCRIPTION
The Defense Advanced Research Projects Agency (DARPA) is soliciting proposals for innovative research to maintain technological superiority in the area of content indexing and web search on the Internet. Proposed research should investigate approaches that enable revolutionary advances in science, devices, or systems. Specifically excluded is research that primarily results in evolutionary improvements to the existing state of practice.(...)
Overview
Today's web search is limited by a one-size-fits-all approach offered by web-scale commercial providers. They provide a centralized search, which has limitations in the scope of what gets indexed and the richness of available details. For example, common practice misses information in the deep web and ignores shared content across pages. Today's largely manual search process does not save sessions or allow sharing, requires nearly exact input with one at a time entry, and doesn't organize or aggregate results beyond a list of links.
The Memex program envisions a new paradigm, where one can quickly and thoroughly organize a subset of the Internet relevant to one’s interests.(...)


Well, I could simply submit my application to DARPA telling them to look at what the NSA is able to do, but it's not that simple anymore since Snowden's revelations ;-)


Why should I care?

My (main) concerns in red...

Source: the DARPA announcement is available here: https://www.fbo.gov/index?s=opportunity&mode=form&id=426485bc9531aaccba1b01ea6d4316ee

There is a PDF as well. Here are some extracts.


Technical Area 1: Domain-Specific Indexing 
Crawling should also be robust to automated counter-crawling measures, crawler bans based on robot behavior, human detection, paywalls and member-only areas, forms, dynamic and non-HTML content, etc.
Information extraction may include normalization of heterogeneous data, natural language processing for translation and entity extraction and disambiguation, image analysis for object recognition, coreference resolution, extraction of multimedia (e.g., pdf, flash, video, image), relevance determination, etc.
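To give a rough idea of what TA1 is talking about, here is a minimal sketch of my own (not DARPA's design) of a focused crawl with crude keyword-based relevance filtering. It assumes Python with the requests and BeautifulSoup libraries; the seed URL and the keywords (borrowed from the counterfeit goods example further down) are purely illustrative.

# My own toy illustration of a focused crawl + relevance check, not DARPA's design.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

SEED_URLS = ["https://example.org/marketplace/"]      # hypothetical seed for a domain of interest
KEYWORDS = {"counterfeit", "replica", "wholesale"}    # hypothetical domain vocabulary

def crawl(seeds, max_pages=50):
    seen, queue, hits = set(), list(seeds), []
    while queue and len(seen) < max_pages:            # a very crude "computational budget"
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                  # unreachable or blocking page: skip it
        soup = BeautifulSoup(resp.text, "html.parser")
        text = soup.get_text(" ", strip=True).lower()
        if any(k in text for k in KEYWORDS):          # crude relevance determination
            hits.append(url)
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))     # follow outgoing links
    return hits

print(crawl(SEED_URLS))

Obviously the real thing would also have to cope with non-HTML content, forms, paywalls and counter-crawling measures, which is exactly where it gets touchy.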
Technical Area 2: Domain-Specific Search 
Technical Area 2 includes the creation of a configurable domain-specific interface into web content. The domain-specific interface may include: conceptually aggregated results, e.g., a person; conceptually connected content, e.g., links for shared attributes; task relevant facets, e.g., key locations, entity movement; implicit collaboration for enriched content; explicit collaboration with shared tags; recommendations based on user model and augmented index, etc.

Erm... Maltego?

Also, TA2 performers will work with TA1 performers on the design of a query language for directing crawlers and information extraction algorithms. A language to specify the domain, including both crawling as well as interface capability, may include concepts, sets of keywords, time delimitations, area delimitations, IP ranges, computational budgets, semi-automated feedback, iterative methods, data models, etc.
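Just to picture it, here is a purely hypothetical sketch of what such a domain specification could look like. The field names are mine, loosely mirroring the BAA wording (concepts, keywords, time and area delimitations, IP ranges, computational budget); this is not an actual DARPA format.

# Hypothetical domain specification, my own illustration only.
DOMAIN_SPEC = {
    "name": "counterfeit-goods",
    "concepts": ["counterfeit", "replica", "grey market"],
    "keywords": ["wholesale", "unbranded", "customs seizure"],
    "time_window": {"from": "2013-01-01", "to": "2014-02-15"},
    "area": ["eu-west", "us-east"],              # geographic delimitation
    "ip_ranges": ["198.51.100.0/24"],            # documentation-only example range
    "crawl_budget": {"max_pages": 10000, "max_hours": 12},
    "feedback": "semi-automated",                # analyst-in-the-loop relevance feedback
}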
Technical Area 3: Applications
Human Trafficking, especially for the commercial sex trade, is a line of business with significant web presence to attract customers and is relevant to many types of military, law enforcement, and intelligence investigations. The use of forums, chats, advertisements, job postings, hidden services, etc., continue to enable a growing industry of modern slavery. An index curated for the counter trafficking domain, including labor and sex trafficking, along with configurable interfaces for search and analysis will enable a new opportunity for military, law enforcement, legal, and intelligence actions to be taken against trafficking enterprises.
Other application domains will be considered during the life of the program, possibly including indexing and interfaces for found data, missing persons, counterfeit goods, etc.
Since technology development will be guided by end-users with operational support expertise, DARPA will engage elements of the DoD and other agencies to develop use cases and operational concepts around Human Trafficking and other domains. 
"Human Trafficking". Cool, no one can argue that it's not a legit one. I personally think the use case is relevant. Another one is money laundering, because it is very often bound to human trafficking and slavery. I suppose they are many other use cases that would benefit from such project. 
But still... What if it is used by an entity or personae with other goals in mind? (the list of "abnormal" activities is way too long to be written down here, and it depends on too many factors like culture, politics, education, regulations, etc. I told you, the list is not exhaustive)


2. Foreign Participation

Non-U.S. organizations and/or individuals may participate to the extent that such participants comply with any necessary nondisclosure agreements, security regulations, export control laws, and other governing statutes applicable under the circumstances. 
Others? Circumstances? A bit vague...
D. Other Eligibility Requirements
1. Ability to Support Classified Development, Integration and Transition

While the program itself is unclassified, interactions with end-users and potential transition partners will require Technical Area 3 performers to have access to classified information. Therefore, at the time of award, all prime proposers to Technical Area 3 must have (at a minimum) Secret facility clearance and have personnel under their Commercial and Government Entity (CAGE) code with a Secret clearance that are eligible for TS/SCI. Technical Area 3 proposers must provide their CAGE code and security point(s) of contact in their proposals. 
(...)
Proposers for Technical Areas 1 and 2 are not required to have security clearances.

Yes, this is a touchy project...


Whoever becomes able to really index all of the Internet, including the Dark Web, will for sure be in a position to compete with substantial chances of winning any challenge (commercial, military, social).
There are legitimate fights, like the use case proposed by the project, or child protection for example, and many more. But I'm always scared of what we humans are able to do: turning something good into evil.
I am also anxious about a possible splitting of the Internet in two.

Whoever is able to crunch this massive amount of data even further will somehow be able to achieve some form of "prediction".
If I am able to know, I will then be able to infer. Am I allowed to do so (regulations)?



Crawling should also be robust to automated counter-crawling measures, crawler bans based on robot behavior, human detection, paywalls and member-only areas, forms, dynamic and non-HTML content, etc.
If I am running a web server and configured it not to be indexed (robots.txt), it is because I deliberately chose to do so! If I am a member of a private or invitation-only forum, it may be because I am trying to keep my private and professional lives separate. There are many other genuine examples I could mention.
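For the record, being a polite crawler is a solved problem. Here is a minimal sketch using Python's standard library, which is the exact opposite of the "robust to crawler bans" behaviour the BAA asks for; example.org and the user-agent string are placeholders.

# A well-behaved crawler checks robots.txt before fetching anything.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")      # hypothetical site
rp.read()

url = "https://example.org/private/members.html"
if rp.can_fetch("MyCrawler/1.0", url):
    print("allowed to fetch", url)
else:
    print("robots.txt says no; a polite crawler stops here")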


  • The risk is that the Dark Web will become even darker, and with the possible advent of global encryption it might get even trickier for non-aware people (read: non-geeks) to access information that is not under someone else's scrutiny.
  • This raises numerous concerns, the major one being freedom of speech, and the difficulty of accessing different sources of information, some controlled and some not. To make up my mind, I like to read different newspapers; they are not necessarily from the same side of the Thames.
  • Another one is the lack of awareness of our representatives when it comes to Information Technology. Most of them don't have a geek mindset and are therefore not able to assess or regulate what is in the pipe with regard to information gathering and the controls that can be applied to it. Indeed, the Internet today is (still almost) free, but far from being properly regulated. Why? Well, this is a tricky topic, isn't it? "Laws are like sausages - it is best not to see them being made" (Otto von Bismarck).

To sum up:


Our future will be massively interconnected, each object having its own personality (properties and methods... some OOP developers will like this), an object being a person or a real thing. Tracking what they are really doing - the traces they leave all over the Internet over time - is to me a bit Orwellian. Nevertheless, this DARPA project is very interesting. There is a genuine need to map the entire content of the Internet in order to possibly infer evil behaviors and thus fight crime, prevent attacks before they happen, and draw or graph interesting facts/events/ideas over time.
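If you want the toy OOP picture of it (my own illustration, nothing from the actual program), it could look like this in Python: any entity, a person or a thing, with properties, methods, and a timestamped trail of traces that can later be queried or graphed.

# A toy "tracked object" with properties, methods, and traces over time.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TrackedEntity:
    name: str                                     # a person, a device, a web identity...
    properties: dict = field(default_factory=dict)
    traces: list = field(default_factory=list)    # what it leaves behind over time

    def leave_trace(self, what: str):
        self.traces.append((datetime.utcnow(), what))

    def timeline(self):
        return sorted(self.traces)

entity = TrackedEntity("some_forum_user", {"forum": "invite-only"})
entity.leave_trace("posted an advertisement")
entity.leave_trace("logged in from a new IP")
print(entity.timeline())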

As a security practitioner I see the benefits, while as a citizen I fear the misuse(s), especially since today's regulations are far from suitable, or are still being drafted.


RasKal, February 15th, 2014.