RE: file hashes

From: Skinner, Andrew (NBC Universal) <Andrew.Skinner_at_nbcuni.com>
Date: Thu, 14 Jun 2007 17:00:47 -0700

Hey Jay,

 

I'd like to set up a call to discuss the user hash situation. Will you
be available tomorrow morning?

 

 

Andrew

 

________________________________

From: Jay Mairs [mailto:jay_at_mediadefender.com]
Sent: Wednesday, May 23, 2007 12:29 PM
To: Skinner, Andrew (NBC Universal); Markham, Aaron (NBC Universal)
Subject: RE: file hashes

 

Also, just so you know, I'm going to be out of the office for a week
starting tomorrow. I just found out the place I'm staying doesn't have
internet, so my email access will be intermittent at best. I'll
probably go into some sort of withdrawal

 

From: Jay Mairs
Sent: Wednesday, May 23, 2007 12:24 PM
To: Skinner, Andrew (NBC Universal); Markham, Aaron (NBC Universal)
Subject: RE: file hashes

 

It's theoretically possible, but it would require a significant
development effort. Also, in my opinion, the method I described in my
previous email would get a more complete set of swarmed hashes than this
method would.

 

From: Skinner, Andrew (NBC Universal) [mailto:Andrew.Skinner_at_nbcuni.com]

Sent: Wednesday, May 23, 2007 10:18 AM
To: Jay Mairs; Markham, Aaron (NBC Universal)
Subject: RE: file hashes

 

I wasn't sure what you were using to identify swarms, so "swarm
information" was just a general label for the unique identifier. It
seems that sending the file hash still falls victim to the problemss
mentioned below.

 

 

Is it possible for computers outside the protection system to enter a
swarm and detect that a countermeasure computer is present?

 

________________________________

From: Jay Mairs [mailto:jay_at_mediadefender.com]
Sent: Wednesday, May 23, 2007 9:20 AM
To: Skinner, Andrew (NBC Universal); Markham, Aaron (NBC Universal)
Subject: Re: file hashes

 

What swarm information are you referring to? The hash is what we use to
uniquely identify a swarm, all the other file info like filename, etc.
is just useful descriptive info.

----- Original Message -----
From: Skinner, Andrew (NBC Universal) <Andrew.Skinner_at_nbcuni.com>
To: Jay Mairs; Markham, Aaron (NBC Universal) <Aaron.Markham_at_nbcuni.com>
Sent: Wed May 23 08:56:11 2007
Subject: Re: file hashes

Haha. I hope it wasn't a crayon...

Your setup sounds pretty complex. I hope you mind me throwing ideas your
way.

Would it be possible to have the protection machines act as scouts for
the reporting system? Instead of the protection system gathering all the
hashes and sending them to the control computers, what if they only sent
information about the swarm to the control machines? Then that
information could be used to sick the reporting system on the same swarm
the countermeasures computers are dealing with.

Now, I don't know the details about you setup, but that should reduce
the amount of traffic between the cm machines and the control machines
while delivering similar results. Do you think that something like this
would cause a huge development period or even be effective?

-----Original Message-----
From: Jay Mairs <jay_at_mediadefender.com>
To: Markham, Aaron (NBC Universal); Skinner, Andrew (NBC Universal)
Sent: Tue May 22 13:35:20 2007
Subject: RE: file hashes

My email program says I already replied to this email, which I didn't.
So if you got empty replies already, I apologize. I was out of the
office yesterday because my son stuck something up his nose and I had to
take him to urgent care. I guess we know where he gets his smarts
from;)

As far as getting hashes from the protection system, it's one of those
things that sounds easy, but in reality it will take a relatively large
development effort and have non-trivial consequences for the performance
of the protection system. I'll try to explain why without getting into
the hairy details.

The main problem is due to the distributed nature of the protection
system. It includes thousands of protection computers at various
datacenters which are controlled loosely by a smaller set of controlling
computers. The protection computers are pretty autonomous in that they
make a lot of decisions themselves, including which file hashes to
swarm. The controlling computers are giving the protection computers
higher level commands, such as adding new projects, changing project
metadata, removing projects, and getting per project counts of
protection events (spoofs, decoys, etc.). Each controlling computer is
communicating with hundreds or thousands of protection computers, so we
need to keep the per protection computer traffic down to a minimum. So,
we have a situation where the data you're looking for exists temporarily
on each of the many protection computers, but any attempt to bring that
raw data back to the controller computer will dramatically increase
traffic being sent to it, which will force us to redesign the hierarchy
of our controlling system.

A related problem is the fact that the controller computers keep track
of content owners (such as NBC Universal) but the individual protection
computers do not. They only have lower level information, such as
project keywords, filesize thresholds, etc. So, the fact that supply
numbers are low (2k-5k) on certain projects doesn't help, because the
computer couldn't distinguish between your projects and other projects.
Obviously, this aspect of the design could be changed, but it would
involve significant development.

Another problem is the dynamic nature of the protection. The swarms we
interdict on are determined on the fly by each protection computer and
are updated as their view of the network changes over time. This means
that the data would need to be updated relatively frequently. Which
means more data being sent to the controller computers, which gets back
to the first problem I described.

I'm not trying to be all doom and gloom here, I'm just trying to spell
out the technological problems we're facing, despite the fact that the
problem seems like a simple one. We've been trying to think ways to get
you the type of data you're looking for without tearing apart our whole
protection system. All of the problems I mentioned above about the
protection system are not problems for the data collection system. It
is designed from the ground up to deal with large amounts of raw data.
I think we can get you a list of hashes from the data collection system
that is pretty close to the list of swarmed hashes by applying the same
logic the protection computers use to determine which hashes to hit.

If you're interested in that data, we should be able to add it to the
feed within 2 weeks.

Regards,
Jay Mairs

-----Original Message-----
From: Markham, Aaron (NBC Universal) [mailto:Aaron.Markham_at_nbcuni.com]
Sent: Monday, May 21, 2007 10:57 AM
To: Skinner, Andrew (NBC Universal); Jay Mairs
Subject: RE: file hashes

Heh... we shouldn't presume to tell them how to network their protection
system... that shit has to be difficult... anyway, Jay, since the
"supply" numbers are typically pretty low per project - say 2-5k on
average - why can't you collect those file hashes?

Keep in mind that we're looking at doing some direct measurement on
countermeasures effectiveness such that we're monitoring particular
swarms and watching individual peers as they collect up to 100% of the
file and then drop off. We'll monitor these individuals across multiple
swarms if they happed to try to download more than one version of a
file. That way we can have a more accurate picture of effectiveness.
Just because 9 out 10 files where fake most of the time doesn't mean we
were effective. If 50% of the user population defeats us by downloading
multiple files at once then we have a problem... if this is only 10% of
the population then it's not so bad.

So, if we do this kind of analytics it would be good to know that in
swarms where it appears that many of the users are getting the full file
versus swarms where everyone seems to be very slow at getting the full
file that countermeasures were involved. The only way to know this if
Mediadefender can tell us if they've interdicted a particular swarm. If
we find swarms that you didn't interdict (or aren't currently
interdicting) then we'd feed that info back to you automatically.

-----Original Message-----
From: Skinner, Andrew (NBC Universal)
Sent: Monday, May 21, 2007 10:38 AM
To: Jay Mairs; Markham, Aaron (NBC Universal)
Subject: RE: file hashes

If logging can't be enabled on the countermeasure servers, how about
routing all those machines through an internal proxy and then tracking
the connections that way?

-----Original Message-----
From: Jay Mairs [mailto:jay_at_mediadefender.com]
Sent: Thursday, May 17, 2007 10:32 AM
To: Markham, Aaron (NBC Universal)
Cc: Skinner, Andrew (NBC Universal)
Subject: RE: file hashes

We don't keep a history of individual file hashes/IP addresses in our
protection system because the computers in our protection system are
already pushed close to their limits. Any deep data collection (file
hashes, IP addresses, etc.) on our protection system would negatively
affect our protection effectiveness. Because of this problem, we
created a separate data collection system in order to collect more raw
data.

The data feed files from our data collection system contain raw data for
supply and demand (including IP address) on the respective networks.
The data collection system only collects supply and demand, it is not
connected or related to our protection system in any way, so there is no
spoof, decoy, or interdiction data associated with the data feed files
for each network.

-----Original Message-----
From: Markham, Aaron (NBC Universal) [mailto:Aaron.Markham_at_nbcuni.com]
Sent: Wednesday, May 16, 2007 1:49 PM
To: Jay Mairs
Cc: Skinner, Andrew (NBC Universal)
Subject: file hashes

Do you record the file hashes (for edonkey in particular) for every
swarm you interdict? Is that in the supply feed?
Received on Fri Sep 14 2007 - 10:55:54 BST

This archive was generated by hypermail 2.2.0 : Sun Sep 16 2007 - 22:19:46 BST