Archive

Archive for the ‘Uncategorized’ Category

P2P Backup System w/o SPOF for Work Group

July 29th, 2009 Bali No comments

This is also one of my half-completed ideas years ago. It was recalled recently by two stories:

Stories

#1: One of my team mates lost his Outlook email archive due to a mistaken operation. He is very upset because all of his emails in past two years are gone, unrecoverablly. Suggested Outlook mail file is 2GB, but people can easily drive it up to 10GB and often notice this too late. As what he said in MSN signature, “my archive, 555…”

#2: While I was browsing one of my favorite bloggers – Brad Abrams’s blog, ran into a shocking post, Help! My hard disk crashed. He can afford a new disk, but he need the data – especially “the pictures of the kids recent birthday party”. In the comments, people suggested various ways to recover it, unfortunately looks like no luck.

This leads to the most basic questions: why people don’t backup data even if they know they are in the risk of losing them? Probably because of:

1) The chance is very tiny

2) Current backup solutions are not “good enough”

For 1), although it might be true, but remember the result might be cataclysmic. Let us take a look at what is going on about 2).

Existing Backup Solutions

Traditional C/S – Most of current backup solutions are based on traditional C/S architecture. Assume one use 40% of 500GB disk drive on average; it is still a big capacity for enterprise with thousands of staff. Backup time is also another concern. Let me illustrate this with some math. For a backup server with two 1-Gbit NICs, its max throughput is about 200MB. If backup window(people are sleeping) is 10 hours, it means it can back up about 7.2TB(2*100*3600*10MB) data. Obviously it is hardly scalable along with business expansion.

Online backup – People can turn to AWS S3, or Azure SQL Data Services to simplify things on their own side. But this doesn’t make things even better if you look at current network speed. Calculating data is much easier than moving data around.

My P2P Backup System

p2p backup comes with several inherent advantages in handy:

1) No additional hardware expense. People often have at least 30% capacity left; we can leverage that for backup space.

2) No obvious network bottleneck.

Some people might also list “no single point of failure”, but that is not true. Let us take a look at how BT works, a general P2P network.

Basically here are what happens in P2P network – peers need talk to a central coordination server(i.e., Tracker in BT system) to get info about other peers, and then talk to individual peers about real data exchange. These two steps are essential. Single Point of Failure(SPOF) can happen in the coordination server. You may argue that you can enhance its reliability by techniques like mirroring, but does it look like over-heavyweight? We have many reliable services running by full time IT guys, why do we bother to re-invent the wheel? Several services can be utilized, such as AD, Sharepoint, exchange, which allow customized data writing. Data needed written into a central place would be:

1) Peer list by <IP, port>

2) Other meta data such as backup network name, Software RAID level

Of course, we need do compression, encryption, incremental backup, scheduled backup in the solution. I may do investigation later and update the post.

Also see other backup solutions

http://www.storegrid.com/index.html – select a folder, then backup, traditional

www.streamload.com – Large net disk, Upload your files to streamload, then view it, or email it to someone else.

http://www.beinsync.com/ – Access you computer: install a mini PHP web server onto it.

http://base.google.com/base/about.html – In about 15 minutes, your item will have a unique web address and be visible to the world.

http://www.openomy.com/openomy is an online file storage system designed to be a platform for Web 2.0 applications, built by Ian Sefferman

Memory Leaks Demo & Detection in .NET Application

July 29th, 2009 Bali No comments

Memory leaks are always headache of developers. Do .NET developers no longer bother to worry about memory leaks because of garbage collection? Yes and NO. GC periodically find objects that cannot be accessed in the future and then reclaim the resources used by the objects. GC achieves this by maintaining a list of references to live objects. When this mechanism is broken, memory leak happens.

There are many reasons to leak memory. In addition to calling unmanaged code from managed code, another one of general cases is about event handler. If you do this:

Foo.FooEvent += new EventHandler(MemoryLeaksHere.Method);

When you complete using MemoryLeaksHere, but you are still using Foo, then MemoryLeaksHere will still remain alive as well. MemoryLeaksHere object will leak memory as a result of failing to GC.

Let us take a look at one simple example first.

using System;

namespace MemoryLeakSample

{

class Foo

{

public static Foo myFoo;

public event EventHandler FooEvent;

public Foo()

{

myFoo = this;

}

public void FooMethod()

{

MemoryLeaksHere memLeak = new MemoryLeaksHere();

memLeak.TryQuit();

}

public void FireEvent()

{

FooEvent(null, null);

}

static void Main(string[] args)

{

Foo foo = new Foo();

for (int i = 0; i < 5; ++i)

{

foo.FooMethod();

}

GC.Collect();

GC.WaitForPendingFinalizers();

GC.Collect();

Console.WriteLine(“Check memory leak here.”);

}

}

/// <summary>

/// This object will cause memory leak

/// </summary>

public class MemoryLeaksHere

{

public MemoryLeaksHere()

{

Foo.myFoo.FooEvent += new EventHandler(OnMyFooEventFired);

Console.WriteLine(“\nObject-{0}: Construct. Subscribe.”, this.GetHashCode());

}

~MemoryLeaksHere()

{

Console.WriteLine(“Object-{0}: Deconstruct.”, this.GetHashCode());

}

public void TryQuit()

{

Console.Write(“Object-{0}: leak me?”, this.GetHashCode());

string input = Console.ReadLine();

if (string.Equals(input, “no”))

{

Foo.myFoo.FooEvent -= new EventHandler(OnMyFooEventFired);

Console.WriteLine(“Object-{0}: Unsubscribe.”, this.GetHashCode());

}

else

{

Console.WriteLine(“Object-{0}: Not Unsubscribe”, this.GetHashCode());

}

}

private void OnMyFooEventFired(object sender, EventArgs e)

{

// Do something

}

}

}

In MemoryLeaksHere object’s constructor, Foo starts to hold a reference to MemoryLeaksHere by registering event handler. In MemoryLeaksHere.TryQuit(), if we don’t unregister, memory leak will happen.

To be more intuitive, you can copy/paste sample code to VS2008, and then enable unmanged code debugging by following:

Project->Properties->Debug->Enable Unmanaged Code debugging

Now set a breakpoint at Check memory leak here”, and start build/debug. When being asked leak me or not, you can choose either yes or no. For example:

Here, looks like we leak two of them. Finally app will hit the breakpoint and stop. At this point, we can go to VS immedate window to load sos.dll, and then check how many objects in the heap:

!load sos.dll

extension C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll loaded

!dumpheap -type MemoryLeaksHere

PDB symbol for mscorwks.dll not loaded

Address MT Size

0132e7d0 00983104 12

0132eba0 00983104 12

total 2 objects

Statistics:

MT Count TotalSize Class Name

00983104 2 24 MemoryLeakSample.MemoryLeaksHere

Total 2 objects

So now we know there are two object instances are not recycled. Why are they not GC-ed? Because someone has a reference to them. Choose one of them, and use gcroot command.

!gcroot 0132e7d0

Note: Roots found on stacks may be false positives. Run “!help gcroot” for

more info.

Error during command: Warning. Extension is using a callback which Visual Studio does not implement.

Scan Thread 7592 OSTHread 1da8

ESP:12f434:Root:01312d48(MemoryLeakSample.Foo)->

0132f704(System.EventHandler)->

0132f6ec(System.Object[])->

0132e7dc(System.EventHandler)->

0132e7d0(MemoryLeakSample.MemoryLeaksHere)

Scan Thread 4704 OSTHread 1260

Now we can see that MemoryLeakSample.Foo is still referencing MemoryLeakSample.MemoryLeaksHere via event handler. If it is not 5 iterations, image what would happen if every incoming request results in a slice of memory leak… Soon or later, you online service will be down.

See also:

http://www.codeproject.com/KB/dotnet/Memory_Leak_Detection.aspx

http://blogs.msdn.com/jgoldb/archive/2008/02/04/finding-memory-leaks-in-wpf-based-applications.aspx

http://blogs.msdn.com/calvin_hsia/archive/2008/04/11/8381838.aspx

http://www.automatedqa.com/techpapers/net_allocation_profiler.asp

http://blogs.msdn.com/greg_schechter/archive/2004/05/27/143605.aspx

Designing Your Own Recent Posts Widget for MSDN Blog

July 29th, 2009 Bali 1 comment

In my MSDN blog, I need “Recent Posts”, but I don’t need archive side bar. After having played with template for a while, still no luck. Hmmm, looks like I have to DIY it. Fortunately in News sidebar, you can fill in raw html including JavaScript. Then next question is where we can retrieve post tiles. The immediate idea is from current DOM document. Through experiments, I found this is impossible because the DOM is not fully loaded yet when the script is executed. Later, I figured it out that all posts title can be gotten from RSS. For my blog, the address is http://blogs.msdn.com/bali_msft/rss.xml. One thing worth noticing is the fact that RSS in MSDN blog is not up to date – Your post will not instantly appear in the RSS. After I get all posts in RSS format, things became much easier. And then I go ahead to add more interesting things:

  • Posts background use two colors in turn
  • Show a new tag for posts less than 3 days old
  • Show latest 8 posts only
  • Show posts’ age

So, the final thing will look like this:

If you find it is useful, feel free to paste below code to you blog’s news section. Note to customize “Configurable params” to your own needs and leave other code intact. It works well at least in my IE 8 and Firefox 3.0.6.

<div id=”RecentPosts”></div>

<Script>

// Configurable params

var recentPostNumber = 8;

var rssUrl = “http://blogs.msdn.com/bali_msft/rss.xml”;

var title = “My Recent Posts”;

var newPostAgeInHour = 72;

// Cacluate age of one post. It is all about getting time span in Javascript

// return formate: x min; x hour y min, x day y min, x days, x yeas (ago)

// Refer to: http://www.w3schools.com/jsref/jsref_obj_date.asp

function calculateAge(postDate)

{

var ret = “fresh!”;

CurrentDate = new Date();

TimeSpan = new Date(CurrentDate – postDate);

var mySpanArray = new Array();

mySpanArray[0] = TimeSpan.getUTCFullYear()-1970;
mySpanArray[1] = TimeSpan.getUTCMonth();
mySpanArray[2] = TimeSpan.getUTCDate()-1;
mySpanArray[3] = TimeSpan.getUTCHours();
mySpanArray[4] = TimeSpan.getUTCMinutes();

var TimeSpanTagArray_1 = new Array(“years”, “months”, “days”, “hours”, “minutes”);

var TimeSpanTagArray_2 = new Array(“year”, “month”, “day”, “hour”, “minute”);

// Starting from non-zero element and pick two significant values

for(i = 0; i < mySpanArray.length; i++) {

if(mySpanArray[i] != 0) {

var correctTag = (mySpanArray[i] == 1)?(TimeSpanTagArray_2[i]):(TimeSpanTagArray_1[i]);

ret = mySpanArray[i] + ” “ + correctTag;

if(i+1 < mySpanArray.length && mySpanArray[i+1] != 0) {

correctTag = (mySpanArray[i+1] == 1)?(TimeSpanTagArray_2[i+1]):(TimeSpanTagArray_1[i+1]);

ret = ret + “, “ + mySpanArray[i+1] + ” “ + correctTag;

}

break;

}

}

return ret;

}

// Display the recent posts

// Refer to:

// http://www.w3schools.com/DOM/dom_node.asp

// http://www.w3schools.com/DOM/dom_methods.asp

function displayPosts (xmldoc)

{

var newTag = “<SPAN style=\”COLOR: red\”>(New!)</SPAN>”;

var posts = xmldoc.getElementsByTagName(“item”);

var displayText = “<h3>” + title + “</h3><UL>”;

if (posts.length < recentPostNumber) {

recentPostNumber = posts.length;

}

for(var i = 0; i < recentPostNumber; i++)

{

PostTitle = posts[i].firstChild.firstChild.nodeValue;

PostLink = posts[i].firstChild.nextSibling.firstChild.nodeValue;

PostDateStr = posts[i].firstChild.nextSibling.nextSibling.firstChild.nodeValue;

PostDate = new Date(PostDateStr);

CurrentDate = new Date();

// Calculate age

var _PostAge = calculateAge(PostDate);

var PostAge = “<SPAN style=\”font-size: 80%; color: black\”> (“ +_PostAge + ” ago)</SPAN>”;

// Show a new tag for posts happening last days defined by ‘newPostAgeInHour’

var myNewTag = “”;

if((CurrentDate.getTime() – PostDate.getTime())/1000/60 < newPostAgeInHour * 60) {

myNewTag = newTag;

}

// Get background color

var BKColor = (i%2 == 0)?(“#B8CCE4″):(“#DBE5F1″);

displayText = displayText + “<LI style=\”background-color:” + BKColor + “\”><A href=\”" + PostLink + “\”>” + myNewTag + PostTitle + PostAge + “</A></LI>”

}

displayText = displayText + “</UL>”;

var target = document.getElementById(“RecentPosts”);

target.innerHTML=displayText;

}

// Call back

function complete(){

if (req.readyState == 4) {

if (req.status == 200) {

displayPosts (req.responseXML);

}

}

}

// Initial async call

function getPosts()

{

if (window.XMLHttpRequest) {

req = new XMLHttpRequest();

}else if (window.ActiveXObject) {

req = new ActiveXObject(“Microsoft.XMLHTTP”);

}

if(req){

req.open(“GET”, rssUrl, true);

req.onreadystatechange = complete;

req.send(null);

}

}

// Entry point

getPosts();

</Script>

Searching For a Number in Shifted Sorted Array within O(log(n)) time

July 29th, 2009 Bali No comments

Run into the algorithm problem long time ago. Now post my answer here. A sorted array, say: {1,2,3,4,5,6,7,8,9,10,11,12}, do right rotate through carry unknown times, and then it might become: {6,7,8,9,10,11,12,1,2,3,4,5}. Now we need get the index of a given number, say 4, from the array within O(log(n)) time. Apparently a 8-year-old can get it done with O(n) time.

We can think of it this way: take the middle element of array, if target is found, fine; if not, and then array become two parts, one is sorted array, the other is shifted sorted array. As illustrated as below diagram:

If the target falls into the sorted array half, we can simple do a binary search; otherwise, repeat this operation in the other half in recursive way. You can see this is divide-and-conquer algorithm. Obviously this is O(log(n)).

Code

//

// A typical binary search implementation

//

int _BinarySearch(unsigned int ShiftedArray[], unsigned int start, unsigned int end, unsigned int target)

{

// Not found

if( start == end && ShiftedArray[start] != target) {

return -1;

}

unsigned int middle = start + (end – start)/2;

if(target == ShiftedArray[middle])

{

return middle;

} else if (target > ShiftedArray[middle]) {

return _BinarySearch(ShiftedArray, middle + 1, end, target);

} else {

return _BinarySearch(ShiftedArray, start, middle – 1, target);

}

}

//

// Select a given number from shifted array.

// ShiftedArray is something like = {6,7,8,9,10,11,12,1,2,3,4,5}

// If found, return index of the number; if not, reutrn -1

// Require log(N)

//

int SearchShiftedArray(unsigned int ShiftedArray[], unsigned int start, unsigned int end, unsigned int target)

{

// Start meets end

if( start == end && ShiftedArray[start] != target) {

return -1;

}

unsigned int middle = start + (end – start)/2;

if(target == ShiftedArray[middle])

{

return middle;

} else if(ShiftedArray[middle] < ShiftedArray[start]) { // Right half is sorted linearly

if((target > ShiftedArray[middle]) && (target <= ShiftedArray[end])) {

return _BinarySearch(ShiftedArray, middle + 1, end, target);

} else {

return SearchShiftedArray(ShiftedArray, start, middle-1, target);

}

} else { // Left half is sorted linearly

if((target >= ShiftedArray[start]) && (target < ShiftedArray[middle])) {

return _BinarySearch(ShiftedArray, start, middle – 1, target);

} else {

return SearchShiftedArray(ShiftedArray, middle + 1, end, target);

}

}

}

Test cases

Positive: {6,7,8,9,10,11,12,1,2,3,4,5}, target = 3, target = 8

Negative: {6,7,8,9,10,11,12,1,2,3,4,5}, target = 0, target = 13

Boundary: {6,7,8,9,10,11,12,1,2,3,4,5}, target = 6, target = 5

Exceptional: {…max}, target = max

One more interesting thing is the statement that “only about 10 percent of the professional programmers implemented binary search correctly.” Do you know why? Check this.

To Next Cuil

July 29th, 2009 Bali No comments

Cuil, another so-called Google killer, is at its last gasp. I just knew it. I am not predicting present. Cuil is not the first one, and apparently not the last. For upcoming cuils, here are my words.

Brand. Brand. Brand.

For many people, word of Google has close sentimental connection with bunch of splendid words such as cool, innovation, unselfish, impartial, revolution, and powerful, etc… With brand, Google claims that “People don’t work at Google for the money. They work at Google because they want to change the world!”. With brand, debut of Google’s every new service always arouses buzzes, but seldom notices that Live also has compelling equivalence. With brand, people think only Google can provide best results, but often they can’t tell who is search provider when presented anonymous results set. It is very interesting to take a look at curve of Cuil’s daily unique visitors:

Curl's Daily Unique Visitors

At launch momentum, people rushed to see what this Google killer looks like because of Google’s brand. Ridiculous? Not actually. It is everyone’s inherent attributes as people love to check out events of small probability such as Shoes thrown at Bush, one crazy million-dollar idea. As part of branding strategy, naming is essential. Cuil might not a good name actually. Let me share a story of mine. Back to several years ago, a group of my friends decided to build a website aimed to provide 3rd service for franchising, called JiaMeng in Chinese. The guys with solid academic management background came with the domain name of 51franchise.com. It turned out a real trouble – hard to explain to customers, not localized. Even ordinary college students don’t know the word franchise, not to mention clients with much less schooling. So, ditu.live.com for Chinese is much better than chinamap.live.com if you take a look at average education level of internet users. All in all, BRAND works like religion, and it takes lifetime to build.

Brand -> prouducts

“A Google approach to email” – see how brand helps product marketing.

Infrastructure

GFS. BigTable. MapReduce. They can be competitive advantages. With these put in place, Google can roll out new internet services faster, cheaper, and at scale at few others can compete with. They are designed solely for Internet services. Users quit quickly after dissatisfied performance experience in Cuil. Microsoft software is mainly for an enterprise, supporting 100K concurrent users is “good enough”, but it is far more perfect in internet scenario.


Understand/Repsect Customers


There is no one-size-fits-all solution given the growingly diversified market. Of course you can educate customers, but never expect to change their inherent attributes coming from culture/history/economic development level. If you doubt this claim, check out this article: Search site moves at the speed of China, which reports, “But appreciating such cultural differences is what Baidu.com Inc.’s chief financial officer, Shawn Wang, says gives the Chinese search giant unique insight into the country’s 1.3 billion people as it competes with American rivals such as Google Inc. and Yahoo Inc.” As a result:

Baidu beats Google in China market

Culture

Per Wikipedia, culture means the set of shared attitudes, values, goals, and practices that characterizes an institution, organization or group. Google’s business is built on top of internet, so its organization/knowledge base is built for the internet, just like Microsoft is built for software, mainly enterprise software. I met strong feature PM with deep knowledge needed for enterprise software, say reporting, admin UI, DB admin UI, and information work flow. They understand their customers so much after years of interactions with them. It takes time to accumulate. Top-down hierarchy, heavyweight development process, years of in-house development can hardly catch up with the pace of internet evolution. The same thing is applied to Google – I am equally not optimistic if Google step into enterprise software because of the same reason – culture, enterprise’s DNA.

Web Competition Strategy

What is Cuil’s selling point? (1) Fancy UI. UI is critical for adoption and usage, but it hardly provides a moat. This is provided by two case studies of Apple computer of the nineties and the “X window” system on *nix OS. Both these systems with more attractive UI couldn’t beat windows OS with lower cost and rich applications available. (2) More relevant result. This is an ambiguous area which lacks of widely accepted measure criteria. (3) Cheaper solution. There is a question of sunk cost, of course you can claim you are 1/10 cheaper once reaching Google’s current scale. None of these is compelling from users’ point of view. Why do users bother to go to your site instead? One of the significant differences between web service (say, search) and traditional software business(say, DB) is purchasing decision making process. DB vendors can send to salesmen to target customers’ office and argue the deal. Only quite a few key persons have the final call. They are more analytical, love data. As comparison, everyone can be customers of search, we are more emotional. If I don’t miss anything, looks like the best strategy to monetize Cuil is to be acquired by Google.

No chance to win in search?

Definitely No. But you are doomed to fail if following essential parts are missed:

  1. Remember brand. Remember “winners take all”.
  2. Build your DNA towards internet. DNA = SUM(people, team arch, process, knowledge, …)
  3. Put infrastructure in place. This is the way to help turn your idea into profitable traffic. Not scale-up, scale-out instead.
  4. One thumb rule to compete with dominant market leader
    • Avoid playing games whose rules are set by opponents. You can hardly win. In this case, better search engine defined by Google are faster, relevant results, simple UI, magic algorithm, PB of data, … Let us think of solving same problems with different approaches. Why search? Help explore and share information. If someone tries to solve this problem by following Yahoo’s tail light to build yet another portal, he has little change to take off. Another example is download – P2P technology solved the download problem without adding more expensive servers/bandwidth.
    • Attack opponents’ weak points. Google is designed to search everything, but it may not be good at all vertical industries, say shopping. Nibble at its market share if we can’t win in head-to-head way.
  5. Before rolling up sleeves, why we have to win? Why not step away and go find next big thing? Let Google be Google.