The idea is simple: you give the system a long URL, and you get back a short URL that can be shared with others. When someone clicks the short URL, the system redirects them to the corresponding long URL. The main advantage of a TinyURL-style service is that long URLs are messy to share in a chat or on social media; a short URL keeps the message clean while still redirecting to the original.
Example
Long url : https://www.linkedin.com/in/jaswanth-reddy-p/
Short url : https://tinyurl.com/389zf6pb
Let us assume
Avg writes per second : 1 million / (24 × 60 × 60) ≈ 12 URLs / sec
Avg reads per second : writes per second × 100 = 12 × 100 = 1,200 / sec
Our Database Contains :
| Field | Bytes |
|---|---|
| Long Url | 50 |
| Short Url | 7 |
| Expire Date | 8 |
| Click Count | 4 |
Total: 50 + 7 + 8 + 4 ≈ 70 bytes per row
Assume we create around 365 million URLs per year: 365 million × 70 bytes ≈ 25 GB per year.
25 GB is very easy with current hardware and is easily handled by a single system.
POST /api/service/shorten
{
    long_url* : <LONG URL>
    expire_date : <EXPIRE DATE>
}
Returns:
{
    short_url : <SHORT URL>
}

GET /{short_url}
Returns: the long URL corresponding to the short URL
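The two endpoints above can be sketched as plain functions before we worry about uniqueness or scale. This is a minimal in-memory sketch: `_store`, the `https://tiny.example/` domain, and the hex placeholder encoding are all assumptions for illustration; a real service would sit behind a web framework, use a database, and generate codes with the counter scheme discussed later.

```python
import itertools
import time

_counter = itertools.count(1)   # stands in for the global counter discussed later
_store = {}                     # short code -> (long_url, expire_ts)

def shorten(long_url, expire_ts=None):
    """POST /api/service/shorten -> {"short_url": ...}"""
    code = format(next(_counter), "x")   # placeholder; base-62 encoding is covered later
    _store[code] = (long_url, expire_ts)
    return {"short_url": f"https://tiny.example/{code}"}

def resolve(code):
    """GET /{short_url} -> the long URL, or None if missing or expired."""
    entry = _store.get(code)
    if entry is None:
        return None
    long_url, expire_ts = entry
    if expire_ts is not None and expire_ts < time.time():
        return None
    return long_url
```

Resolving an unknown or expired code returns `None`, which the web layer would translate into a 404.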
How it works
Why a 301 redirect? Can't we use 302? A 301 (permanent) redirect lets browsers cache the mapping, which reduces load on our servers but means repeat clicks never reach us; a 302 (temporary) redirect sends every click through our service, which is better if we want to track click counts and analytics.
Cool, now we are able to shorten the long URL and serve the client. But how are we handling the uniqueness of the short URL? Is this system fast and scalable? Does it handle huge traffic? Does it handle malicious attacks?
First we will go through how to create unique url:
Bad Way
Generate a random 7-character code (a random number encoded to base 62). Initially this handles the task with simple check-and-assign logic, but the amount of data we are handling is very high; as the data grows, the collision rate also grows, which causes high write latency.
Good way
Keep a global counter and increment it every time you create a new URL: use the current count as the id, encode it to base 62, and increase the count by 1. This avoids URL collisions entirely.
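The counter-to-short-code step is a straightforward base-62 encoding. A sketch, assuming the usual `0-9a-zA-Z` alphabet (any fixed 62-character alphabet works, as long as every server uses the same one):

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(num):
    """Encode a non-negative counter value as a base-62 short code."""
    if num == 0:
        return BASE62[0]
    out = []
    while num:
        num, rem = divmod(num, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))
```

Since distinct counter values always produce distinct codes, uniqueness is guaranteed by construction, and 7 characters give 62^7 ≈ 3.5 trillion possible codes.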
Best way
A single counter blocks us from scaling the database, but multiple independent counters would cause URL collisions.
One way to handle this: keep a global counter, and every time we create a short URL, use it to generate the id. For this to work, we store the global counter in Redis and send it a request to return a unique id. Here we have two problems: every write needs a round trip to Redis, and parallel requests can contend for the same counter value.
We can tackle both by assigning id batches to each write server. Say the counter is at 1,000,000 and we have 5 write servers: we assign 1,000 ids to each server, i.e. 1,000,000–1,001,000, 1,001,000–1,002,000, and so on. This solves both problems: since each server knows exactly which ids it may use, it doesn't need to consult the global counter until its 1,000 ids are exhausted, and because each batch is owned by a single server, parallel requests don't cause collisions. To be extra safe, we can route requests through a queue, which also avoids collisions.
But if Redis goes down, we lose the counter value. In that case we either fail over to a replica, if we have one, or take the highest id across the servers' current batches and resume from there.
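The batch-allocation scheme above can be sketched with Redis's atomic `INCRBY`. `FakeRedis`, `IdAllocator`, and the batch size of 1,000 are illustrative assumptions; a real deployment would call `redis.incrby("counter", 1000)` against an actual Redis instance.

```python
import threading

class FakeRedis:
    """Stand-in for the shared counter; real code would call redis.incrby()."""
    def __init__(self, start=1_000_000):
        self._val = start
        self._lock = threading.Lock()

    def incrby(self, key, amount):
        # INCRBY is atomic in Redis, so two servers can never get the same batch.
        with self._lock:
            self._val += amount
            return self._val

class IdAllocator:
    """Each write server leases a batch of ids and only contacts the
    central counter once per BATCH ids; leased ranges never overlap."""
    BATCH = 1000

    def __init__(self, redis):
        self._redis = redis
        self._next = 0   # next id to hand out
        self._end = 0    # exclusive end of the current batch
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:
                # Lease a fresh batch: INCRBY returns the new counter value,
                # so our range is [end - BATCH, end).
                self._end = self._redis.incrby("counter", self.BATCH)
                self._next = self._end - self.BATCH
            nid = self._next
            self._next += 1
            return nid
```

Because each allocator owns its leased range outright, parallel writers never collide, and the central counter sees one request per 1,000 ids instead of one per write.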
Another way: we can use a distributed coordination service like ZooKeeper to solve the usual challenges of a distributed system, such as race conditions, deadlocks, or partial failure. ZooKeeper manages a large set of hosts: it keeps track of things like the naming of servers, active database servers, dead servers, and configuration information (which server is managing which range of the counter), and it provides coordination and synchronization between the servers. So the alternative is to maintain the counter for our distributed hosts using ZooKeeper.
It doesn't matter hugely, but our schema is simple, there is no table joining, and we need scaling, so SQL is not the best option.
We need strong, fast read operations. In a SQL DB like Postgres, a lookup takes O(log n) via a B-tree index, which is fast, but in a NoSQL store like DynamoDB a key-value hash lookup is O(1), and scaling is also easier in NoSQL compared to SQL.
Deleting short URLs is also easy: DynamoDB has TTL, which deletes expired URLs automatically, so there is no need for background jobs or manual cleanup. But relying on it means we lose the short URL's id for reuse; we can fix that by implementing the TTL sweep ourselves and saving the deleted short URL ids.
We can't reuse these ids directly, since we assign short URLs based on the counter. One option is to simply forfeit them, which doesn't matter much if we have enough ids to last for years without reuse.
But if traffic is very high and we want to track deleted keys as well, we save the deleted/expired ids into a queue or DB. Before asking the counter for a new value, we check whether any recycled id is available; if so we use it, otherwise we take a fresh id from the counter.
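The recycle-before-allocate check above is a few lines. This sketch uses an in-process deque and a plain counter as stand-ins; in the real system `recycled` would be the queue/DB of expired ids and the fallback would be the batch allocator.

```python
import itertools
from collections import deque

_counter = itertools.count(1_000_000)  # stands in for the batch/Redis counter
recycled = deque()                      # ids reclaimed from expired short URLs

def reclaim(short_id):
    """Called by our own TTL sweep when a short URL expires."""
    recycled.append(short_id)

def next_short_id():
    """Prefer a recycled id; only ask the counter when the pool is empty."""
    if recycled:
        return recycled.popleft()
    return next(_counter)
```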
We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcached, which can store full URLs with their respective hashes. Before hitting backend storage, the application servers can quickly check if the cache has the desired URL.
We can start with 20% of daily traffic and, based on clients' usage patterns, adjust how many cache servers we need. Per our assumptions we get around 1,200 requests per second: 1,200 × 24 × 60 × 60 × 70 bytes ≈ 7 GB per day, and 20% of 7 GB is approximately 1.5 GB. Since a modern server can have 256 GB of memory, we can easily fit the whole cache into one machine. All our assumptions are small; with high traffic, or if the URLs are very long, these values change drastically.
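The cache-sizing arithmetic is easy to check. The 1,200 req/sec, 70-byte row, and 20% hot fraction are the assumptions made earlier in this document:

```python
reads_per_sec = 1200          # from the earlier traffic estimate
row_bytes = 70                # from the earlier per-row estimate
hot_fraction = 0.20           # cache the hottest 20% of daily traffic

daily_bytes = reads_per_sec * 24 * 60 * 60 * row_bytes   # ~7.26 GB/day
cache_bytes = int(daily_bytes * hot_fraction)            # ~1.45 GB
```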
How do we decide which URLs to keep in the cache, and which URL to evict when the cache is full?
Every time we get a GET request, we look in the cache; if the URL is not present, we forward the request to the DB and fetch it. This is simple, but the cache will eventually fill up, so which URL do we remove? We can use the Least Recently Used (LRU) policy: discard the least recently used URL first. To implement it, the read, write, and delete operations all need to be fast.
We could use a linked list, but which one, singly or doubly linked? For fast deletes a doubly linked list is better, but search in a linked list is O(n), which is not good.
So we use a linked hash map, which gives O(1) search, while the linked-list part gives O(1) insert and delete.
Every time we get a short URL, we check the cache: if it is present, we move it to the front, which takes O(1); if it is not, we fetch it from the DB and add it to the front of the list. If the cache is full, we evict from the end, which holds the least recently used URLs.
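The LRU logic above can be sketched with Python's `OrderedDict`, which is exactly a linked hash map (hash lookup plus a doubly linked list over the entries). The class name and capacity are illustrative; a production system would use Memcached/Redis, which implement eviction for you.

```python
from collections import OrderedDict

class LRUCache:
    """Linked hash map LRU: O(1) get, put, and eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._map = OrderedDict()   # short code -> long URL, most recent last

    def get(self, code):
        if code not in self._map:
            return None             # miss: caller falls back to the database
        self._map.move_to_end(code) # mark as most recently used
        return self._map[code]

    def put(self, code, long_url):
        if code in self._map:
            self._map.move_to_end(code)
        self._map[code] = long_url
        if len(self._map) > self.capacity:
            self._map.popitem(last=False)   # evict the least recently used entry
```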
System design:
No. of servers for reads > no. of servers for writes, since we have a high read-to-write ratio. Keeping separate servers for reads and writes is entirely a choice; we could use a single server as well, but keeping them independent lets us scale each side based on its own demand.
Now, does it handle DoS? A potential security problem is malicious users sending an overwhelmingly large number of URL-shortening requests. A rate limiter helps filter out requests based on IP address or other filtering rules.