The idea is simple: you give the system a long URL, and you get back a short URL that can be shared with others. When someone clicks the short URL, the system redirects them to the corresponding long URL. The main advantage of a TinyURL-style service is that long URLs are messy to share in a chat or on social media; a short URL keeps the message clean while still redirecting to the original.
Example
Long url : https://www.linkedin.com/in/jaswanth-reddy-p/
Short url : https://tinyurl.com/389zf6pb
Let us assume
Avg writes per second : 1 million / (24 × 60 × 60) ≈ 12 URLs / sec
Avg reads per second : writes per second × 100 = 12 × 100 = 1,200 / sec
Our Database Contains :
| Field | Bytes |
|---|---|
| Long Url | 50 |
| Short Url | 7 |
| Expire Date | 8 |
| Click Count | 4 |
Total: 50 + 7 + 8 + 4 ≈ 70 bytes per row
Assume we create around 365 million URLs per year: 365 million × 70 bytes ≈ 25 GB per year.
25 GB is very easy with current hardware and is easily handled by a single system.
POST /api/service/shorten
{
    long_url* : <LONG URL>
    expire_date : <EXPIRE DATE>
}
Returns:
{
    short_url : <SHORT URL>
}

GET /{short_url}
Returns: the long URL corresponding to the short URL
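The two endpoints above can be sketched as plain functions before we worry about uniqueness or scale. This is a minimal in-memory sketch: `_store`, the `https://tiny.example/` domain, and the hex placeholder encoding are all assumptions for illustration; a real service would sit behind a web framework, use a database, and generate codes with the counter scheme discussed later.

```python
import itertools
import time

_counter = itertools.count(1)   # stands in for the global counter discussed later
_store = {}                     # short code -> (long_url, expire_ts)

def shorten(long_url, expire_ts=None):
    """POST /api/service/shorten -> {"short_url": ...}"""
    code = format(next(_counter), "x")   # placeholder; base-62 encoding is covered later
    _store[code] = (long_url, expire_ts)
    return {"short_url": f"https://tiny.example/{code}"}

def resolve(code):
    """GET /{short_url} -> the long URL, or None if missing or expired."""
    entry = _store.get(code)
    if entry is None:
        return None
    long_url, expire_ts = entry
    if expire_ts is not None and expire_ts < time.time():
        return None
    return long_url
```

Resolving an unknown or expired code returns `None`, which the web layer would translate into a 404.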
How it works
Why a 301 redirect? Can't we use 302? A 301 (permanent) redirect lets browsers cache the mapping, which reduces load on our servers but means repeat clicks never reach us; a 302 (temporary) redirect sends every click through our service, which is better if we want to track click counts and analytics.
Cool, now we are able to shorten the long URL and serve the client. But how are we handling the uniqueness of the short URL? Is this system fast and scalable? Does it handle huge traffic? Does it handle malicious attacks?
First we will go through how to create unique url:
Bad Way
Generate a random 7-character code (a random number encoded to base 62). Initially this handles the task with simple check-and-assign logic, but the amount of data we are handling is very high; as the data grows, the collision rate also grows, which causes high write latency.
Good way
Keep a global counter and increment it every time you create a new URL: use the current count as the id, encode it to base 62, and increase the count by 1. This avoids URL collisions entirely.
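The counter-to-short-code step is a straightforward base-62 encoding. A sketch, assuming the usual `0-9a-zA-Z` alphabet (any fixed 62-character alphabet works, as long as every server uses the same one):

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(num):
    """Encode a non-negative counter value as a base-62 short code."""
    if num == 0:
        return BASE62[0]
    out = []
    while num:
        num, rem = divmod(num, 62)
        out.append(BASE62[rem])
    return "".join(reversed(out))
```

Since distinct counter values always produce distinct codes, uniqueness is guaranteed by construction, and 7 characters give 62^7 ≈ 3.5 trillion possible codes.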
Best way
A single counter blocks us from scaling the database, but multiple independent counters would cause URL collisions.
One way to handle this: keep a global counter, and every time we create a short URL, use it to generate the id. For this to work, we store the global counter in Redis and send it a request to return a unique id. Here we have two problems: every write needs a round trip to Redis, and parallel requests can contend for the same counter value.
We can tackle both by assigning id batches to each write server. Say the counter is at 1,000,000 and we have 5 write servers: we assign 1,000 ids to each server, i.e. 1,000,000–1,001,000, 1,001,000–1,002,000, and so on. This solves both problems: since each server knows exactly which ids it may use, it doesn't need to consult the global counter until its 1,000 ids are exhausted, and because each batch is owned by a single server, parallel requests don't cause collisions. To be extra safe, we can route requests through a queue, which also avoids collisions.
But if Redis goes down, we lose the counter value. In that case we either fail over to a replica, if we have one, or take the highest id across the servers' current batches and resume from there.
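The batch-allocation scheme above can be sketched with Redis's atomic `INCRBY`. `FakeRedis`, `IdAllocator`, and the batch size of 1,000 are illustrative assumptions; a real deployment would call `redis.incrby("counter", 1000)` against an actual Redis instance.

```python
import threading

class FakeRedis:
    """Stand-in for the shared counter; real code would call redis.incrby()."""
    def __init__(self, start=1_000_000):
        self._val = start
        self._lock = threading.Lock()

    def incrby(self, key, amount):
        # INCRBY is atomic in Redis, so two servers can never get the same batch.
        with self._lock:
            self._val += amount
            return self._val

class IdAllocator:
    """Each write server leases a batch of ids and only contacts the
    central counter once per BATCH ids; leased ranges never overlap."""
    BATCH = 1000

    def __init__(self, redis):
        self._redis = redis
        self._next = 0   # next id to hand out
        self._end = 0    # exclusive end of the current batch
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:
                # Lease a fresh batch: INCRBY returns the new counter value,
                # so our range is [end - BATCH, end).
                self._end = self._redis.incrby("counter", self.BATCH)
                self._next = self._end - self.BATCH
            nid = self._next
            self._next += 1
            return nid
```

Because each allocator owns its leased range outright, parallel writers never collide, and the central counter sees one request per 1,000 ids instead of one per write.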
Another way: we can use a distributed coordination service like ZooKeeper to solve the usual challenges of a distributed system, such as race conditions, deadlocks, or partial failure. ZooKeeper manages a large set of hosts: it keeps track of things like the naming of servers, active database servers, dead servers, and configuration information (which server is managing which range of the counter), and it provides coordination and synchronization between the servers. So the alternative is to maintain the counter for our distributed hosts using ZooKeeper.
It doesn't matter hugely, but our schema is simple, there is no table joining, and we need scaling, so SQL is not the best option.
We need strong, fast read operations. In a SQL DB like Postgres, a lookup takes O(log n) via a B-tree index, which is fast, but in a NoSQL store like DynamoDB a key-value hash lookup is O(1), and scaling is also easier in NoSQL compared to SQL.
Deleting short URLs is also easy: DynamoDB has TTL, which deletes expired URLs automatically, so there is no need for background jobs or manual cleanup. But relying on it means we lose the short URL's id for reuse; we can fix that by implementing the TTL sweep ourselves and saving the deleted short URL ids.
We can't reuse these ids directly, since we assign short URLs based on the counter. One option is to simply forfeit them, which doesn't matter much if we have enough ids to last for years without reuse.
But if traffic is very high and we want to track deleted keys as well, we save the deleted/expired ids into a queue or DB. Before asking the counter for a new value, we check whether any recycled id is available; if so we use it, otherwise we take a fresh id from the counter.
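The recycle-before-allocate check above is a few lines. This sketch uses an in-process deque and a plain counter as stand-ins; in the real system `recycled` would be the queue/DB of expired ids and the fallback would be the batch allocator.

```python
import itertools
from collections import deque

_counter = itertools.count(1_000_000)  # stands in for the batch/Redis counter
recycled = deque()                      # ids reclaimed from expired short URLs

def reclaim(short_id):
    """Called by our own TTL sweep when a short URL expires."""
    recycled.append(short_id)

def next_short_id():
    """Prefer a recycled id; only ask the counter when the pool is empty."""
    if recycled:
        return recycled.popleft()
    return next(_counter)
```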
We can cache URLs that are frequently accessed. We can use some off-the-shelf solution like Memcached, which can store full URLs with their respective hashes. Before hitting backend storage, the application servers can quickly check if the cache has the desired URL.
We can start with 20% of daily traffic and, based on clients' usage patterns, adjust how many cache servers we need. Per our assumptions we get around 1,200 requests per second: 1,200 × 24 × 60 × 60 × 70 bytes ≈ 7 GB per day, and 20% of 7 GB is approximately 1.5 GB. Since a modern server can have 256 GB of memory, we can easily fit the whole cache into one machine. All our assumptions are small; with high traffic, or if the URLs are very long, these values change drastically.
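The cache-sizing arithmetic is easy to check. The 1,200 req/sec, 70-byte row, and 20% hot fraction are the assumptions made earlier in this document:

```python
reads_per_sec = 1200          # from the earlier traffic estimate
row_bytes = 70                # from the earlier per-row estimate
hot_fraction = 0.20           # cache the hottest 20% of daily traffic

daily_bytes = reads_per_sec * 24 * 60 * 60 * row_bytes   # ~7.26 GB/day
cache_bytes = int(daily_bytes * hot_fraction)            # ~1.45 GB
```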
How do we decide which URLs to keep in the cache, and which URL to evict when the cache is full?
Every time we get a GET request, we look in the cache; if the URL is not present, we forward the request to the DB and fetch it. This is simple, but the cache will eventually fill up, so which URL do we remove? We can use the Least Recently Used (LRU) policy: discard the least recently used URL first. To implement it, the read, write, and delete operations all need to be fast.
We could use a linked list, but which one, singly or doubly linked? For fast deletes a doubly linked list is better, but search in a linked list is O(n), which is not good.
So we use a linked hash map, which gives O(1) search, while the linked-list part gives O(1) insert and delete.
Every time we get a short URL, we check the cache: if it is present, we move it to the front, which takes O(1); if it is not, we fetch it from the DB and add it to the front of the list. If the cache is full, we evict from the end, which holds the least recently used URLs.
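The LRU logic above can be sketched with Python's `OrderedDict`, which is exactly a linked hash map (hash lookup plus a doubly linked list over the entries). The class name and capacity are illustrative; a production system would use Memcached/Redis, which implement eviction for you.

```python
from collections import OrderedDict

class LRUCache:
    """Linked hash map LRU: O(1) get, put, and eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._map = OrderedDict()   # short code -> long URL, most recent last

    def get(self, code):
        if code not in self._map:
            return None             # miss: caller falls back to the database
        self._map.move_to_end(code) # mark as most recently used
        return self._map[code]

    def put(self, code, long_url):
        if code in self._map:
            self._map.move_to_end(code)
        self._map[code] = long_url
        if len(self._map) > self.capacity:
            self._map.popitem(last=False)   # evict the least recently used entry
```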
System design:
No. of servers for reads > no. of servers for writes, since we have a high read-to-write ratio. Keeping separate servers for reads and writes is entirely a choice; we could use a single server as well, but keeping them independent lets us scale each side based on its own demand.
Now, does it handle DoS? A potential security problem is malicious users sending an overwhelmingly large number of URL-shortening requests. A rate limiter helps filter out requests based on IP address or other filtering rules.