1. What is Apache NiFi?
Apache NiFi is a free, open-source application that automates and manages the flow of data between systems. It is a secure
and reliable system for processing and distributing data, with a web-based user
interface for designing, monitoring, and controlling data flows. Its flow-based
model is highly configurable and allows data flows to be modified at runtime.
2. What is the purpose of a NiFi Processor?
The Processor is a core
component of NiFi: it performs the actual work on FlowFile data and can
create, send, receive, transform, route, split, merge, and analyze FlowFiles.
3. What actually is a NiFi FlowFile?
A FlowFile represents a single piece of data, such as a signal, an event, or user data, that is generated in or pushed into NiFi. A
FlowFile has two parts: its content (the data itself) and its attributes,
which are key-value pairs of metadata associated with that content.
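The two-part structure can be sketched as a simple data class. This is a conceptual model for illustration only, not NiFi's actual Java implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    """Conceptual model of a NiFi FlowFile: content bytes plus attribute metadata."""
    content: bytes = b""
    attributes: dict = field(default_factory=dict)

ff = FlowFile(content=b'{"user": "alice"}',
              attributes={"filename": "event.json", "mime.type": "application/json"})
print(ff.attributes["mime.type"])  # attributes are plain key-value string pairs
```

Processors typically read or update the attributes cheaply while leaving the (possibly large) content untouched until it is actually needed.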
4. Describe MiNiFi
MiNiFi is a subproject of NiFi
that extends NiFi's core concepts by focusing on collecting data at the source
where it is generated. Because MiNiFi is meant to run at the source, it puts a
premium on a small footprint and low resource consumption.
5. Is it possible for a NiFi FlowFile to
contain complex data as well?
Yes. In NiFi, a FlowFile
may contain both structured data (such as XML or JSON files) and unstructured
or binary data (such as images).
6. What specifically is a Processor Node?
A Processor Node is a wrapper
around the Processor that manages the processor's state. The Processor Node
is responsible for maintaining:
- The position of the processor in the graph.
- The processor's configuration properties.
- The processor's scheduling state.
7. What does the Reporting Task involve?
A Reporting Task is a NiFi
extension point that reports on and analyzes NiFi's internal
statistics, either transmitting the data to external systems or displaying status
information directly in the NiFi UI.
8. Is the processor capable of committing or
rolling back the session?
Yes, the processor is the
component that can commit or roll back the session. When a Processor
rolls back a session, every FlowFile accessed during that session is
restored to its previous state. If the Processor instead commits the session,
the FlowFile Repository is updated with the relevant information.
9. What does "Write-Ahead-Log" mean
in the context of FlowFileRepository?
It means that every change
made to the FlowFile Repository is first written to a log and checked for
consistency. The log entries are retained to prevent data loss before and
during processing, and the repository is checkpointed frequently so that
state can be recovered after a failure.
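The write-ahead idea can be shown in miniature. The toy sketch below is an illustration of the pattern, not NiFi's repository code: each change is appended to a log before the in-memory state is updated, so the state can be rebuilt by replaying the log after a crash.

```python
import json

class ToyWriteAheadLog:
    """Toy write-ahead log: record each change durably before applying it."""
    def __init__(self):
        self.log = []      # durable log (a list here; a disk file in reality)
        self.state = {}    # the repository state being protected

    def update(self, key, value):
        self.log.append(json.dumps({"key": key, "value": value}))  # log first...
        self.state[key] = value                                    # ...then apply

    def recover(self):
        """Rebuild state by replaying the log, as done after a crash."""
        rebuilt = {}
        for entry in self.log:
            rec = json.loads(entry)
            rebuilt[rec["key"]] = rec["value"]
        return rebuilt

wal = ToyWriteAheadLog()
wal.update("flowfile-1", "queued")
wal.update("flowfile-1", "processed")
assert wal.recover() == wal.state  # replaying the log reproduces the state
```

Periodic checkpointing, which NiFi also performs, simply snapshots the state so the log does not have to be replayed from the beginning.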
10. Does the Reporting Task get access to the
entire contents of the FlowFile?
No, a Reporting Task has no
access to the contents of any specific FlowFile. Instead, a Reporting
Task has access to all Provenance Events, bulletins, and metrics
associated with components in the graph, such as the number of bytes read or written.
Apache NiFi Interview Questions For Experienced
11. What use does FlowFileExpiration serve?
FlowFile expiration determines
when a FlowFile should be removed and destroyed after a certain period of
time. Suppose you set the expiration to 1 hour. The countdown begins as soon
as the FlowFile enters the NiFi platform. When the FlowFile reaches a
connection, the connection checks its age; if it
is older than 1 hour, the FlowFile is dropped and destroyed.
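The age check behind expiration is simple enough to sketch. This is a conceptual illustration, not NiFi's implementation:

```python
import time

def is_expired(entry_timestamp, max_age_seconds=3600, now=None):
    """Return True if a FlowFile older than max_age_seconds should be dropped.
    Conceptual sketch of FlowFile expiration behaviour."""
    now = time.time() if now is None else now
    return (now - entry_timestamp) > max_age_seconds

t0 = 1_000_000.0
assert not is_expired(t0, max_age_seconds=3600, now=t0 + 1800)  # 30 min old: kept
assert is_expired(t0, max_age_seconds=3600, now=t0 + 3601)      # over 1 hour old: dropped
```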
12. What is the NiFi system's backpressure?
Occasionally, the producing
system outpaces the consuming system, so consumption lags behind and
unprocessed messages (FlowFiles) accumulate in the connection queue.
However, you can limit the size of a connection based on either
the number of FlowFiles or the total size of the data. If the limit is
exceeded, the connection applies back pressure to the producing processor,
which stops it from being scheduled. As a result, no more FlowFiles are
created until the backpressure is relieved.
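The object-count threshold can be modeled with a bounded queue. This is a conceptual sketch of the mechanism, not NiFi's scheduler:

```python
from collections import deque

class Connection:
    """Connection with an object-count backpressure threshold (toy model)."""
    def __init__(self, backpressure_object_threshold=3):
        self.queue = deque()
        self.threshold = backpressure_object_threshold

    def is_full(self):
        # When the threshold is reached, the upstream producer stops running.
        return len(self.queue) >= self.threshold

    def offer(self, flowfile):
        if self.is_full():
            return False  # backpressure applied: producer must wait
        self.queue.append(flowfile)
        return True

conn = Connection(backpressure_object_threshold=2)
assert conn.offer("ff1") and conn.offer("ff2")
assert conn.offer("ff3") is False   # threshold hit: producer is throttled
conn.queue.popleft()                # consumer catches up...
assert conn.offer("ff3")            # ...and the producer may resume
```

NiFi applies the same idea with two thresholds per connection, one on object count and one on total data size.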
13. Is it possible to alter the settings of a
processor while it is running?
No, the settings of a
processor cannot be changed while it is running. You must first stop the
processor and wait for any in-flight FlowFiles to finish processing. Only
then can you modify the processor's settings.
14. What use does RouteOnAttribute serve?
RouteOnAttribute routes FlowFiles to different relationships based on their
attribute values, allowing certain FlowFiles within the flow
to be treated differently from others.
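Attribute-based routing can be sketched as evaluating a predicate per FlowFile and sending it to the first matching relationship. This is a conceptual model; the real processor evaluates NiFi Expression Language rules:

```python
def route_on_attribute(flowfile_attributes, rules):
    """Return the name of the first rule whose predicate matches, else 'unmatched'.
    rules: list of (relationship_name, predicate) pairs. Conceptual sketch only."""
    for name, predicate in rules:
        if predicate(flowfile_attributes):
            return name
    return "unmatched"

rules = [
    ("errors", lambda a: a.get("http.status", "").startswith("5")),
    ("json",   lambda a: a.get("mime.type") == "application/json"),
]
assert route_on_attribute({"http.status": "503"}, rules) == "errors"
assert route_on_attribute({"mime.type": "application/json"}, rules) == "json"
assert route_on_attribute({"mime.type": "text/csv"}, rules) == "unmatched"
```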
15. What Is The NiFi Template?
A template is a reusable workflow that
you can export from one NiFi instance and import into others. Templates can
save a lot of time compared to building the same flow repeatedly. A template
is produced as an XML file.
16. What does the term "Provenance
Data" signify in NiFi?
NiFi maintains a data
provenance repository that records everything that happens to a FlowFile. As data
flows through the system and is transformed, routed, split, merged, and delivered
to various endpoints, all of this metadata is recorded in NiFi's Provenance
Repository. Users can search the repository to trace the processing of every single
FlowFile.
17. What is a FlowFile's
"lineageStartDate"?
This FlowFile attribute
records the date and time the FlowFile first entered or was created in the NiFi
system. Even when a FlowFile is cloned, merged, or split, producing child
FlowFiles, the lineageStartDate attribute still reports the timestamp
of the original ancestor FlowFile.
18. How to get data from a FlowFile's
attributes?
Several processors are
available, including ExtractText and EvaluateXQuery, that can extract
data from FlowFile content into attributes. Furthermore, you can write your own
custom processor to meet the same requirement if no off-the-shelf
processor fits.
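What ExtractText does, pulling values out of content with regular expressions and storing the matches as attributes, can be sketched as follows. This is a conceptual illustration, not the processor's code:

```python
import re

def extract_text(content, patterns):
    """Apply named regex patterns to FlowFile content; matches become attributes.
    Conceptual sketch of ExtractText-style behaviour."""
    attributes = {}
    for attr_name, pattern in patterns.items():
        m = re.search(pattern, content)
        if m:
            attributes[attr_name] = m.group(1)  # first capture group becomes the value
    return attributes

content = "user=alice action=login status=200"
attrs = extract_text(content, {"user": r"user=(\w+)", "status": r"status=(\d+)"})
assert attrs == {"user": "alice", "status": "200"}
```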
19. What occurs to the ControllerService when
a DataFlow is used to generate a template?
When a template is produced
from a DataFlow that has an associated ControllerService, a new instance of
the Controller Service is created during the import process.
20. What occurs if you save a passcode in a
DataFlow and use it to generate a template?
A password is a highly
sensitive piece of information, so when a DataFlow is published as a
template, the password is removed. After you import the template into another
NiFi system, whether the same instance or a different one, you must enter the
password again.
Apache NiFi FAQs
21. What is bulleting and how does it benefit
NiFi?
While you can search the
logs for anything noteworthy, it is far handier to have notifications appear
on the board. If a Processor logs anything at the WARNING level or above, a "Bulletin
Indicator" appears on that Processor. This indicator, which resembles
a sticky note, is displayed for five minutes after the event occurs.
If NiFi is running as a cluster, the bulletin also specifies
which node emitted it. Furthermore, you can change the log level
at which bulletins are generated.
22. What is a NiFi process group?
A process group helps
you build a sub-flow that can be embedded in your primary data
flow. Data is sent into and received from a process group through its input
ports and output ports, respectively.
23. What use does a flow controller serve?
The flow controller is the
brain of the operation: it allocates threads for extensions to run and
maintains the schedule that determines when components receive resources to
execute. The Flow Controller acts as the engine, deciding when a thread is
assigned to a particular processor.
24. How does NiFi handle massive payload
volumes in a dataflow?
NiFi can handle very large
payloads because the data moving through NiFi, represented as FlowFiles, is
passed around by reference: a FlowFile's content is only read from the
Content Repository when a processor actually needs it.
25. What is the distinction between NiFi's
FlowFile and Content repositories?
The FlowFile Repository is where
NiFi stores the metadata about each FlowFile currently active in
the flow.
The Content Repository stores
the actual bytes of a FlowFile's content.
26. What does "deadlock in
backpressure" imply?
Suppose you are using a
processor such as PublishJMS to publish data to a destination queue.
If the destination queue is full, your FlowFiles are routed to the failure
relationship, and when you retry the failed FlowFiles, the incoming
connection fills up as well. Backpressure on that connection can then block
new data from arriving, resulting in a backpressure deadlock.
27. What is the remedy for the "back
pressure deadlock"?
There are several
alternatives, including:
- The administrator can temporarily raise the backpressure threshold on the failed connection.
- Another option is to have Reporting Tasks monitor the flow for large queues.
28. How does NiFi ensure the delivery of
messages?
This is accomplished through
a persistent write-ahead log combined with the content
repository.
29. Can you utilize the fixed Ranger setup on
the HDP to work with HDF?
Yes, you can manage HDF with
a single Ranger instance installed on HDP. However, the Ranger that ships with
HDP does not include the NiFi service definition, which must be installed manually.
30. Is NiFi capable of functioning as a
master-slave design?
No. Starting with NiFi 1.0,
a Zero-Master clustering model is used, and every node in the
NiFi cluster performs the same role. The cluster is managed by the Apache
ZooKeeper service: ZooKeeper elects a single node as the Cluster Coordinator,
and failover is handled by ZooKeeper automatically.
The Advantages of Apache NiFi are as follows:
- Apache NiFi offers a web-based User Interface (UI), so it can be used from a web browser via localhost and a configured port.
- In the browser, Apache NiFi uses the HTTPS protocol to secure user interaction.
- It supports the SFTP protocol, which enables fetching data from remote machines.
- It also provides security policies at the user level, the process group level, and for other modules.
- NiFi runs on any device that supports Java.
- It provides real-time control that eases the movement of data between source and destination.
- Apache NiFi supports clustering, so the same flow can run on multiple nodes, each processing different data, which increases the performance of data processing.
- NiFi ships with over 188 processors, and users can create custom processors to support additional types of data systems.
Disadvantages of Apache NiFi
The following are the disadvantages of Apache NiFi.
- Apache NiFi has a state-persistence issue during a primary node switch that can leave processors unable to fetch data from source systems.
- If a user makes a change while a node is disconnected from the cluster, that node's flow.xml becomes invalid, and the node cannot rejoin the cluster until an administrator manually copies the flow.xml from a connected node.
- Working with Apache NiFi requires good knowledge of the underlying systems it connects to.
- Its authorization is not fine-grained at the topic level, and SSL authorization alone might not be sufficient.
- It requires maintaining a chain of custody for data.
Apache NiFi is a data flow system based on the concepts of
flow-based programming. It was originally developed by the National Security Agency (NSA),
and in 2015 it became a top-level project of the Apache Software Foundation.
Every 6-8 weeks, Apache NiFi
releases a new update to meet the requirements of the users.
This Apache NiFi tutorial is
designed for beginners and professionals who want to learn the basics of
Apache NiFi. It includes several sections that provide core knowledge of how to
work with NiFi.
What is Apache NiFi?
Apache NiFi is a robust,
scalable, and reliable system that is used to process and distribute data. It
is built to automate the transfer of data between systems.
- NiFi offers a web-based User Interface for creating, monitoring, and controlling data flows. NiFi stands for NiagaraFiles, the name under which it was developed by the National Security Agency (NSA); it is now maintained by the Apache Software Foundation.
- Apache NiFi is a web-based UI platform in which you define the source, the processors, and the destination for data collection, data transformation, and data delivery, respectively.
- Each processor in NiFi has relationships, which are used when connecting one processor to another.
Why do we use Apache NiFi?
Apache NiFi is open-source;
therefore, it is freely available in the market. It supports several data
formats, such as social feeds, geographical locations, logs, etc.
Apache NiFi supports a wide
variety of protocols such as SFTP, KAFKA, HDFS, etc. which makes this platform
more popular in the IT industry. There are so many reasons to choose Apache
NiFi. They are as follows.
- Apache NiFi helps organizations integrate NiFi with their existing infrastructure.
- It allows users to make use of Java ecosystem features and existing libraries.
- It provides real-time control that lets the user manage the flow of data between any source, processor, and destination.
- It helps visualize the DataFlow at the enterprise level.
- It helps to aggregate, transform, route, fetch, listen, split, and drag-and-drop data flows.
- It allows users to start and stop components at the individual and group levels.
- NiFi enables users to pull data from various sources into NiFi and create FlowFiles from it.
- It is designed to scale out in clusters and provides guaranteed delivery of data.
- It lets you visualize and monitor performance and behavior via flow bulletins, which offer inline insight and documentation.
Features
of Apache NiFi
The features of Apache NiFi
are as follows:
- Apache NiFi offers a web-based User Interface that provides a seamless experience of design, monitoring, control, and feedback.
- It provides a data provenance module that helps track and monitor data from the source to the destination of the data flow.
- Developers can create their own customized processors and reporting tasks as required.
- It supports troubleshooting and flow optimization.
- It enables rapid and effective development and testing.
- It provides content encryption and communication over secure protocols.
- It supports buffering of all queued data and can apply backpressure as queues reach specified limits.
- Apache NiFi provides system-to-system, user-to-system, and multi-tenant authentication and security features.
Apache NiFi Architecture
Apache NiFi Architecture includes a web server,
flow controller, and processor that runs on a Java Virtual Machine (JVM).
It has three repositories: the FlowFile
Repository, the Content Repository, and the Provenance Repository.
- Web Server
Web Server is used to host the HTTP-based command
and control API.
- Flow Controller
The flow controller is the brain of the operation.
It provides threads for extensions to run on and manages the schedule of when the
extensions receive resources to run.
- Extensions
Several types of NiFi extensions are described in
other documents. Extensions operate and execute within the JVM.
- FlowFile Repository
The FlowFile Repository holds the current state
and attributes of each FlowFile passing through NiFi's data flow.
It keeps track of the state of everything currently active in the
flow. The standard approach is a continuous Write-Ahead Log
located on a specified disk partition.
- Content Repository
The Content Repository is used to store all the
data present in the FlowFiles. The default approach is a fairly simple
mechanism that stores blocks of data in the file system.
To reduce contention on any single volume, you can
specify more than one file-system storage location across different partitions.
- Provenance Repository
The Provenance Repository is where all
provenance event data is stored. The repository construct is pluggable;
the default implementation uses one or more physical disk volumes.
Within each location, event data is indexed and
searchable.
From the NiFi 1.0 version, a Zero-Leader Clustering pattern is incorporated. Every node in the cluster executes similar tasks on the data but operates on a different set of data.
Apache Zookeeper picks a single node as a Cluster
Coordinator. The Cluster Coordinator is used for connecting and disconnecting
nodes. Also, every cluster has one Primary Node.
Key
concepts of Apache NiFi
The key concepts of Apache NiFi are as follows:
- Flow: Flow is created to connect different
processors to share and modify data that is required from one data source
to another destination.
- Connection: Connection is used to connect the processors that act as a queue
to hold the data in a queue when required. It is also known as a bounded
buffer in Flow-based programming (FBP) terms. It allows several processes
to interact at different rates.
- Processors: A processor is a Java module used to fetch data
from a source system or store it in a destination system. Several
processors can be used to add attributes or modify the content of a
FlowFile. Processors are responsible for sending, merging, routing, transforming,
processing, creating, splitting, and receiving FlowFiles.
- FlowFile: The FlowFile is the basic unit of data in NiFi, representing a single
object of data taken from a source system. Users can modify a FlowFile
as it moves from the source processor to the destination.
Various events, such as Create, Receive, and Clone,
are performed on a FlowFile by different processors in the flow.
- Event: An
event represents a change to a FlowFile as it traverses the NiFi
flow. These events are tracked in the data provenance repository.
- Data provenance: Data provenance is a repository that allows users to inspect each
FlowFile's history and helps with troubleshooting if any issues
arise while processing a FlowFile.
- Process
group: The
process group is a set of processes and their respective connections that
can receive data from the input port and send it through output ports.
An input port is used to receive data from components that are outside the process group. Dragging the Input Port icon onto the canvas adds an input port to the dataflow.
An output port is used to send data to components that are outside the process group. Dragging the Output Port icon onto the canvas adds an output port.
The Process Group icon adds a process group to the NiFi canvas. When the icon is dragged onto the canvas, you enter the Process Group name, and the group is then added to the canvas.
The Funnel is used to combine data from several connections into a single connection. Users can drag the Funnel icon onto the canvas to add a funnel to the dataflow.
The Remote Process Group icon allows adding a Remote Process Group to the NiFi
canvas.
Template
The Template icon adds a dataflow template to the NiFi canvas, letting you reuse a data flow in the same or a different instance.
After dragging the icon, you can select an
existing template for the data flow.
Label
Labels are used to add text to the NiFi canvas describing any component in the flow. Labels can be given colors by the user to improve the readability of the canvas.
Processors
Categorization in Apache NiFi
The following are the process categorization of
Apache NiFi.
- AWS Processors
AWS processors are responsible for communicating
with the Amazon Web Services ecosystem. Processors in this category include PutSNS,
FetchS3Object, GetSQS, PutS3Object, etc.
- Attribute Extraction Processors
Attribute Extraction processors are responsible for
extracting, changing, and analyzing FlowFile attributes processing in the NiFi
data flow.
Examples are ExtractText, EvaluateJSONPath,
AttributeToJSON, UpdateAttribute, etc.
- Database Access Processors
The Database Access processors are used to select
or insert data, or to prepare and execute other SQL statements, against a database.
These processors use the database connection controller
services of Apache NiFi. Examples are PutSQL, ListDatabaseTables, ExecuteSQL,
PutDatabaseRecord, etc.
- Data Ingestion Processors
The Data Ingestion processors are used to ingest
data into the data flow; they typically act as the starting point of a data flow
in Apache NiFi. Examples are GetFile, GetFTP, GetKafka, GetHTTP, etc.
- Data Transformation Processors
Data Transformation processors are used for
altering the content of the FlowFiles.
These can be used, for example, to replace the content of a
FlowFile before sending it to an HTTP endpoint via an HTTP-invoking
processor. Examples are JoltTransformJSON, ReplaceText, etc.
- HTTP Processors
The HTTP processors work with the HTTP and HTTPS
calls. Examples are InvokeHTTP, ListenHTTP, PostHTTP, etc.
- Routing and Mediation Processors
Routing and Mediation processors route
FlowFiles to different processors or relationships depending on the information in
the FlowFiles' attributes or content.
They are responsible for controlling NiFi data
flows. Examples are RouteOnContent, RouteText, RouteOnAttribute, etc.
- Sending Data Processors
Sending Data processors are typically the final processors in
a data flow. They are responsible for storing or sending data to its
destination.
After successfully sending the data, the processor drops the
FlowFile via a success relationship. Examples are PutKafka, PutFTP,
PutSFTP, PutEmail, etc.
- Splitting and Aggregation Processors
The Splitting and Aggregation processors are used
to split and merge the content available in the Dataflow. Examples are
SplitXML, SplitJSON, SplitContent, MergeContent, etc.
- System Interaction Processors
The System Interaction processors are used to run
commands or processes in an operating system, and can also run scripts in
various languages.
Examples are ExecuteScript, ExecuteStreamCommand,
ExecuteGroovyScript, ExecuteProcess, etc.
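The split and merge categories above can be illustrated conceptually: SplitJson-style processors break one FlowFile into many, and MergeContent-style processors recombine several into one. The following is a minimal sketch of the idea, not NiFi's implementation:

```python
import json

def split_json(content):
    """Split a JSON array into one content string per element (SplitJson-style)."""
    return [json.dumps(item) for item in json.loads(content)]

def merge_content(pieces, demarcator="\n"):
    """Concatenate several contents into one, joined by a demarcator (MergeContent-style)."""
    return demarcator.join(pieces)

parts = split_json('[{"id": 1}, {"id": 2}]')
assert len(parts) == 2
assert json.loads(parts[0]) == {"id": 1}
merged = merge_content(parts)
assert merged.count("\n") == 1   # two pieces joined by one demarcator
```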
Writing custom processors lets you perform different
operations to transform FlowFile content according to specific needs.
Interviewers ask this question to gauge your knowledge of using processors
in NiFi. Mention the steps you use to create a custom processor.
The Bulletin is a NiFi UI feature that surfaces information about
notable occurrences, letting developers avoid digging through log messages to find
errors. Interviewers ask this question to understand the potential benefits of
bulletins. When answering, explain what a bulletin is and how it benefits developers.
| NiFi Term | FBP Term | Description |
|---|---|---|
| FlowFile | Information Packet | A FlowFile represents each object moving through the system. For each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes. |
| FlowFile Processor | Black Box | Processors actually perform the work. In [eip] terms, a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll it back. |
| Connection | Bounded Buffer | Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enables back pressure. |
| Flow Controller | Scheduler | The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors. |
| Process Group | Subnet | A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow creation of entirely new components simply by composition of other components. |
You can create a custom processor in NiFi by extending the
AbstractProcessor class and overriding the onTrigger method. In the onTrigger
method, you will need to implement the logic for your processor. You can access
the NiFi FlowFile object to read and write data, and you can also use the NiFi
ProcessContext object to access properties and variables.
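The onTrigger cycle described above, get a FlowFile from the session, work on it, and transfer it to a relationship, can be sketched in Python with stand-in stubs for the NiFi classes. This is a conceptual illustration of the pattern only; a real custom processor extends AbstractProcessor in Java and runs inside NiFi:

```python
class StubSession:
    """Minimal stand-in for NiFi's ProcessSession (illustration only)."""
    def __init__(self, flowfiles):
        self.incoming = list(flowfiles)
        self.transferred = []   # (flowfile, relationship) pairs

    def get(self):
        return self.incoming.pop(0) if self.incoming else None

    def transfer(self, flowfile, relationship):
        self.transferred.append((flowfile, relationship))

class UppercaseProcessor:
    """Custom-processor sketch: the onTrigger-style method does the work."""
    def on_trigger(self, session):
        flowfile = session.get()
        if flowfile is None:
            return              # nothing queued: do no work this trigger
        try:
            flowfile["content"] = flowfile["content"].upper()  # transform content
            session.transfer(flowfile, "success")
        except Exception:
            session.transfer(flowfile, "failure")

session = StubSession([{"content": "hello"}])
UppercaseProcessor().on_trigger(session)
assert session.transferred == [({"content": "HELLO"}, "success")]
```

The success/failure relationship split mirrors how real NiFi processors route FlowFiles to downstream connections depending on the outcome of the work.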
11. What is provenance in the context of NiFi? Why is it important?
Provenance is the history of a given piece of data, and it is
important in NiFi because it allows you to track where data came from and how
it has been processed. This is useful for debugging purposes, as well as for
understanding the data flow through a NiFi system.
12. What information is captured by the NiFi Provenance Repository?
The NiFi Provenance Repository
captures information about the dataflow through NiFi, including which data
was processed and which processors, connections, and parameters were
involved.
14. What is a Connection Queue in NiFi?
A Connection Queue is a queue of FlowFiles waiting to
be processed by the downstream processor.
15. Can you tell me about the process used by NiFi to handle back pressure?
Back pressure is the name given to the process of slowing down
or stopping the flow of data through a system when that system is becoming
overwhelmed. This is done in order to prevent the system from becoming
overloaded and crashing. NiFi uses a back pressure mechanism to automatically
control the flow of data through the system in order to prevent data loss.
16. Is it possible to run NiFi as a cluster? If yes, then how?
Yes, it is possible to run NiFi
as a cluster. In order to do so, you will need to start up multiple NiFi
instances and then configure them to work together as a cluster. The specifics
of how to do this will vary depending on your particular environment and setup.
FlowFileAttributes are metadata associated with a FlowFile,
while FlowFileContent is the actual data contained in the FlowFile.